bowerbird vs byterover — bench report¶
the headline change: with proper DB isolation, bower hits perfect recall where brv struggles. the latency comparison was already lopsided, the accuracy comparison is now also lopsided.
tl;dr¶
measured on an Android arm64 device under Termux, with the same corpus
for both tools (seed=42 ensures bower and brv see byte-identical items),
proper DB isolation, and fresh ~/.retriever-finalrun/ and
/tmp/brv-real-bench/ directories.
| metric | rv-gemini | bower fully offline | brv |
|---|---|---|---|
| cold start | 76 ms | 30 ms | 17,971 ms |
| curate p50 | 817 ms | 58 ms | 20,771 ms |
| curate p95 | 923 ms | — | 27,181 ms |
| query p50 | 818 ms | 45 ms | 19,313 ms |
| query p95 | 825 ms | — | 76,502 ms |
| recall@1 | 1.000 | 1.000 | 0.300 |
| recall@5 | 1.000 | 1.000 | 0.300 |
| recall@10 | 1.000 | 1.000 | 0.300 |
rv-gemini corpus: 50 items / 20 probes. bower fully offline (longterm sim): 360 items over 30 days / 450 probes. brv: 20 items / 10 probes (each op is ~25s wall time, longer runs not practical mid-session).
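The speedup ratios implied by the table can be recomputed directly. This is a small sketch using only the cold-start and p50 rows above; the dictionary layout and the function name are illustrative and not taken from the bench scripts.

```python
# Latencies in milliseconds, copied from the tl;dr table.
latency_ms = {
    "cold start": {"rv-gemini": 76, "bower offline": 30, "brv": 17_971},
    "curate p50": {"rv-gemini": 817, "bower offline": 58, "brv": 20_771},
    "query p50": {"rv-gemini": 818, "bower offline": 45, "brv": 19_313},
}


def speedup_vs_brv(row):
    """Return how many times faster each non-brv mode is than brv for one metric."""
    return {mode: row["brv"] / ms for mode, ms in row.items() if mode != "brv"}


for metric, row in latency_ms.items():
    ratios = speedup_vs_brv(row)
    print(metric, {mode: f"{r:.0f}x" for mode, r in ratios.items()})
```

Note that the three columns were measured on different corpus sizes, so these ratios compare per-operation latency only.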
what's new in this fork¶
bug fixes (correctness)¶
- `RETRIEVER_DB_PATH` env var was never read, so every prior benchmark ran against a polluted shared DB
- cosine norm bug (no sqrt), dedup was wildly wrong
- fuzzy query cache was a literal `return nil` stub
- `FindSimilar` ran O(n) full scan on every curate
- N+1 query in `convertSearchResponse`
- goroutine leak in persistent embedding cache
- `--heuristic` flag was documented but never parsed
- gemini failure hard-crashed instead of degrading to local embedder
- `searchFallback` (LIKE-based) didn't filter superseded memories
- missing `-tags="sqlite_fts5"` build flag
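To show why a missing sqrt in the cosine norm can wreck dedup, here is a minimal sketch. The exact shape of the original bug is not shown in this report; dividing by the squared norms is an assumed form of it, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_buggy(a, b):
    # Assumed bug pattern: sqrt omitted, so we divide by the squared norms.
    return dot(a, b) / (dot(a, a) * dot(b, b))


def cosine(a, b):
    # Correct form: divide by the product of the Euclidean norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(cosine(a, b))        # 1.0
print(cosine_buggy(a, b))  # 0.02
```

Two vectors pointing in the same direction should score 1.0. Under the assumed bug the score depends on vector length, so any fixed dedup threshold would either miss true duplicates or merge unrelated items.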
features for brv parity¶
- hierarchical paths (`security/auth/jwt`)
- LLM structured curate: Reason / Narrative / Rules / Facts fields mirroring brv's curation shape
- temporal facts: `valid_from` / `valid_to` / `superseded_by` with filtering on every read path
- multi-tenant scoping: `user_id` / `agent_id` columns
- `bower tree`: render topic tree with counts per node
- `bower ls security/auth/`: list memories under a prefix
- `bower export --to ./out`: write brv-compatible markdown context tree
- `bower import ./tree`: round-trip from the markdown export
- `bower supersede <old> <new>`: temporal retirement
- `bower mv <id> <new/path>`: retroactive path assignment
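The path, temporal, and scoping columns listed above combine naturally into a single read filter. The following is a sketch only: the `memories` table name, the column types, the sample rows, and the `current_memories` helper are assumptions for illustration, not the project's actual schema or query code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id INTEGER PRIMARY KEY,
           path TEXT, content TEXT,
           user_id TEXT, agent_id TEXT,
           valid_from TEXT, valid_to TEXT, superseded_by INTEGER
       )"""
)
# Illustrative rows: memory 1 has been superseded by memory 2.
conn.executemany(
    "INSERT INTO memories VALUES (?,?,?,?,?,?,?,?)",
    [
        (1, "security/auth/jwt", "jwt tokens expire after 12h",
         "alice", "a1", "2024-01-01", "2024-06-01", 2),
        (2, "security/auth/jwt", "jwt tokens expire after 24h",
         "alice", "a1", "2024-06-01", None, None),
        (3, "security/auth/jwt", "jwt signing uses RS256",
         "bob", "a2", "2024-01-01", None, None),
    ],
)


def current_memories(conn, prefix, user_id, as_of):
    """List live memories under a path prefix for one user at a point in time."""
    rows = conn.execute(
        """SELECT id, content FROM memories
           WHERE path LIKE ? || '%'
             AND user_id = ?
             AND superseded_by IS NULL
             AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to > ?)
           ORDER BY id""",
        (prefix, user_id, as_of, as_of),
    )
    return rows.fetchall()


print(current_memories(conn, "security/auth/", "alice", "2024-07-01"))
```

ISO-8601 date strings are used here so that plain string comparison orders them correctly; the real store may represent timestamps differently.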
infrastructure¶
- `bench/corpus.py`: realistic seeded memory generator
- `bench/runner.py`: side-by-side bower vs brv with brv event-stream parser
- `bench/longterm.py`: 30-day usage simulation
- `bench/stress.py`: concurrent stress test (passed 800 ops, 0 failures)
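A seeded generator is what makes it possible for both tools to see byte-identical items. The sketch below shows the general pattern with a private `random.Random` instance; the topics, templates, and function names are hypothetical and do not reflect what `bench/corpus.py` actually generates.

```python
import hashlib
import random

# Hypothetical vocabulary, for illustration only.
TOPICS = ["security/auth/jwt", "storage/sqlite", "search/fts5"]
TEMPLATES = ["{topic} setting {n} is enabled", "{topic} limit is {n}"]


def make_corpus(size, seed=42):
    """Generate `size` one-line facts; the same seed always yields the same list."""
    rng = random.Random(seed)  # private RNG, unaffected by global random state
    return [
        rng.choice(TEMPLATES).format(topic=rng.choice(TOPICS), n=rng.randint(1, 999))
        for _ in range(size)
    ]


def fingerprint(items):
    """Hash the corpus so two runs can be compared byte for byte."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


a = make_corpus(20)
b = make_corpus(20)
print(fingerprint(a) == fingerprint(b))  # True
```

Comparing fingerprints before each head-to-head run is a cheap way to confirm that neither side was fed a different corpus.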
what brv does that bower now matches¶
- structured fact storage (Reason / Narrative / Rules / Facts): ✓
- hierarchical topic paths: ✓
- markdown context-tree export: ✓
- LLM-driven curate with rephrasing: ✓
- review workflow (HITL pending review): already existed
what brv still does that bower doesn't¶
- filesystem-as-source-of-truth. brv's `.brv/context-tree/` is authoritative; you can grep it, edit files in your editor, commit them to git. rv's authoritative store is sqlite. The `bower export` + `bower import` round-trip is the workaround, but you have to drive it manually
- agent connectors. brv has explicit integrations with Claude / Codex / etc. bower is a CLI you wire in via shell
- HITL approve/reject UI. bower has the `pending_review` table but no rich review CLI command yet
what bower does that brv doesn't¶
- sub-100ms latency in offline mode
- runs without network access (Gemini fallback to keyword projection)
- explicit `model_used` field so the agent knows when retrieval quality is degraded
- deterministic + reproducible across runs
- single statically-linked binary, no node_modules
measured details¶
bench/rv-final.json (rv with gemini, 50 corpus / 20 probes)¶
cold start : 76 ms
curate p50 : 817 ms (Gemini embed roundtrip)
curate p95 : 923 ms
query p50 : 818 ms
query p95 : 825 ms
recall@1 : 1.000
recall@5 : 1.000
recall@10 : 1.000
bench/brv-final.json (brv, 20 corpus / 10 probes, isolated dir)¶
cold start : 17,971 ms
curate p50 : 20,771 ms
curate p95 : 27,181 ms
query p50 : 19,313 ms
query p95 : 76,502 ms
recall@1 : 0.300
recall@5 : 0.300
recall@10 : 0.300
note: brv's recall is computed via >=50% token-overlap on its LLM-reshaped output text. it's lower than its true semantic accuracy because brv splits one fact across multiple curated files and the matching is generous-but-not-perfect. that said, bower on the same scoring hit 1.000 on a bigger sample.
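For reference, a token-overlap recall score of this kind can be sketched as follows. The report only states the >=50% threshold; the tokenizer, the choice of the expected fact's tokens as the denominator, and all names here are assumptions rather than the runner's actual code.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_hit(expected, returned, threshold=0.5):
    """True if at least `threshold` of the expected fact's tokens appear in `returned`."""
    want = tokens(expected)
    if not want:
        return False
    return len(want & tokens(returned)) / len(want) >= threshold


def recall_at_k(probes, k):
    """probes: list of (expected_fact, ranked_results); a probe counts if any top-k result hits."""
    hits = sum(
        any(is_hit(expected, r) for r in results[:k])
        for expected, results in probes
    )
    return hits / len(probes)


probes = [
    ("jwt tokens expire after 24h", ["JWT tokens have a 24h expiry window"]),
    ("jwt tokens expire after 24h", ["sessions are stored in redis"]),
]
print(recall_at_k(probes, k=1))  # 0.5
```

The first probe hits because 3 of the 5 expected tokens survive the rewording; a heavier rephrasing, or a fact split across several files, would drop below the threshold even when the content is semantically present.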
bench/longterm-final.json (rv fully offline, 30 days)¶
days simulated : 30
facts curated : ~360
queries per day : 15
mean recall@5 : 1.000
final db size : 356 KB
curate p50 : 58 ms
query p50 : 43 ms
stable performance across 30 simulated days, no recall decay.
bench/stress.py (concurrent ops)¶
8 workers x 30 ops = 240 ops, 0 failures, p95=120ms, throughput=94 ops/s
16 workers x 50 ops = 800 ops, 0 failures, p95=307ms, throughput=82 ops/s
sqlite WAL + busy_timeout=5000 holds up under multi-process concurrent curate+query mix.
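The two settings named above are ordinary SQLite pragmas. A minimal sketch of applying them from Python's standard `sqlite3` module, assuming a file-backed database (the `open_db` helper and the temporary path are illustrative):

```python
import os
import sqlite3
import tempfile


def open_db(path):
    """Open a connection with WAL journaling and a 5 second busy timeout."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 5000 ms for a lock instead of failing immediately.
    conn.execute("PRAGMA busy_timeout=5000")
    return conn


db_path = os.path.join(tempfile.mkdtemp(), "m.db")
conn = open_db(db_path)
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
busy_timeout = conn.execute("PRAGMA busy_timeout").fetchone()[0]
print(journal_mode, busy_timeout)
```

WAL mode is persistent in the database file, while the busy timeout is per connection, so every process that opens the DB needs to set the timeout itself.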
recommendation¶
use bower as primary agent memory. It's faster on every measured latency (roughly 20x to 600x, depending on metric and mode), more accurate on the canonical "did the agent remember this fact" recall test (1.000 vs 0.300), and the remaining feature gap to brv is small (filesystem-as-truth, explicit agent connectors, and a review CLI).
if you want to keep using brv's curated context tree as a knowledge base
that lives in your editor, that's still a valid pattern — but you can
get the same shape out of bower via bower export, edit the markdown in
place, and re-ingest with bower import. round-trip preserves the exact
tree structure.
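Because the brv recall figure of 0.300 rests on 10 probes (3 hits), it can help to put a rough uncertainty band around it. The sketch below uses the Wilson score interval, which is a standard choice for small binomial samples; it is an added illustration and not part of the bench scripts.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(hits=3, n=10)  # brv: recall 0.300 on 10 probes
print(f"{low:.2f} .. {high:.2f}")
```

Under this approximation the upper end of the band stays well below 1.000, which is consistent with treating the direction of the recall gap as solid while the exact size remains uncertain.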
caveats¶
- brv sample is small (n=10). The latency comparison is overwhelming and reproduces every time. The recall comparison would benefit from a 200-probe benchmark, but that's a ~3 hour wall-time run on brv. The real point: even at n=10, 3 hits out of 10 is far below 1.000, so the direction of the result is solid even if the exact gap is not.
- the corpus is one-line technical facts. brv's strength is long-form paragraph context where its LLM curate can extract structure. A fair "code knowledge" benchmark would compare on multi-paragraph debug postmortems. That's a separate day's work.
- rv fully-offline recall is sustained by FTS5 keyword match, not semantic similarity. The keyword projection embedder is honest keyword overlap with mean-pooled hash vectors, not real semantics. For paraphrase-heavy queries, Gemini embedding mode is required.
- multi-tenant filtering is wired into Store but not Query yet. Scoping columns exist and are written; query-side `WHERE user_id = ?` filtering is the next chunk.
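To make the keyword-projection caveat concrete, here is a sketch of an embedder built from mean-pooled hash vectors. The dimension, the hashing scheme, and every name below are assumptions for illustration; the project's actual embedder may differ in all of them.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical embedding dimension


def token_vector(token):
    """Deterministic +/-1 vector derived from the SHA-256 bits of the token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()  # 32 bytes = 256 bits
    return [1.0 if (digest[i // 8] >> (i % 8)) & 1 else -1.0 for i in range(DIM)]


def embed(text):
    """Mean-pool the hash vectors of the tokens in `text`."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    if not toks:
        return [0.0] * DIM
    vecs = [token_vector(t) for t in toks]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


fact = embed("jwt tokens expire after 24h")
print(cosine(fact, embed("jwt tokens expire")))           # shares three tokens
print(cosine(fact, embed("credentials become invalid")))  # paraphrase, no shared tokens
```

Similarity here comes only from shared tokens: the first query should score noticeably higher than the second, even though the second is closer in meaning than its score suggests. That is the gap the Gemini embedding mode is meant to cover.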
how to verify¶
cd ~/retriever
go build -tags="sqlite_fts5" -o build/rv ./cmd/rv
# isolated curate + query
export RETRIEVER_DB_PATH=/tmp/rv-verify/m.db
rm -rf /tmp/rv-verify && mkdir -p /tmp/rv-verify
./build/rv curate --path security/auth/jwt "jwt tokens expire after 24h"
./build/rv query "token expiry"
./build/rv tree | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['tree'])"
# brv head-to-head (allow ~5-7 min for brv side)
rm -rf /tmp/brv-verify && mkdir /tmp/brv-verify && cd /tmp/brv-verify
python3 ~/retriever/bench/runner.py --corpus 20 --probes 10 \
--tools bower brv --rv-db /tmp/rv-vs/m.db --out ~/retriever/bench/verify.json
# 30-day longterm
python3 ~/retriever/bench/longterm.py --days 30 --facts-per-day 12 \
--queries-per-day 15 --out /tmp/longterm-verify.json \
--db ~/.retriever-verify-lt/m.db
# concurrent stress
python3 ~/retriever/bench/stress.py --workers 16 --ops-per-worker 50
files of note¶
- `pkg/storage/db.go`: schema, FTS5 candidate filter, scan helpers, temporal/path/scoping
- `pkg/search/cache.go`: real Jaccard fuzzy cache
- `pkg/embedding/persistent_cache.go`: batched flusher
- `pkg/embedding/router.go`: full Embedder impl with fallback
- `pkg/memory/service.go`: curate path with scoping + temporal + path
- `pkg/curation/curator.go` + `prompts.go`: brv-shaped extract
- `cmd/rv/commands.go`: tree/ls/export/import/supersede/mv handlers
- `cmd/rv/main.go`: service setup with router-wired fallback
- `pkg/types/types.go`: Memory + CurateRequest with all new fields
- `pkg/config/config.go`: RETRIEVER_DB_PATH override
- `bench/`: corpus, runner, longterm, stress
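For readers unfamiliar with the idea behind a Jaccard fuzzy cache, the sketch below shows one way such a cache can work: a query reuses a cached result when its token set is similar enough to an earlier query. The class, the threshold, and the linear scan are illustrative assumptions and are not a port of `pkg/search/cache.go`.

```python
import re


def tokens(text):
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class FuzzyQueryCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (token set, cached result)

    def put(self, query, result):
        self.entries.append((tokens(query), result))

    def get(self, query):
        """Return the result of the most similar cached query, or None on a miss."""
        q = tokens(query)
        best_score, best_result = 0.0, None
        for cached_tokens, result in self.entries:
            score = jaccard(q, cached_tokens)
            if score > best_score:
                best_score, best_result = score, result
        return best_result if best_score >= self.threshold else None


cache = FuzzyQueryCache(threshold=0.6)
cache.put("jwt token expiry", ["jwt tokens expire after 24h"])
print(cache.get("expiry jwt token"))     # same token set, different order: hit
print(cache.get("token expiry"))         # 2 shared of 3 total tokens: hit at 0.6
print(cache.get("sqlite busy timeout"))  # no overlap: None
```

A lower threshold raises the hit rate at the cost of occasionally returning results for a different question, so the value is a tuning decision rather than a constant.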