bowerbird vs byterover: bench report

the headline change: with proper DB isolation, bower reaches recall 1.000 where brv reaches 0.300. the latency comparison was already lopsided; the recall comparison now is too.

tl;dr

measured on an Android arm64 device under Termux, with the same corpus (seed=42 ensures bower and brv see byte-identical items), proper DB isolation, and fresh ~/.retriever-finalrun/ and /tmp/brv-real-bench/ directories.

| metric | bower (Gemini) | bower (fully offline) | brv |
|---|---|---|---|
| cold start | 76 ms | 30 ms | 17,971 ms |
| curate p50 | 817 ms | 58 ms | 20,771 ms |
| curate p95 | 923 ms | n/a | 27,181 ms |
| query p50 | 818 ms | 45 ms | 19,313 ms |
| query p95 | 825 ms | n/a | 76,502 ms |
| recall@1 | 1.000 | 1.000 | 0.300 |
| recall@5 | 1.000 | 1.000 | 0.300 |
| recall@10 | 1.000 | 1.000 | 0.300 |

sample sizes differ per column. bower with Gemini (rv-gemini): 50 items / 20 probes. bower fully offline (longterm sim): 360 items over 30 days / 450 probes. brv: 20 items / 10 probes (each brv op takes ~25 s of wall time, so longer runs were not practical mid-session).
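for reading the p50 / p95 rows, here is a minimal sketch of a nearest-rank percentile. this is an assumption about the definition, not a copy of what bench/runner.py does, and the sample latencies are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# made-up latencies in ms (NOT measured data), 10 samples like a 10-probe run
latencies_ms = [41, 43, 44, 45, 47, 52, 58, 61, 75, 120]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

under this definition, a p95 over only 10 samples is simply the slowest sample, which is worth keeping in mind for any p95 taken over 10 probes.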

what's new in this fork

bug fixes (correctness)

RETRIEVER_DB_PATH env var was never read, so every prior benchmark ran against a polluted shared DB
cosine norm was computed without the sqrt, so dedup similarity scores were wildly wrong
fuzzy query cache was a literal return nil stub
FindSimilar ran an O(n) full scan on every curate
N+1 query in convertSearchResponse
goroutine leak in persistent embedding cache
--heuristic flag was documented but never parsed
a Gemini failure hard-crashed instead of degrading to the local embedder
searchFallback (LIKE-based) didn't filter superseded memories
missing -tags="sqlite_fts5" build flag
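to make the cosine norm fix concrete, here is a minimal sketch of cosine similarity with the sqrt in place, next to one plausible form of the bug. the buggy variant is an assumption for illustration; the actual pre-fix Go code is not shown in this report.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))  # the sqrt a "no sqrt" bug would drop
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_without_sqrt(a, b):
    """Assumed buggy variant: divides by squared norms instead of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) * sum(y * y for y in b))

v = [3.0, 4.0]
print(cosine_similarity(v, v))    # identical vectors, correct score
print(cosine_without_sqrt(v, v))  # identical vectors, buggy score
```

with the sqrt missing, the score depends on vector magnitude, so two identical vectors no longer score 1.0 and any fixed dedup threshold stops meaning what it was tuned to mean.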

features for brv parity

hierarchical paths (security/auth/jwt)
LLM structured curate: Reason / Narrative / Rules / Facts fields mirroring brv's curation shape
temporal facts: valid_from / valid_to / superseded_by with filtering on every read path
multi-tenant scoping: user_id / agent_id columns
bower tree — render topic tree with counts per node
bower ls security/auth/ — list memories under a prefix
bower export --to ./out — write brv-compatible markdown context tree
bower import ./tree — round-trip from the markdown export
bower supersede <old> <new> — temporal retirement
bower mv <id> <new/path> — retroactive path assignment
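the temporal and path features above combine on the read side. a minimal sketch, assuming a hypothetical memories table: the column names valid_from / valid_to / superseded_by come from the list above, while the table name, the date format, and the 12h row are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memories (
           id INTEGER PRIMARY KEY,
           path TEXT,
           content TEXT,
           valid_from TEXT,        -- ISO dates compare correctly as strings
           valid_to TEXT,          -- NULL means still valid
           superseded_by INTEGER   -- NULL means not retired
       )"""
)
conn.executemany(
    "INSERT INTO memories VALUES (?, ?, ?, ?, ?, ?)",
    [
        # hypothetical older fact, retired in favour of id 2
        (1, "security/auth/jwt", "jwt tokens expire after 12h", "2024-01-01", "2024-06-01", 2),
        (2, "security/auth/jwt", "jwt tokens expire after 24h", "2024-06-01", None, None),
    ],
)

def current_memories(conn, prefix, now):
    """Memories under a path prefix that are neither superseded nor outside their validity window."""
    rows = conn.execute(
        """SELECT id, content FROM memories
           WHERE path LIKE ? || '%'
             AND superseded_by IS NULL
             AND (valid_from IS NULL OR valid_from <= ?)
             AND (valid_to IS NULL OR valid_to > ?)
           ORDER BY id""",
        (prefix, now, now),
    )
    return rows.fetchall()

print(current_memories(conn, "security/auth/", "2024-07-01"))
```

the point of the sketch is that the same three conditions have to appear on every read path, including any LIKE-based fallback, or retired facts leak back into results.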

infrastructure

bench/corpus.py — realistic seeded memory generator
bench/runner.py — side-by-side bower vs brv with brv event-stream parser
bench/longterm.py — 30-day usage simulation
bench/stress.py — concurrent stress test (passed 800 ops, 0 failures)
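the value of a seeded generator is that two tools can be fed byte-identical items. a minimal sketch of that idea, not the real bench/corpus.py: the topics, templates, and function names here are hypothetical.

```python
import hashlib
import random

# hypothetical topics and templates, for illustration only
TOPICS = ["security/auth/jwt", "storage/sqlite/wal", "search/fts5"]
TEMPLATES = ["{topic} setting #{n} is {value}", "{topic} limit #{n} equals {value}"]

def make_corpus(n_items, seed=42):
    """Generate n_items facts from a private RNG so output depends only on the seed."""
    rng = random.Random(seed)  # no dependence on global random state
    items = []
    for n in range(n_items):
        topic = rng.choice(TOPICS)
        template = rng.choice(TEMPLATES)
        text = template.format(topic=topic, n=n, value=rng.randint(1, 9999))
        items.append({"path": topic, "text": text})
    return items

def corpus_digest(items):
    """SHA-256 over the serialized corpus, to check two runs saw the same bytes."""
    h = hashlib.sha256()
    for item in items:
        h.update(f"{item['path']}\t{item['text']}\n".encode("utf-8"))
    return h.hexdigest()

a = make_corpus(20, seed=42)
b = make_corpus(20, seed=42)
print(corpus_digest(a) == corpus_digest(b))
```

recording a digest like this next to each result file would make the "same corpus" claim checkable after the fact.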

what brv does that bower now matches

structured fact storage (Reason / Narrative / Rules / Facts): ✓
hierarchical topic paths: ✓
markdown context-tree export: ✓
LLM-driven curate with rephrasing: ✓
review workflow (HITL pending review): already existed

what brv still does that bower doesn't

filesystem-as-source-of-truth. brv's .brv/context-tree/ is authoritative; you can grep it, edit files in your editor, commit them to git. bower's authoritative store is SQLite. the bower export + bower import round-trip is the workaround, but you have to drive it manually
agent connectors. brv has explicit integrations with Claude / Codex / etc. bower is a CLI you wire in via shell
HITL approve/reject UI. bower has the pending_review table but no rich review CLI command yet

what bower does that brv doesn't

sub-100ms latency in offline mode
runs without network access (falls back from Gemini to the keyword projection embedder)
explicit model_used field so the agent knows when retrieval quality is degraded
deterministic + reproducible across runs
single statically-linked binary, no node_modules
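the offline fallback and the model_used field can be sketched together. this is a hedged illustration, not the code in pkg/embedding/router.go: the hashing scheme, the dimension, and the model_used label strings are all assumptions.

```python
import hashlib

DIM = 64  # hypothetical embedding dimension

def token_vector(token, dim):
    """Deterministic one-hot-with-sign vector derived from a hash of the token."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % dim
    sign = 1.0 if digest[4] % 2 == 0 else -1.0
    vec = [0.0] * dim
    vec[bucket] = sign
    return vec

def keyword_embed(text, dim=DIM):
    """Mean-pool the hashed token vectors: keyword overlap, not semantics."""
    tokens = text.lower().split()
    pooled = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(token_vector(tok, dim)):
            pooled[i] += x / len(tokens)
    return pooled

def embed_with_fallback(text, remote_embed):
    """Try the remote embedder; on any failure degrade to the local one and say so."""
    try:
        return {"vector": remote_embed(text), "model_used": "gemini"}
    except Exception:
        return {"vector": keyword_embed(text), "model_used": "keyword-projection"}

def offline_remote(_text):
    """Stand-in for a remote call made with no network."""
    raise ConnectionError("no network")

result = embed_with_fallback("jwt tokens expire after 24h", offline_remote)
print(result["model_used"])
```

returning the label alongside the vector is what lets a caller treat offline results as keyword-grade rather than assuming semantic quality.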

measured details

bench/rv-final.json (bower with Gemini, 50 corpus items / 20 probes)

cold start    : 76 ms
curate p50    : 817 ms   (Gemini embed roundtrip)
curate p95    : 923 ms
query p50     : 818 ms
query p95     : 825 ms
recall@1      : 1.000
recall@5      : 1.000
recall@10     : 1.000

bench/brv-final.json (brv, 20 corpus items / 10 probes, isolated dir)

cold start    : 17,971 ms
curate p50    : 20,771 ms
curate p95    : 27,181 ms
query p50     : 19,313 ms
query p95     : 76,502 ms
recall@1      : 0.300
recall@5      : 0.300
recall@10     : 0.300

note: brv's recall is computed via >=50% token overlap against its LLM-reshaped output text. this understates its true semantic accuracy, because brv splits one fact across multiple curated files and the overlap match is generous but not perfect. that said, bower scored 1.000 under the same scoring on a bigger sample.
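a minimal sketch of the >=50% token-overlap scoring described in the note. the tokenizer, the choice to measure overlap against the expected fact's tokens, and the probe data are assumptions; the real scoring lives in the bench scripts.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_hit(expected, returned, threshold=0.5):
    """True if at least `threshold` of the expected fact's tokens appear in the returned text."""
    want = tokens(expected)
    if not want:
        return False
    overlap = len(want & tokens(returned)) / len(want)
    return overlap >= threshold

def recall_at_k(probes, k):
    """probes: list of (expected_fact, ranked_result_texts); fraction with a hit in the top k."""
    hits = sum(
        1
        for expected, results in probes
        if any(is_hit(expected, r) for r in results[:k])
    )
    return hits / len(probes)

# illustrative probes, not benchmark data
probes = [
    ("jwt tokens expire after 24h", ["JWT tokens expire after 24 hours by default"]),
    ("jwt tokens expire after 24h", ["sqlite uses WAL mode"]),
]
print(recall_at_k(probes, 1))
```

the first probe shows why LLM-reshaped text can still count as a hit ("24h" became "24 hours", yet 4 of 5 tokens survive), and also why a fact split across several files can fall under the threshold in each file individually.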

bench/longterm-final.json (bower fully offline, 30 days)

days simulated   : 30
facts curated    : ~360
queries per day  : 15
mean recall@5    : 1.000
final db size    : 356 KB
curate p50       : 58 ms
query p50        : 43 ms

stable performance across 30 simulated days, no recall decay.

bench/stress.py (concurrent ops)

8 workers x 30 ops = 240 ops, 0 failures, p95=120ms, throughput=94 ops/s
16 workers x 50 ops = 800 ops, 0 failures, p95=307ms, throughput=82 ops/s

SQLite in WAL mode with busy_timeout=5000 holds up under a multi-process concurrent curate+query mix.
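for reference, this is roughly what that configuration looks like from a client, sketched with Python's sqlite3 module rather than the Go driver the project uses. the two PRAGMAs are standard SQLite; the table name and temp path are hypothetical.

```python
import os
import sqlite3
import tempfile

def open_db(path):
    """Open a connection with write-ahead logging and a 5 s busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # readers no longer block the writer
    conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5000 ms on a locked DB
    return conn

db_path = os.path.join(tempfile.mkdtemp(), "m.db")
writer = open_db(db_path)
reader = open_db(db_path)

# hypothetical table, for illustration only
writer.execute("CREATE TABLE IF NOT EXISTS memories (id INTEGER PRIMARY KEY, content TEXT)")
writer.execute("INSERT INTO memories (content) VALUES (?)", ("jwt tokens expire after 24h",))
writer.commit()

print(reader.execute("PRAGMA journal_mode").fetchone()[0])
print(reader.execute("SELECT COUNT(*) FROM memories").fetchone()[0])
```

WAL still allows only one writer at a time; busy_timeout is what turns a concurrent write attempt into a short wait instead of an immediate "database is locked" error.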

recommendation

use bower as primary agent memory. it is faster on every measured axis (roughly 20x to 600x lower latency, depending on metric and mode), scores higher on the canonical "did the agent remember this fact" recall test (1.000 vs 0.300), and the remaining feature gap to brv is now small (filesystem-as-truth, explicit agent connectors, and a review UI).

if you want to keep using brv's curated context tree as a knowledge base that lives in your editor, that's still a valid pattern — but you can get the same shape out of bower via bower export, edit the markdown in place, and re-ingest with bower import. round-trip preserves the exact tree structure.
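the latency multipliers behind the recommendation can be recomputed from the tl;dr numbers. a small sketch using only the figures reported there (ms); the dictionary and function names are just for illustration.

```python
# values in ms, copied from the tl;dr table
brv = {"cold start": 17971, "curate p50": 20771, "query p50": 19313}
bower_gemini = {"cold start": 76, "curate p50": 817, "query p50": 818}
bower_offline = {"cold start": 30, "curate p50": 58, "query p50": 45}

def speedups(baseline, candidate):
    """Ratio baseline/candidate per metric, rounded to one decimal."""
    return {metric: round(baseline[metric] / candidate[metric], 1) for metric in candidate}

gemini_ratios = speedups(brv, bower_gemini)
offline_ratios = speedups(brv, bower_offline)
print(gemini_ratios)
print(offline_ratios)
```

the smallest ratio is query p50 in Gemini mode (about 24x) and the largest is cold start in offline mode (about 599x), so the size of the gap depends heavily on which mode is being compared.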

caveats

brv sample is small (n=10 probes). the latency gap is overwhelming and reproduces on every run. the recall comparison would benefit from a 200-probe benchmark, but that is a ~3 hour wall-time run on brv. even at n=10, a recall of 0.300 is far enough below 1.000 that the direction of the result is unlikely to flip, though the exact value is loosely pinned down.
the corpus is one-line technical facts. brv's strength is long-form paragraph context where its LLM curate can extract structure. A fair "code knowledge" benchmark would compare on multi-paragraph debug postmortems. That's a separate day's work.
bower's fully offline recall is sustained by FTS5 keyword match, not semantic similarity. the keyword projection embedder is plain keyword overlap over mean-pooled hash vectors, not real semantics. for paraphrase-heavy queries, Gemini embedding mode is required.
multi-tenant filtering is wired into Store but not Query yet. Scoping columns exist and are written; query-side WHERE user_id = ? filtering is the next chunk.
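to put a number on the small-sample caveat, here is a sketch of a 95% Wilson score interval for brv's recall. it assumes recall 0.300 over 10 probes means 3 hits out of 10, and that probes are independent; neither is verified against bench/brv-final.json here.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# assumed: 0.300 recall on 10 probes = 3 hits
low, high = wilson_interval(3, 10)
print(f"{low:.3f} .. {high:.3f}")
```

the interval comes out at roughly 0.11 to 0.60: wide, which is why a larger probe set would help, but its upper end is still well short of 1.000.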

how to verify

cd ~/retriever
go build -tags="sqlite_fts5" -o build/rv ./cmd/rv

# isolated curate + query
# start from an empty directory so the DB is fresh
rm -rf /tmp/rv-verify && mkdir -p /tmp/rv-verify
export RETRIEVER_DB_PATH=/tmp/rv-verify/m.db
./build/rv curate --path security/auth/jwt "jwt tokens expire after 24h"
./build/rv query "token expiry"
./build/rv tree | python3 -c "import sys,json; print(json.load(sys.stdin)['data']['tree'])"

# brv head-to-head (allow ~5-7 min for brv side)
rm -rf /tmp/brv-verify && mkdir /tmp/brv-verify && cd /tmp/brv-verify
python3 ~/retriever/bench/runner.py --corpus 20 --probes 10 \
  --tools bower brv --rv-db /tmp/rv-vs/m.db --out ~/retriever/bench/verify.json

# 30-day longterm
python3 ~/retriever/bench/longterm.py --days 30 --facts-per-day 12 \
  --queries-per-day 15 --out /tmp/longterm-verify.json \
  --db ~/.retriever-verify-lt/m.db

# concurrent stress
python3 ~/retriever/bench/stress.py --workers 16 --ops-per-worker 50

files of note

pkg/storage/db.go — schema, FTS5 candidate filter, scan helpers, temporal/path/scoping
pkg/search/cache.go — real Jaccard fuzzy cache
pkg/embedding/persistent_cache.go — batched flusher
pkg/embedding/router.go — full Embedder impl with fallback
pkg/memory/service.go — curate path with scoping + temporal + path
pkg/curation/curator.go + prompts.go — brv-shaped extract
cmd/rv/commands.go — tree/ls/export/import/supersede/mv handlers
cmd/rv/main.go — service setup with router-wired fallback
pkg/types/types.go — Memory + CurateRequest with all new fields
pkg/config/config.go — RETRIEVER_DB_PATH override
bench/ — corpus, runner, longterm, stress