changes in this fork

3-tier query cache + agentic harness

production-grade cache rewrite informed by mem0 v3, Zep/Graphiti, Letta/MemGPT, GPTCache, and three production semantic-cache writeups (Vadim 2024, Banking Case Study 2024, Respan 2026).

bugs fixed:

GetFuzzy was implemented and tested, but Search() never called it. Tier 1 was dead code from the start; every paraphrase query ran a full hybrid search even when an existing entry would have matched with a Jaccard score of 1.0. Now wired in.
scope-filter gate accepted cross-scope hits. The old sameScope compared the rebuilt key against itself, so it returned true unconditionally. A query under user=bob could be served a response cached under user=alice. Every entry now stamps a filter-only signature, and fuzzy/semantic lookups gate on it; the harness scenario scope/alice-must-miss-bob-answer catches any regression.
no pollution defense: empty result sets and zero-score responses were being cached, slowly poisoning the lookup space. Set() now rejects them and bumps a stat counter.
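
The scope gate described above can be sketched roughly as follows. This is a minimal illustration, not the fork's actual code: `scopeSignature`, the `entry` struct, and the string format of the signature are all hypothetical, and the separator scheme assumes filter values contain no `;` or `=`.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// scopeSignature is a hypothetical helper: it folds only the filters and the
// limit (never the query text) into a deterministic string.
func scopeSignature(filters map[string]string, limit int) string {
	keys := make([]string, 0, len(filters))
	for k := range filters {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for stability

	var b strings.Builder
	b.WriteString("limit=" + strconv.Itoa(limit))
	for _, k := range keys {
		b.WriteString(";" + k + "=" + filters[k])
	}
	return b.String()
}

// entry is a simplified cache entry that stamps its scope when it is stored.
type entry struct {
	scope    string
	response string
}

// sameScope compares the signature stored on the entry with the signature of
// the incoming query. Comparing a rebuilt key with itself, as the buggy
// version did, is always true and therefore gates nothing.
func sameScope(e entry, filters map[string]string, limit int) bool {
	return e.scope == scopeSignature(filters, limit)
}
```

With an entry stamped under `user=alice`, a lookup under `user=bob` (or with a different limit) fails the gate and falls through to a miss.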

new features:

Tier 2 semantic cache: query embeddings stored per-entry, cosine match across cached entries. Catches synonym and reordering paraphrases that bypass Jaccard. Reuses the embedding already computed for the vector-index search → zero extra embedding cost on lookups.
3-zone confidence band (Vadim 2024): green ≥ 0.93 auto-serve, amber [0.88, 0.93) serve but stamp for FP review, red < 0.88 treated as miss. Threshold defaults calibrated against published BEIR/MS MARCO bi-encoder benchmarks.
per-tier TTL: T0 1h, T1 15m, T2 5m. Higher-FP-risk tiers expire faster (GPTCache TimeEvaluation pattern). Lazy eviction on Set, no background goroutine.
filter-aware cache keys: Limit and the Filters map are folded into both the exact key and the fuzzy/semantic scope gate, so two queries with identical text but different filters never collide.
CacheStats: atomic counters for lookups, hits per tier, amber hits, misses, rejected, evicted, invalidated. HitRate() for one-glance observability.
SampleHitsForReview(n): extract sampled hits for the production FP eval loop (the pattern every credible semantic-cache writeup recommends).
opts.NoCache: caller veto for one-off / debug queries.

agentic harness (pkg/search/agentic_harness_test.go):

TestAgenticHarnessFullCascade: scripted 10-turn agent conversation covering exact repeat, token-permutation, synonym paraphrase, cross-scope filter switch, and cold OOD queries. Each turn asserts the expected tier or miss.
TestAgenticHarnessFalsePositiveBudget: 30 seeded facts, 20 paraphrases, and 5 cold queries; asserts that the FP rate stays under 10% of total hits. Uses a synthetic bag-of-words embedder; real embeddings should do better.
TestAgenticHarnessFingerprintInvalidationUnderLoad: warms 20 entries, mutates fingerprint, verifies all are flushed and zero stale hits leak.
TestAgenticHarnessConcurrentReadWrite: runs 16 readers + 16 writers for 500ms and checks for torn state and panics. go test -race is unavailable on android/arm64, so this exercises the same code paths without the race detector; it cannot prove the absence of data races.
BenchmarkCacheTierLatency: on Termux Android arm64, 256-dim vectors, 100 entries: T0 ~21μs, T2 ~41μs, T1 ~72μs. T1 is the slowest because map intersection walks every entry; T2 is faster than T1 on small caches because dot products are branch-free and SIMD-friendly.
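
To make the T1 cost concrete, here is a rough sketch of a Jaccard lookup over pre-built token sets. It is not the fork's GetFuzzy; `tokenSet`, `jaccard`, and `bestFuzzy` are hypothetical names, and tokenization is assumed to be lowercase whitespace splitting. The scan touches every cached entry and probes a map per token, which is the per-entry work the benchmark numbers reflect.

```go
package main

import "strings"

// tokenSet lowercases the query and splits it on whitespace.
func tokenSet(q string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, tok := range strings.Fields(strings.ToLower(q)) {
		set[tok] = struct{}{}
	}
	return set
}

// jaccard returns |a ∩ b| / |a ∪ b|, or 0 when both sets are empty.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0
	}
	if len(a) > len(b) {
		a, b = b, a // iterate the smaller set, probe the larger one
	}
	inter := 0
	for tok := range a {
		if _, ok := b[tok]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

// bestFuzzy walks every cached token set and returns the index and score of
// the best match, or (-1, 0) when nothing overlaps.
func bestFuzzy(query string, cached []map[string]struct{}) (int, float64) {
	q := tokenSet(query)
	best, bestScore := -1, 0.0
	for i, set := range cached {
		if s := jaccard(q, set); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best, bestScore
}
```

A token permutation such as "signing keys jwt rotate" scores 1.0 against a cached "rotate jwt signing keys", which is the case Tier 1 is meant to catch.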

robustness pass + feature parity push against brv, plus filesystem-as-source-of-truth, so the markdown tree on disk is now authoritative.

filesystem-as-source-of-truth (the headline change)

new pkg/treestore/ package + Service.Reindex + bower reindex command. When RETRIEVER_STORAGE_DIR is set (or storage_dir in config.json), bower operates in filesystem-first mode:

Every curate writes a markdown file to disk first, then the sqlite row, then the in-memory vector index.
Every cold-start runs a stale-detection pass: walk the tree, compare each file's mtime against the tree_index manifest, apply drift (insert / update / delete) before serving the first query.
You can edit memories in vim/vscode/obsidian and the next query sees the change automatically.
You can rm a file and the memory disappears.
You can rm -rf the sqlite db and the next query rebuilds the entire index from disk in seconds.
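
The stale-detection pass can be pictured as a diff between two maps. The sketch below is only an illustration under assumptions: `detectDrift` and the `drift` struct are hypothetical, mtimes are modeled as Unix nanoseconds keyed by relative path, and any mtime mismatch (not only a newer one) counts as an update.

```go
package main

import "sort"

// drift lists the paths that need work before the first query is served.
type drift struct {
	Insert, Update, Delete []string
}

// detectDrift compares on-disk mtimes with the mtimes recorded in the
// manifest at the last index pass.
func detectDrift(manifest, disk map[string]int64) drift {
	var d drift
	for path, mtime := range disk {
		indexed, ok := manifest[path]
		switch {
		case !ok:
			d.Insert = append(d.Insert, path) // new file on disk
		case mtime != indexed:
			d.Update = append(d.Update, path) // edited since last index
		}
	}
	for path := range manifest {
		if _, ok := disk[path]; !ok {
			d.Delete = append(d.Delete, path) // file was removed
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(d.Insert)
	sort.Strings(d.Update)
	sort.Strings(d.Delete)
	return d
}
```

An empty manifest, as after deleting the sqlite db, turns every file on disk into an insert, which corresponds to the full rebuild case.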

File layout:

<root>/<path>/<slug>-<short-id>.md

with YAML frontmatter (id, type, path, importance, maturity, created, tags, etc.) and level-2 markdown sections (Reason / Raw Concept / Narrative / Rules / Facts). The id in frontmatter is stable across edits; the slug prefix on the filename is regenerated from the summary, so renaming files is also safe (id wins over filename for identity).
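
A sketch of the naming and identity rule, under stated assumptions: `slugify`, `fileName`, and `frontmatterID` are hypothetical helpers, the short id is assumed to be the first 8 characters of an ASCII id, and the frontmatter is scanned line by line instead of being parsed by a YAML library.

```go
package main

import (
	"strings"
	"unicode"
)

// slugify lowercases the summary and collapses every run of non-alphanumeric
// characters into a single dash.
func slugify(summary string) string {
	var b strings.Builder
	dash := false
	for _, r := range strings.ToLower(summary) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
			dash = false
		} else if !dash && b.Len() > 0 {
			b.WriteByte('-')
			dash = true
		}
	}
	return strings.TrimSuffix(b.String(), "-")
}

// fileName builds "<slug>-<short-id>.md".
func fileName(summary, id string) string {
	short := id
	if len(short) > 8 {
		short = short[:8] // assumed short-id length
	}
	return slugify(summary) + "-" + short + ".md"
}

// frontmatterID reads the id from the leading "---" block. Identity comes
// from here, so the filename can change freely.
func frontmatterID(doc string) (string, bool) {
	lines := strings.Split(doc, "\n")
	if strings.TrimSpace(lines[0]) != "---" {
		return "", false
	}
	for _, line := range lines[1:] {
		line = strings.TrimSpace(line)
		if line == "---" {
			break // end of frontmatter
		}
		if strings.HasPrefix(line, "id:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "id:")), true
		}
	}
	return "", false
}
```

Because lookups go through `frontmatterID`, a renamed file still resolves to the same memory on the next index pass.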

Verified end-to-end on a 5-memory corpus: edit in place → query auto-reindexes in ~700ms (gemini re-embed cost). Full DB nuke → next query rebuilds 5 memories in ~3s. rm <file>.md → memory gone in 2ms.

New commands:

bower reindex          # explicit reindex pass

New config:

storage_dir: /path/to/tree            (config.json)
RETRIEVER_STORAGE_DIR=/path/to/tree   (env override)

When storage_dir is empty, bower runs in the original sqlite-only mode with no behaviour change.

previous robustness pass

fixes for bugs that were silently degrading correctness, new commands that bring bower to brv shape parity, real benchmarks with proper isolation, and a few honest caveats called out at the bottom.

bugs fixed

RETRIEVER_DB_PATH was a fake env var. Never read by any Go code. Every benchmark and test before this fix ran against the same shared ~/.retriever/memory.db, polluting results. Now config.GetDBPath() checks the env var first.
cosine norm bug. Dedup divided the dot product by normA * normB, where the "norms" were actually squared L2 sums (no sqrt). Every cosine result was therefore divided by an extra ||a|| * ||b||, so only unit-length vectors scored correctly. Fixed in pkg/storage/db.go.
fuzzy query cache was a literal stub. Tier-1 fuzzy lookup discarded all inputs and returned nil. Now does real Jaccard similarity on tokenized queries with pre-built token sets.
FindSimilar was an O(n) full scan on every curate. Added FindSimilarWithText, which pre-filters via FTS5 and then runs cosine on the shortlist (~200 candidates max). The cosine pass in curate dedup is now bounded by the shortlist size instead of growing linearly with DB size.
N+1 query in convertSearchResponse. Was calling db.Get(id) per result. Replaced with GetMany(ids) bulk fetch.
goroutine race/leak in persistent embedding cache. Every Get spawned an unsupervised goroutine to write the access_count. Replaced with a single background flusher that coalesces updates every 500ms in a transaction.
--heuristic flag was documented but never parsed.
gemini failure crashed instead of degrading. Wired the existing Router so when Gemini init fails, the system falls through to the local keyword-projection embedder. Memories show model_used: onnx-hash-fallback so the agent knows the result quality is degraded.
searchFallback (LIKE-based) didn't filter superseded memories. Caught only after I started using bower supersede for real. Fixed.
missing -tags="sqlite_fts5" build flag caused no such module: fts5 errors. Documented in build instructions.

features added

hierarchical paths

new path column on memories (indexed, COLLATE NOCASE for tree browse)
CurateRequest.Path field, CLI flag --path security/auth/jwt
merge-update preserves existing path when caller passes empty

LLM structured curate

new Reason, Narrative, Rules, Facts columns on memories
CurationSystemPrompt updated to demand path/reason/narrative/rules/facts in extraction JSON, mirroring brv's curated-fact shape
Update uses COALESCE-if-empty semantics so re-curate doesn't blank existing structured fields
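
The keep-if-empty rule, which also covers the path-preserving merge-update, can be sketched in a few lines. The `memory` struct and function names here are hypothetical, and all structured fields are modeled as plain strings. In SQL, one common way to get the same effect is `COALESCE(NULLIF(?, ''), column)`; whether the fork uses that exact expression is not stated.

```go
package main

// memory holds only the fields relevant to the merge.
type memory struct {
	Path, Reason, Narrative, Rules, Facts string
}

// keepIfEmpty returns the incoming value unless it is empty, in which case
// the existing value survives.
func keepIfEmpty(incoming, existing string) string {
	if incoming == "" {
		return existing
	}
	return incoming
}

// mergeUpdate applies the rule field by field, so a re-curate that omits a
// field does not blank what is already stored.
func mergeUpdate(existing, incoming memory) memory {
	return memory{
		Path:      keepIfEmpty(incoming.Path, existing.Path),
		Reason:    keepIfEmpty(incoming.Reason, existing.Reason),
		Narrative: keepIfEmpty(incoming.Narrative, existing.Narrative),
		Rules:     keepIfEmpty(incoming.Rules, existing.Rules),
		Facts:     keepIfEmpty(incoming.Facts, existing.Facts),
	}
}
```

One consequence of this design is that a field cannot be cleared by passing an empty value; clearing would need a separate, explicit operation.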

temporal facts (Zep-style)

valid_from, valid_to, superseded_by columns + index on valid_to
db.Supersede(oldID, newID) marks oldID as superseded by newID; the old memory stays in the DB for audit but vanishes from default retrieval
db.RevalidateForce(id) clears the supersede flag (repair path)
ALL read paths now filter valid_to = 0: SearchFTS, AllEmbeddings, ListByPath, PathCounts, searchFallback, embeddingsForFTSCandidates
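
An in-memory sketch of the supersede semantics. The real implementation lives in sqlite; the `store` type and its methods below are illustrative only. The sketch assumes valid_to is stamped with a non-zero Unix timestamp at supersede time and that 0 means "still valid".

```go
package main

import "sort"

type memory struct {
	ID           string
	ValidTo      int64  // 0 means still valid
	SupersededBy string // empty unless retired
}

type store struct {
	byID map[string]*memory
}

// supersede retires oldID in favor of newID without deleting it.
func (s *store) supersede(oldID, newID string, now int64) bool {
	old, okOld := s.byID[oldID]
	_, okNew := s.byID[newID]
	if !okOld || !okNew || oldID == newID {
		return false
	}
	old.ValidTo = now
	old.SupersededBy = newID
	return true
}

// revalidateForce clears the supersede markers (the repair path).
func (s *store) revalidateForce(id string) bool {
	m, ok := s.byID[id]
	if !ok {
		return false
	}
	m.ValidTo = 0
	m.SupersededBy = ""
	return true
}

// activeIDs mirrors the "valid_to = 0" filter applied by every read path.
func (s *store) activeIDs() []string {
	var ids []string
	for id, m := range s.byID {
		if m.ValidTo == 0 {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}
```

The retired row stays addressable by id for audit, but any read that goes through the valid_to filter no longer sees it.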

multi-tenant scoping (mem0-style)

user_id, agent_id columns with partial indexes
CLI flags --user X --agent Y
(write side wired through Store; query-side filtering is the next step)
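
Since query-side filtering is not implemented yet, the following is purely a hypothetical sketch of what that next step could look like: a helper that appends scope conditions to an existing WHERE clause and returns the matching bind arguments. `scopeClause` is an invented name; only the column names user_id and agent_id come from the notes above.

```go
package main

import "strings"

// scopeClause returns a SQL fragment to append after an existing WHERE
// condition (for example "WHERE valid_to = 0"), plus its bind arguments.
// Empty scope values add no condition.
func scopeClause(userID, agentID string) (string, []any) {
	var conds []string
	var args []any
	if userID != "" {
		conds = append(conds, "user_id = ?")
		args = append(args, userID)
	}
	if agentID != "" {
		conds = append(conds, "agent_id = ?")
		args = append(args, agentID)
	}
	if len(conds) == 0 {
		return "", nil
	}
	return " AND " + strings.Join(conds, " AND "), args
}
```

Using placeholders instead of string interpolation keeps caller-supplied ids out of the SQL text.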

new commands

bower tree [prefix] [--depth N] — render topic tree with counts per node
bower ls [prefix] [--limit N] — list memories under path prefix
bower export --to ./out — write brv-compatible markdown context tree with YAML frontmatter and ## Reason / ## Raw Concept / ## Narrative / ## Rules / ## Facts sections
bower import ./tree — round-trip from the markdown export (or any brv-style tree). Parses frontmatter for path/type/tags, body sections for the structured fields
bower supersede <old-id> <new-id> — temporal retirement
bower mv <id> <new/path> — retroactive path assignment

infrastructure

bench/corpus.py — realistic Hermes-style memory generator across 5 categories x dozens of templates
bench/runner.py — side-by-side bower vs brv benchmark with proper brv event-stream JSON parsing, recall@k via token-overlap fuzzy match
bench/longterm.py — multi-day usage simulation (DB growth, recall decay, latency over time)
bench/stress.py — N-worker concurrent stress test

refactors

memColumns constant + memScan helper struct unify the 7 SELECT sites. Adding a new column now only touches one place.
Router (pkg/embedding/router.go) now satisfies the Embedder interface so it can slot in anywhere a plain embedder is expected. ModelName returns a composite like router(gemini-embedding-001 -> onnx-hash-fallback).
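
A rough sketch of a router that itself satisfies an embedder interface. The `Embedder` interface shape, `EmbedTagged`, and `stubEmbedder` are assumptions made for illustration; the actual interface in pkg/embedding may differ. The sketch also falls back per call, whereas the notes describe fallback when Gemini init fails.

```go
package main

import "errors"

// Embedder is an assumed interface shape, not the fork's definition.
type Embedder interface {
	Embed(text string) ([]float32, error)
	ModelName() string
}

// Router tries the primary embedder and degrades to the fallback.
type Router struct {
	primary, fallback Embedder
}

// Compile-time check that Router can slot in wherever an Embedder is expected.
var _ Embedder = (*Router)(nil)

// EmbedTagged also reports which model produced the vector, so callers can
// record degraded quality.
func (r *Router) EmbedTagged(text string) ([]float32, string, error) {
	if vec, err := r.primary.Embed(text); err == nil {
		return vec, r.primary.ModelName(), nil
	}
	vec, err := r.fallback.Embed(text)
	if err != nil {
		return nil, "", err
	}
	return vec, r.fallback.ModelName(), nil
}

func (r *Router) Embed(text string) ([]float32, error) {
	vec, _, err := r.EmbedTagged(text)
	return vec, err
}

func (r *Router) ModelName() string {
	return "router(" + r.primary.ModelName() + " -> " + r.fallback.ModelName() + ")"
}

// stubEmbedder stands in for a real model in this sketch.
type stubEmbedder struct {
	name string
	fail bool
}

func (s stubEmbedder) Embed(text string) ([]float32, error) {
	if s.fail {
		return nil, errors.New(s.name + " unavailable")
	}
	return []float32{float32(len(text))}, nil
}

func (s stubEmbedder) ModelName() string { return s.name }
```

Returning the model name alongside the vector is what lets a stored memory carry a marker such as model_used: onnx-hash-fallback.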

measured numbers (proper DB isolation)

metric	rv-gemini	bower heuristic curate (gemini embed)	bower fully offline	brv
cold start	76 ms	102 ms	30 ms	17,837 ms
curate p50	817 ms	817 ms	58 ms	23,395 ms
query p50	818 ms	815 ms	45 ms	17,556 ms
recall@5	1.000	1.000	1.000	0.800

corpus size: 50 memories, 20 probes. fully-offline curate hits 58ms because the keyword-projection embedder skips the Gemini network roundtrip. longterm sim (30 days, 12 facts/day, 15 queries/day) sustained recall@5 = 1.0 with DB stable at ~350KB and query p50 holding at 43ms.

stress test: 8 workers x 30 ops = 240 ops, 0 failures, p95=120ms. 16 workers x 50 ops = 800 ops, 0 failures, p95=307ms. sqlite WAL + busy_timeout holds.

caveats

the RETRIEVER_DB_PATH discovery invalidates the previous report's recall numbers. The earlier "rv 0.875 vs brv 0.800" comparison was noise from cross-contaminated benchmarks. With proper isolation bower hits recall@5 = 1.000 on this corpus, so the gap to brv is probably wider than originally claimed (brv didn't have the same bug).
brv side of the comparison is still small (n=10-20). Each brv op takes ~25s wall time, so a 200-probe benchmark would take hours (200 × 25s is about 1.4 hours for the probes alone, before curating the corpus). The latency comparison is rock solid (bower is dramatically faster, and every single run replicates); the recall comparison would benefit from a bigger brv sample if you ever want to make a stronger claim.
bower's heuristic mode skips the LLM curator but NOT the embedder. The "heuristic" curate is heuristic about content extraction, not about embedding. To get the ~60ms numbers you need fully-offline mode (env -u GEMINI_API_KEY -u GOOGLE_API_KEY), which uses the keyword-projection embedder instead of Gemini.
multi-tenant filtering is wired into Store but not Query. Memory gets the scoping fields, but no query-side WHERE user_id = ? filtering yet. Trivial to add when needed.
vector index rebuilds at every cold-start from AllEmbeddings(). That's fine for current DB sizes (~50ms even at 1000 memories) but would need an ANN index for 100K+ memories.

file map

pkg/storage/db.go — schema, scan, FTS5, temporal, paths, dedup
pkg/search/cache.go — Tier 0/1 query cache (real Jaccard now)
pkg/embedding/persistent_cache.go — batched access-count flusher
pkg/embedding/router.go — full Embedder impl with graceful fallback
pkg/memory/service.go — curate path with scoping + temporal + path
pkg/curation/curator.go — LLM-driven extract → store with path
pkg/curation/prompts.go — extraction prompt + ExtractedMemory shape
cmd/rv/commands.go — handleTree/Ls/Export/Import/Supersede/Mv
cmd/rv/main.go — service setup with router-wired fallback
pkg/types/types.go — Memory + CurateRequest + scoping/temporal/path
pkg/config/config.go — RETRIEVER_DB_PATH env honored
bench/ — corpus / runner / longterm / stress