bowerbird - architecture
Status: ✅ Implemented. 8,629 lines across Go + Python. All Phase 1-3 complete.
Date: 2026-06-02
Purpose: This document describes the architecture as built. The implementation matches this specification. See README.md for usage and quick start.
1. LANGUAGE CHOICE
1.1 Decision: Go (core) + Python (Hermes plugin)
Principle: Each language does what it does best.
| Component | Language | Rationale |
|---|---|---|
| Core engine (storage, search, embedding, lifecycle) | Go 1.26 | Single binary, 5-15ms startup, goroutines, zero-runtime deployment |
| Hermes MemoryProvider plugin | Python 3.11+ | Hermes is Python; native abc.ABC subclass, no FFI |
| Optional TypeScript CLI wrapper | TypeScript | Only if Node.js ecosystem integration is needed; Go CLI is primary |
| Vector SIMD acceleration (optional) | C via CGo | Only if Go scalar vector math becomes the bottleneck at >500K vectors |
1.2 Why Go Is The Correct Core Language
Startup time: 5-15ms (vs 200-500ms Node, 50-100ms Python)
Concurrency: Goroutines with work-stealing scheduler -- trivial fan-out
Deployment: Single statically-linked binary, `curl ... | sh` installable
Memory: No GC pauses >1ms with GOGC tuning
Ecosystem: mature sqlite3 driver (mattn/go-sqlite3), CGo for SIMD escape hatch
Dev velocity: 2x faster than Rust for mutation-heavy memory lifecycle code
1.3 When To Introduce A Second Language
Hard gates, not preferences:
- Python: Only for plugins/hermes/. Never in the hot path. Python talks to Go via subprocess.run(["rv", "query", text]) -- the exact same contract ByteRover uses.
- C/Rust via CGo: Only when p95 vector search latency exceeds 50ms at corpus >500K. The Embedder and VectorIndex interfaces are already swappable. Do not introduce CGo before this threshold is crossed.
- TypeScript: Only if a TypeScript SDK is needed for npm ecosystem consumers. The Go CLI handles all operational use.
2. CORE ARCHITECTURE
2.1 Process Model: Embedded Library with CLI Frontend
Retriever is not a daemon. It is a library with a CLI frontend. This is the fundamental architectural constraint that differentiates it from Qdrant, ChromaDB, and other server-first memory systems.
                       ┌──────────────────────┐
                       │  bower (Go binary)   │
                       │                      │
Hermes ──subprocess──► │  cmd/rv/main.go      │
                       │        │             │
Claude Code ──MCP────► │  cmd/rv/mcp.go       │
                       │        │             │
Web dashboard ──HTTP─► │  cmd/rv/serve.go     │
                       │        │             │
                       │  ◄───pkg/────────────┤
                       │  memory.Service      │
                       │  search.Engine       │
                       │  storage.DB          │
                       │  embedding.Embedder  │
                       └──────────────────────┘
Three modes of operation:
| Mode | Invocation | Use Case |
|---|---|---|
| One-shot CLI | bower query "text" | Hermes subprocess, shell pipelines, scripts |
| STDIO MCP | bower mcp | Claude Code, Cursor, any MCP host |
| HTTP daemon | bower serve | Web dashboards, multi-client, persistent |
Startup sequence (one-shot CLI):
T+0ms: Process start (kernel ELF load)
T+2ms: main() parses args, loads config from ~/.retriever/config.json
T+5ms: storage.Open() -- sqlite3 with WAL, ~32MB cache
T+8ms: Load vector index from SQLite embeddings table into RAM (~293 MiB at 100K x 768D float32)
T+12ms: GeminiEmbedder initialized (validates API key, no network call yet)
T+15ms: Ready to serve query
Why not a daemon: Daemons require process supervision, port management, and add failure modes (is it running? which port? stale PID file?). ByteRover tried this and it was a pain point. The one-shot model with lazy initialization is simpler, more reliable, and still hits the sub-50ms latency target.
When to use bower serve (HTTP daemon): Only when you need concurrent multi-client
access or want to avoid the 15ms startup per query. The daemon loads the vector index
once and serves queries with 5-8ms startup overhead eliminated.
2.2 Communication with Hermes Agent
Hermes communicates with Retriever via the exact same subprocess contract that
ByteRover (brv) uses. This is deliberate: drop-in replacement, zero Hermes code changes.
Hermes MemoryProvider contract:
# plugins/hermes/provider.py
import json
import subprocess


class RetrieverProvider(MemoryProvider):
    """
    Implements Hermes MemoryProvider ABC.
    Communicates with bower binary via subprocess.
    """

    def prefetch(self, query: str) -> list[Memory]:
        """Called before each LLM turn. Returns context to inject."""
        result = subprocess.run(
            ["rv", "query", query, "--limit", "5"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode != 0:
            # A failed query yields no context instead of a JSON decode error.
            return []
        data = json.loads(result.stdout)
        return [self._to_memory(r) for r in data["data"]["results"]]

    def sync_turn(self, messages: list[Message]) -> None:
        """Called after each LLM turn. Extracts and stores memories."""
        text = self._extract_curatable_content(messages)
        if text:
            subprocess.run(
                ["rv", "curate", text],
                capture_output=True, text=True, timeout=10
            )
Key integration points:
- prefetch(query) -> list[Memory]: Called BEFORE the LLM sees the next user message. This is where predictive prefetching lives (v0.3).
- sync_turn(messages) -> None: Called AFTER the LLM responds. This is where memory extraction and curation happens.
- JSON over stdout: All CLI commands output a uniform JSON envelope:
{"command": "query", "success": true, "data": {...}}
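A consumer written in Go could decode this envelope roughly as follows. This is a minimal sketch: only `command`, `success`, `data`, and `data.results` come from the text above; the `QueryResult` fields (`id`, `content`, `score`) and the `decodeQuery` helper are hypothetical names chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the uniform CLI output shape: command, success, data.
type Envelope struct {
	Command string          `json:"command"`
	Success bool            `json:"success"`
	Data    json.RawMessage `json:"data"` // decoded per command
}

// QueryResult is a hypothetical result shape; the real field set may differ.
type QueryResult struct {
	ID      string  `json:"id"`
	Content string  `json:"content"`
	Score   float64 `json:"score"`
}

// QueryData is the assumed payload of the "query" command.
type QueryData struct {
	Results []QueryResult `json:"results"`
}

// decodeQuery parses one envelope and returns its results,
// or an error if the JSON is malformed or success is false.
func decodeQuery(raw []byte) ([]QueryResult, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if !env.Success {
		return nil, fmt.Errorf("command %q reported failure", env.Command)
	}
	var data QueryData
	if err := json.Unmarshal(env.Data, &data); err != nil {
		return nil, fmt.Errorf("decode data: %w", err)
	}
	return data.Results, nil
}

func main() {
	raw := []byte(`{"command":"query","success":true,"data":{"results":[{"id":"a1","content":"uses WAL","score":0.9}]}}`)
	results, err := decodeQuery(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(results), results[0].ID)
}
```

Keeping `data` as `json.RawMessage` lets one envelope type serve every command, with the payload decoded only after `success` has been checked.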
2.3 Multi-Language Bridging Strategy
The bridge is JSON over stdout/stdin. No gRPC, no Unix sockets, no FFI.
Python (Hermes)           Go (rv binary)           TypeScript (SDK)
      │                        │                         │
      │   subprocess.run()     │                         │
      ├── argv / stdin ───────►│                         │
      │◄── stdout JSON ────────┤                         │
      │                        │                         │
      │                        ├── MCP STDIO ───────────►│
      │                        │◄── JSON-RPC ────────────┤
Why not FFI (CGo -> Python, or PyO3 -> Rust):
- FFI adds build complexity (Python headers, shared library loading).
- Subprocess isolation means a crash in rv cannot corrupt Hermes' memory space.
- JSON serialization is <2ms for typical payload sizes.
- The contract is versioned: bower --version reports SemVer, Hermes can gate features.
3. STORAGE ARCHITECTURE
3.1 Database Selection: SQLite 3.44+ with WAL, FTS5, and BLOB vector storage
Single file: ~/.retriever/memory.db
SQLite PRAGMA configuration (applied at connection open):
PRAGMA journal_mode=WAL; -- Concurrent reads during write
PRAGMA synchronous=NORMAL; -- Safe with WAL, 2x write speed
PRAGMA cache_size=-32000; -- 32MB page cache
PRAGMA busy_timeout=5000; -- 5s wait on lock (single writer is fine)
PRAGMA foreign_keys=ON; -- Enforce referential integrity
PRAGMA mmap_size=268435456; -- 256MB memory-mapped I/O
PRAGMA temp_store=MEMORY; -- Temp tables in RAM
3.2 Schema Design
Table: memories (core memory records)
CREATE TABLE memories (
id TEXT PRIMARY KEY, -- hex-encoded SHA-256[:16]
type TEXT NOT NULL DEFAULT 'fact', -- fact|pattern|decision|procedure|context
content TEXT NOT NULL, -- Full memory text
summary TEXT NOT NULL DEFAULT '', -- First-sentence extractive summary
importance REAL NOT NULL DEFAULT 0.5, -- [0.0, 1.0] computed importance
access_count INTEGER NOT NULL DEFAULT 0, -- Number of times retrieved
create_time INTEGER NOT NULL, -- Unix milliseconds
access_time INTEGER NOT NULL, -- Unix milliseconds, last retrieval
decay_rate REAL NOT NULL DEFAULT 0.01, -- Per-memory decay factor
source_conv_id TEXT, -- Conversation that created this memory (v0.2)
supersedes_id TEXT, -- ID of memory this one replaces (v0.2)
confidence REAL NOT NULL DEFAULT 1.0, -- [0.0, 1.0] source confidence (v0.2)
tags TEXT NOT NULL DEFAULT '[]', -- JSON array of strings
keywords TEXT NOT NULL DEFAULT '[]', -- JSON array of extracted keywords
metadata TEXT NOT NULL DEFAULT '{}' -- JSON object for extensibility
);
CREATE INDEX idx_memories_type ON memories(type);
CREATE INDEX idx_memories_access_time ON memories(access_time);
CREATE INDEX idx_memories_importance ON memories(importance DESC);
CREATE INDEX idx_memories_create_time ON memories(create_time);
Virtual Table: memories_fts (BM25 full-text search via FTS5)
CREATE VIRTUAL TABLE memories_fts USING fts5(
summary,
content,
tags,
keywords,
content='memories',
content_rowid='rowid',
tokenize='porter unicode61 remove_diacritics 2'
);
-- Triggers keep FTS5 synchronized with memories table
CREATE TRIGGER memories_fts_insert AFTER INSERT ON memories BEGIN
INSERT INTO memories_fts(rowid, summary, content, tags, keywords)
VALUES (new.rowid, new.summary, new.content, new.tags, new.keywords);
END;
CREATE TRIGGER memories_fts_delete AFTER DELETE ON memories BEGIN
INSERT INTO memories_fts(memories_fts, rowid, summary, content, tags, keywords)
VALUES ('delete', old.rowid, old.summary, old.content, old.tags, old.keywords);
END;
CREATE TRIGGER memories_fts_update AFTER UPDATE ON memories BEGIN
INSERT INTO memories_fts(memories_fts, rowid, summary, content, tags, keywords)
VALUES ('delete', old.rowid, old.summary, old.content, old.tags, old.keywords);
INSERT INTO memories_fts(rowid, summary, content, tags, keywords)
VALUES (new.rowid, new.summary, new.content, new.tags, new.keywords);
END;
Table: embeddings (vector storage as BLOBs)
CREATE TABLE embeddings (
memory_id TEXT PRIMARY KEY REFERENCES memories(id) ON DELETE CASCADE,
embedding BLOB NOT NULL, -- float32[] as little-endian bytes (4 bytes per element)
model TEXT NOT NULL DEFAULT '', -- e.g. "text-embedding-004"
dimension INTEGER NOT NULL DEFAULT 768,
created_at INTEGER NOT NULL DEFAULT (unixepoch('subsec') * 1000)
);
CREATE INDEX idx_embeddings_model ON embeddings(model);
BLOB encoding format:
- Each float32 is stored as 4 bytes, little-endian.
- For 768 dimensions: 768 * 4 = 3072 bytes per row.
- At 100K memories: 100000 * 3072 = 307,200,000 bytes (~293 MiB) on disk, and the same again in RAM when loaded.
- With int8 quantization: 100000 * 768 * 1 = 76,800,000 bytes (~73 MiB).
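The BLOB layout above can be sketched with the standard library alone. The function names `encodeVector` and `decodeVector` are illustrative assumptions, not names taken from the codebase.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector packs float32 values as little-endian bytes, 4 bytes per element.
func encodeVector(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(v))
	}
	return buf
}

// decodeVector reverses encodeVector and rejects blobs whose length
// is not a whole number of float32 values.
func decodeVector(blob []byte) ([]float32, error) {
	if len(blob)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(blob))
	}
	vec := make([]float32, len(blob)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[4*i:]))
	}
	return vec, nil
}

func main() {
	blob := encodeVector(make([]float32, 768))
	fmt.Println(len(blob)) // 768 elements * 4 bytes each
}
```

Checking the decoded length against the row's `dimension` column would catch a blob written by a different model; that check is left out here for brevity.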
Table: relations (typed, weighted graph edges)
CREATE TABLE relations (
source_id TEXT NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
target_id TEXT NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
type TEXT NOT NULL, -- causes|informs|contradicts|supersedes|example_of|prerequisite_for|led_to|related_to
strength REAL NOT NULL DEFAULT 1.0, -- [0.0, 1.0]
created_at INTEGER NOT NULL DEFAULT (unixepoch('subsec') * 1000),
PRIMARY KEY (source_id, target_id, type)
);
CREATE INDEX idx_relations_source ON relations(source_id);
CREATE INDEX idx_relations_target ON relations(target_id);
CREATE INDEX idx_relations_type ON relations(type);
Table: embedding_cache (persistent embedding cache, survives restarts)
CREATE TABLE embedding_cache (
content_hash TEXT PRIMARY KEY, -- SHA-256 hex digest of model + ":" + text (matches CacheKey)
embedding BLOB NOT NULL, -- float32[] as little-endian bytes
model TEXT NOT NULL, -- which model produced this
dimension INTEGER NOT NULL,
created_at INTEGER NOT NULL,
hit_count INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_embedding_cache_model ON embedding_cache(model);
Table: audit_log (memory provenance, v0.2)
CREATE TABLE audit_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
memory_id TEXT NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
action TEXT NOT NULL, -- created|updated|merged|pruned|accessed|contradicted
timestamp INTEGER NOT NULL,
details TEXT NOT NULL DEFAULT '{}' -- JSON: old values, reason, etc.
);
CREATE INDEX idx_audit_memory ON audit_log(memory_id);
CREATE INDEX idx_audit_timestamp ON audit_log(timestamp);
Table: schema_version (migration tracking)
CREATE TABLE schema_version (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL,
description TEXT NOT NULL
);
3.3 Index Strategy
| Query Pattern | Index Used | Complexity |
|---|---|---|
| FTS5 keyword search | memories_fts (virtual table, porter + unicode61 tokenizer) | O(log N) with bm25 scoring |
| Vector similarity (brute force) | In-memory []float32 slices, no SQL index | O(N * D) where N=memories, D=dims |
| Get memory by ID | memories.id PRIMARY KEY | O(log N) B-tree lookup |
| List by type | idx_memories_type | O(log N) |
| Prune by decay | idx_memories_access_time + idx_memories_importance | O(N) scan, indexed sort |
| Graph neighbors | idx_relations_source + idx_relations_target | O(log N) per hop |
| Cache lookup | embedding_cache.content_hash PRIMARY KEY | O(log N) B-tree lookup |
3.4 Data Layout On Disk
~/.retriever/
├── memory.db -- SQLite database (all tables)
├── memory.db-wal -- Write-Ahead Log (auto-managed by SQLite)
├── memory.db-shm -- Shared memory for WAL index
└── config.json -- User configuration
Expected sizes (empirically estimated):
| Corpus Size | DB File | WAL (typical) | RAM (float32 vectors) | RAM (int8 vectors) |
|---|---|---|---|---|
| 1K memories | ~5 MB | <1 MB | ~3 MB | ~0.8 MB |
| 10K memories | ~40 MB | ~2 MB | ~27 MB | ~7 MB |
| 100K memories | ~350 MB | ~5 MB | ~270 MB | ~68 MB |
| 500K memories | ~1.7 GB | ~10 MB | ~1.3 GB | ~340 MB |
| 1M memories | ~3.5 GB | ~20 MB | ~2.7 GB | ~680 MB |
4. EMBEDDING ARCHITECTURE
4.1 Model Selection Matrix
QUALITY ───────────────────────────────►
MTEB Retrieval score (higher is better)
Model Score Dims Cost Latency Offline Best For
──────────────────────────────────────────────────────────────────────────────────
Gemini text-embedding-004 80.3% 768 Free ~50ms No Default: good enough, free, fast
OpenAI 3-large 89.3% 3072 $0.13/1M tk ~50ms No Maximum quality when budget allows
OpenAI 3-small 85.1% 512 $0.02/1M tk ~30ms No Budget API with good quality
Voyage-3-large 90.1% 2048 Paid ~60ms No Best quality, highest cost
Voyage-3-lite 87.2% 512 Paid ~30ms No Voyage budget option
all-MiniLM-L6-v2 (ONNX) ~75% 384 Free ~5ms Yes Local fallback, CI/CD, air-gapped
bge-small-en-v1.5 (ONNX) ~78% 384 Free ~8ms Yes Better local quality
bge-base-en-v1.5 (ONNX) ~82% 768 Free ~15ms Yes Best local quality
gte-small (ONNX) ~79% 384 Free ~6ms Yes General-purpose local
4.2 Embedding Provider Selection Logic
// Embedder selection waterfall -- evaluated at startup
func SelectEmbedder(cfg EmbeddingConfig) (Embedder, error) {
// 1. Explicit provider in config
switch cfg.Provider {
case "gemini":
return NewGeminiEmbedder(...)
case "openai":
return NewOpenAIEmbedder(...)
case "voyage":
return NewVoyageEmbedder(...)
case "onnx":
return NewONNXEmbedder(cfg.ONNXModelPath)
case "local":
return NewONNXEmbedder(autoDetectBestLocalModel())
case "auto", "":
// Continue to auto-detection below
default:
return nil, fmt.Errorf("unknown embedding provider %q", cfg.Provider)
}
// 2. Check for API keys in environment
if key := os.Getenv("GEMINI_API_KEY"); key != "" {
return NewGeminiEmbedder(GeminiEmbedderConfig{APIKey: key})
}
if key := os.Getenv("OPENAI_API_KEY"); key != "" {
return NewOpenAIEmbedder(OpenAIEmbedderConfig{APIKey: key})
}
// 3. Fall back to local ONNX
modelPath := autoDetectBestLocalModel()
if modelPath != "" {
return NewONNXEmbedder(modelPath)
}
return nil, fmt.Errorf("no embedding provider available: set GEMINI_API_KEY or install ONNX models")
}
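func autoDetectBestLocalModel() string {
	// os.Stat does not expand "~", so resolve the home directory explicitly.
	// Requires the "os" and "path/filepath" imports.
	home, err := os.UserHomeDir()
	if err != nil {
		return ""
	}
	// Model files in order of quality preference:
	names := []string{
		"bge-base-en-v1.5.onnx",
		"bge-small-en-v1.5.onnx",
		"all-MiniLM-L6-v2.onnx",
	}
	for _, name := range names {
		p := filepath.Join(home, ".retriever", "models", name)
		if _, err := os.Stat(p); err == nil {
			return p
		}
	}
	return ""
}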
4.3 Batching Strategy
API embeddings (Gemini, OpenAI, Voyage):
- Individual requests are parallelized with a concurrency semaphore of 10 goroutines.
- Gemini's API does not support true batch embedding. Each text is a separate HTTP request. Parallelism is the only optimization.
- OpenAI supports batches of up to 2048 texts per request. Use this when the Embedder is OpenAI.
- Voyage supports batches of up to 128 texts per request.
// Embedding concurrency configuration
const (
MaxConcurrentAPIRequests = 10 // Limit to avoid rate-limit hammering
APIBatchSize = 100 // Texts per batch call when API supports it
BatchTimeout = 5 * time.Second
)
Local ONNX embeddings:
- ONNX Runtime processes one text at a time (no native batching without dynamic axis).
- Single-threaded, <10ms per text for 384-dim models.
- For bulk operations (curate with many memories), run sequentially with progress.
4.4 Caching Architecture
Three-tier cache hierarchy:
Tier 1: In-memory LRU (sync.Map + ring buffer)
├── Size: 10,000 entries (~30 MB for 768-dim)
├── Eviction: LRU with TTL of 24 hours
├── Latency: ~100ns (map lookup)
└── Hit rate: >95% for stable content
Tier 2: SQLite persistent cache (embedding_cache table)
├── Size: Unlimited (disk-backed)
├── Eviction: Manual (rv cache prune --older-than 30d)
├── Latency: ~0.5ms (indexed B-tree lookup)
└── Survives restarts
Tier 3: Pre-compute during idle
├── After `bower curate`, a background goroutine warms the cache
├── For all uncached memories, embed and store
└── Configurable: bower warm --all or automatic
Cache key derivation:
func CacheKey(text string, model string) string {
h := sha256.Sum256([]byte(model + ":" + text))
return hex.EncodeToString(h[:])
}
Cache invalidation rules:
- Model change: all entries for the old model are invalidated (different embedding space).
- Content change: the cache key changes automatically (content-hash-based).
- API embeddings never expire based on time alone (they are immutable).
- TTL of 24h applies to the in-memory LRU tier only (to bound RAM).
4.5 Local vs API Tradeoff Decisions
| Factor | API (Gemini) | Local (ONNX) |
|---|---|---|
| Quality | 80.3% MTEB | ~75-78% MTEB |
| Latency (single) | 50ms (network) | 5-8ms (CPU) |
| Latency (batch 100) | ~500ms (10 concurrent) | ~500-800ms (sequential) |
| Cost | Free (1500 RPM limit) | Free (no limit) |
| Offline | No | Yes |
| Setup | API key env var | ~200MB model download |
| Dimension | 768 | 384 |
| Privacy | Text leaves machine | Everything local |
| Rate limit | 1500 requests/minute | Unlimited |
Decision matrix for automatic selection:
Is GEMINI_API_KEY set?
├── YES → Use Gemini (free, good quality, fast enough)
│   └── Is network unreachable?
│       └── YES → Log warning, use local ONNX fallback
└── NO → Is OPENAI_API_KEY set?
    ├── YES → Use OpenAI (best quality, paid)
    └── NO → Use local ONNX (always available, no keys needed)
5. VECTOR SEARCH
5.1 Algorithm Selection: Phased Strategy
Phase 1: Brute-force cosine similarity (MVP, <100K vectors)
Algorithm: Exhaustive scan with goroutine parallelism
Partition: Split vector space into GOMAXPROCS shards
SIMD: Pure Go scalar (no CGo dependency)
Memory: Float32 vectors in contiguous []float32 slices
Latency: ~12ms for 100K x 768D (with 8 goroutines)
Phase 2: int8 quantization (v0.2, <500K vectors)
Algorithm: Same brute-force, but on quantized int8 vectors
Quantization: Per-vector min/max scaling to [-128, 127]
SIMD: Go assembler or simsimd CGo bindings
Memory: 4x reduction vs float32
Latency: ~8ms for 100K x 768D (smaller data, better cache)
Phase 3: HNSW index (v0.4, >500K vectors)
Algorithm: HNSW (Hierarchical Navigable Small World)
Parameters: M=16, efConstruction=200, efSearch=50
Recall: ~95% @ k=10 (vs brute-force baseline)
Memory: ~2x float32 vectors (graph edges + vectors)
Latency: <5ms regardless of corpus size
Implementation: Custom pure-Go HNSW or chromem-go integration
5.2 Phase 1: Brute-Force Implementation
// VectorIndex is the in-memory vector search index.
type VectorIndex struct {
mu sync.RWMutex
vectors map[string][]float32 // memoryID -> embedding
dims int
shards int // Number of parallel shards
}
// Search finds the top-K most similar vectors to the query.
func (vi *VectorIndex) Search(query []float32, k int, minScore float32) []ScoredID {
vi.mu.RLock()
defer vi.mu.RUnlock()
// Partition IDs across shards
ids := make([]string, 0, len(vi.vectors))
for id := range vi.vectors {
ids = append(ids, id)
}
// Each shard computes top-K for its partition
shardSize := (len(ids) + vi.shards - 1) / vi.shards
results := make(chan []ScoredID, vi.shards)
for s := 0; s < vi.shards; s++ {
start := s * shardSize
end := min(start+shardSize, len(ids))
if start >= end {
results <- nil
continue
}
go func(partition []string) {
shardResults := bruteForceTopK(vi.vectors, partition, query, k, minScore)
results <- shardResults
}(ids[start:end])
}
// Merge shard results
all := make([]ScoredID, 0, k*vi.shards)
for s := 0; s < vi.shards; s++ {
shardResults := <-results
all = append(all, shardResults...)
}
// Global top-K
sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
if len(all) > k {
all = all[:k]
}
return all
}
func bruteForceTopK(
vectors map[string][]float32,
ids []string,
query []float32,
k int,
minScore float32,
) []ScoredID {
// Min-heap of size K
heap := &boundedHeap{k: k}
heap.items = make([]ScoredID, 0, k)
for _, id := range ids {
vec := vectors[id]
score := cosineSimilarity(query, vec)
if score >= minScore {
heap.Push(ScoredID{ID: id, Score: score})
}
}
return heap.items
}
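The listing above relies on `cosineSimilarity` and `boundedHeap`, which are not shown. The following is one possible shape for them, written as a sketch: the names and the `k`/`items` fields come from the listing, while the bodies are assumptions rather than the shipped implementation.

```go
package main

import (
	"fmt"
	"math"
)

type ScoredID struct {
	ID    string
	Score float32
}

// cosineSimilarity returns dot(a,b) / (|a| * |b|).
// It returns 0 for mismatched lengths or when either vector is all zeros.
func cosineSimilarity(a, b []float32) float32 {
	if len(a) != len(b) {
		return 0
	}
	var dot, normA, normB float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		normA += x * x
		normB += y * y
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(normA) * math.Sqrt(normB)))
}

// boundedHeap keeps the k highest-scoring items in a min-heap,
// so items[0] is always the lowest score currently kept.
type boundedHeap struct {
	k     int
	items []ScoredID
}

func (h *boundedHeap) Push(s ScoredID) {
	if h.k <= 0 {
		return
	}
	if len(h.items) < h.k {
		// Not full yet: append and sift up.
		h.items = append(h.items, s)
		for i := len(h.items) - 1; i > 0; {
			parent := (i - 1) / 2
			if h.items[parent].Score <= h.items[i].Score {
				break
			}
			h.items[parent], h.items[i] = h.items[i], h.items[parent]
			i = parent
		}
		return
	}
	if s.Score <= h.items[0].Score {
		return // not better than the worst item kept
	}
	// Replace the current minimum and sift down.
	h.items[0] = s
	for i := 0; ; {
		left, right, smallest := 2*i+1, 2*i+2, i
		if left < len(h.items) && h.items[left].Score < h.items[smallest].Score {
			smallest = left
		}
		if right < len(h.items) && h.items[right].Score < h.items[smallest].Score {
			smallest = right
		}
		if smallest == i {
			break
		}
		h.items[i], h.items[smallest] = h.items[smallest], h.items[i]
		i = smallest
	}
}

func main() {
	h := &boundedHeap{k: 2}
	for _, s := range []float32{0.1, 0.9, 0.5, 0.7} {
		h.Push(ScoredID{ID: fmt.Sprint(s), Score: s})
	}
	fmt.Println(len(h.items), cosineSimilarity([]float32{1, 0}, []float32{1, 0}))
}
```

The heap leaves `items` in heap order, not sorted order, which is fine here because Search sorts the merged shard results before truncating to k.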
5.3 Phase 2: int8 Quantization (v0.2)
// QuantizedVector stores an int8-quantized embedding.
type QuantizedVector struct {
Min float32 // Per-vector minimum value
Max float32 // Per-vector maximum value
Data []int8 // Quantized dimensions, length = dims
}
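// Quantize converts float32 embedding to int8.
func Quantize(vec []float32) QuantizedVector {
	minVal, maxVal := float32(math.MaxFloat32), float32(-math.MaxFloat32)
	for _, v := range vec {
		if v < minVal { minVal = v }
		if v > maxVal { maxVal = v }
	}
	data := make([]int8, len(vec))
	if len(vec) == 0 || maxVal == minVal {
		// Empty or constant vector: avoid dividing by a zero range.
		return QuantizedVector{Min: minVal, Max: maxVal, Data: data}
	}
	scale := 255.0 / (maxVal - minVal)
	for i, v := range vec {
		// Round, then clamp so float error cannot push the value outside int8.
		q := math.Round(float64((v-minVal)*scale)) - 128
		data[i] = int8(math.Max(-128, math.Min(127, q)))
	}
	return QuantizedVector{Min: minVal, Max: maxVal, Data: data}
}

// CosineSimilarityInt8 computes approximate cosine similarity on quantized vectors.
// Achieves ~0.99 correlation with float32 computation.
// The int8 codes are compared directly, so the approximation assumes each vector's
// values are roughly centered on zero (Min close to -Max).
func CosineSimilarityInt8(a, b QuantizedVector) float32 {
	if len(a.Data) != len(b.Data) {
		return 0
	}
	var dot int32
	var normA, normB int32
	for i := range a.Data {
		dot += int32(a.Data[i]) * int32(b.Data[i])
		normA += int32(a.Data[i]) * int32(a.Data[i])
		normB += int32(b.Data[i]) * int32(b.Data[i])
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	// normA and normB are squared lengths; cosine divides by the product of their square roots.
	return float32(float64(dot) / (math.Sqrt(float64(normA)) * math.Sqrt(float64(normB))))
}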
5.4 Latency Targets By Corpus Size
| Corpus Size | Phase | Algorithm | p50 Latency | p95 Latency | RAM |
|---|---|---|---|---|---|
| 1K vectors | 1 | Brute-force float32 | <1ms | <2ms | ~3 MB |
| 10K vectors | 1 | Brute-force float32 | ~2ms | <5ms | ~27 MB |
| 50K vectors | 1 | Brute-force float32 | ~8ms | <15ms | ~135 MB |
| 100K vectors | 1 | Brute-force float32 (8 shards) | ~12ms | <25ms | ~270 MB |
| 100K vectors | 2 | Brute-force int8 (8 shards) | ~6ms | <12ms | ~68 MB |
| 500K vectors | 2 | Brute-force int8 (8 shards) | ~30ms | <50ms | ~340 MB |
| 500K vectors | 3 | HNSW float32 | <4ms | <8ms | ~540 MB |
| 1M+ vectors | 3 | HNSW + sharding | <8ms | <15ms | ~2.7 GB |
5.5 When To Trigger Phase Transition
func (vi *VectorIndex) shouldUpgrade() IndexTier {
count := vi.Count()
switch {
case count < 100_000:
return TierBruteForceFloat32
case count < 500_000:
return TierBruteForceInt8
default:
return TierHNSW
}
}
// Automatic upgrade on Insert if threshold crossed
func (vi *VectorIndex) Insert(id string, vec []float32) error {
vi.mu.Lock()
defer vi.mu.Unlock()
vi.vectors[id] = vec
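// Thresholds use >= so they agree with shouldUpgrade, where a count of
// exactly 100_000 or 500_000 already belongs to the next tier.
if len(vi.vectors) >= 100_000 && vi.tier == TierBruteForceFloat32 {
vi.upgradeToInt8() // async, non-blocking
}
if len(vi.vectors) >= 500_000 && vi.tier == TierBruteForceInt8 {
vi.upgradeToHNSW() // async, non-blocking
}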
return nil
}
6. MEMORY LIFECYCLE
6.1 Complete Lifecycle State Machine
┌─────────┐
│ Empty │
└────┬────┘
│ bower curate "text"
▼
┌─────────┐
│ Created │──→ importance = computeImportance(content)
└────┬────┘ decay_rate = 0.01
│ confidence = 1.0 (or LLM-provided)
│
┌──────────┼──────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌─────────┐
│ Active │ │Merged │ │Superseded│
│ │ │(duplicate│(explicit │
│normal │ │ found) │ replace)│
└───┬────┘ └────┬───┘ └────┬────┘
│ │ │
┌────────┼───────┐ │ │
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌───────┐ ┌──────┐ ┌──────┐ ┌──────────┐
│Accessed│ │Decayed│ │Pruned│ │Compressed │
│(boost) │ │(low │ │(<thr-│ │(merged │
│ │ │ eff) │ │ eshold)│ │into pattern)
└───┬────┘ └──┬───┘ └──┬───┘ └──────────┘
│ │ │
└─────────┘ │
(loops back) │
┌───▼───┐
│Deleted │
└───────┘
6.2 Importance Scoring Algorithm
// ComputeImportance calculates the initial importance score for new content.
// Combines multiple signals into a [0.0, 1.0] score.
func ComputeImportance(content string, memType MemType, metadata map[string]string) float64 {
var score float64
// Signal 1: Content length (longer = more information, up to a point)
lengthScore := math.Min(float64(len(content))/500.0, 1.0) * 0.10
// Signal 2: Named entity density (proper nouns, dates, numbers indicate facts)
entityCount := countNamedEntities(content)
entityScore := math.Min(float64(entityCount)/10.0, 1.0) * 0.15
// Signal 3: Decision/pattern keywords (these are high-value memory types)
keywordScore := 0.0
if memType == MemDecision { keywordScore = 0.3 }
if memType == MemPattern { keywordScore = 0.25 }
if memType == MemProcedure { keywordScore = 0.2 }
// Signal 4: Explicit priority hint from metadata
priorityScore := 0.0
if p, ok := metadata["priority"]; ok {
switch p {
case "critical": priorityScore = 0.3
case "high": priorityScore = 0.2
case "low": priorityScore = -0.1
}
}
// Signal 5: Source confidence (LLM extraction confidence, if available)
confidenceScore := 0.0
if conf, ok := metadata["confidence"]; ok {
if c, err := strconv.ParseFloat(conf, 64); err == nil {
confidenceScore = c * 0.2
}
}
score = lengthScore + entityScore + keywordScore + priorityScore + confidenceScore
return clamp(score, 0.05, 1.0) // Minimum importance to avoid immediate pruning
}
6.3 Temporal Decay Model
// EffectiveImportance returns the importance score after temporal decay.
// Implements Ebbinghaus-inspired forgetting curve.
func (m *Memory) EffectiveImportance(now time.Time) float64 {
daysSinceAccess := now.Sub(m.AccessTime).Hours() / 24.0
freshnessBonus := 1.0 / (1.0 + m.DecayRate*daysSinceAccess)
// Age-based decay (slow, gentle — older memories fade unless accessed)
daysSinceCreation := now.Sub(m.CreateTime).Hours() / 24.0
ageFactor := 1.0 / (1.0 + 0.001*daysSinceCreation) // Very slow: 50% at ~2.7 years
return m.Importance * freshnessBonus * ageFactor
}
// RecordAccess strengthens a memory when it's retrieved.
func (m *Memory) RecordAccess(now time.Time) {
m.AccessCount++
m.AccessTime = now
m.Importance = math.Min(1.0, m.Importance + 0.01) // Slight boost
m.DecayRate = math.Max(0.001, m.DecayRate * 0.95) // Slow decay further
}
Decay curve properties:
- Decay rate 0.01: drops to 50% after ~100 days without access.
- Decay rate 0.05: drops to 50% after ~20 days without access.
- Each access slows decay by 5% (multiplied by 0.95).
- Each access boosts importance by 0.01 (capped at 1.0).
- Minimum decay rate is 0.001 (never fully static).
6.4 Memory Update (Merge) Rules
// MergeDecision determines whether a new memory should be a create, update, or merge.
type MergeDecision int
const (
MergeCreate MergeDecision = iota // Create new memory
MergeUpdate // Update existing memory (same ID)
MergeAppend // Append to existing content
MergeSupersede // Replace existing, link as superseded
)
func DecideMerge(existing *Memory, newContent string, similarity float64) MergeDecision {
switch {
case similarity < 0.70:
return MergeCreate // Different enough: new memory
case similarity < 0.85:
return MergeAppend // Related: append to existing
case similarity < 0.95:
return MergeUpdate // Very similar: replace content
default:
return MergeSupersede // Nearly identical: supersede with link
}
}
6.5 Conflict Resolution
When two memories contradict each other (detected via LLM in v0.3):
func ResolveContradiction(existing *Memory, contradictory *Memory) Resolution {
// 1. Trust recency: newer information is more likely correct
if contradictory.CreateTime.After(existing.CreateTime.Add(7 * 24 * time.Hour)) {
// Newer by more than a week → supersede old
return Resolution{Supersede: existing, Keep: contradictory}
}
// 2. Trust confidence: higher confidence source wins
if contradictory.Confidence > existing.Confidence + 0.2 {
return Resolution{Supersede: existing, Keep: contradictory}
}
// 3. Flag for human review
return Resolution{
FlagForReview: true,
Memories: []*Memory{existing, contradictory},
Reason: "Conflicting information with similar confidence",
}
}
6.6 Pruning Policy
// PruneDecayed removes memories below the effective importance threshold.
// Runs on `bower prune` or scheduled via `bower serve --prune-interval 24h`.
func (s *Service) PruneDecayed(threshold float64) (int, error) {
	now := time.Now()
	memories, err := s.db.ListAll()
	if err != nil {
		return 0, fmt.Errorf("list memories: %w", err)
	}
	pruned := 0
	for _, mem := range memories {
		effective := mem.EffectiveImportance(now)
		if effective >= threshold {
			continue
		}
		// Safety: never prune decisions or procedures
		if mem.Type == MemDecision || mem.Type == MemProcedure {
			continue
		}
		// Add audit log entry before deleting.
		// Note: audit_log.memory_id is declared ON DELETE CASCADE, so with
		// foreign_keys=ON this entry is removed together with the memory row.
		s.db.LogAudit(mem.ID, "pruned", map[string]any{
			"effective_importance": effective,
			"threshold":            threshold,
		})
		s.db.Delete(mem.ID)
		s.vectorIndex.Remove(mem.ID)
		pruned++
	}
	return pruned, nil
}
6.7 Memory Consolidation (Compression, v0.3)
// Consolidate merges groups of very similar memories into generalized patterns.
func (s *Service) Consolidate(ctx context.Context) (int, error) {
// 1. Cluster memories by embedding similarity (cosine > 0.90)
clusters := s.clusterBySimilarity(0.90)
// 2. For each cluster of 3+ similar memories, try to extract a pattern
merged := 0
for _, cluster := range clusters {
if len(cluster) < 3 {
continue
}
// Use LLM (fast model) to extract common pattern
pattern, err := s.extractPattern(ctx, cluster)
if err != nil {
continue
}
// Store the pattern, link original memories as examples
s.createPattern(pattern, cluster)
merged++
}
return merged, nil
}
7. RETRIEVAL PIPELINE
7.1 Multi-Stage Retrieval Architecture
Query String
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 1: QUERY ANALYSIS (1ms) │
│ - Extract intent keywords │
│ - Classify query type: fact/decision/procedure │
│ - Detect temporal signals ("last week", "recent") │
│ - Generate embedding for query │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 2: PARALLEL CANDIDATE RETRIEVAL (target: 30ms)│
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ BM25 │ │ Vector │ │ Graph │ │
│ │ (FTS5) │ │ (cosine)│ │ (1-hop) │ │
│ │ 200 │ │ 200 │ │ 50 │ │
│ │ cands │ │ cands │ │ cands │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ ~3ms │ ~15ms │ ~2ms │
└───────┼─────────────┼────────────┼──────────────────┘
│ │ │
└─────────────┼────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 3: FUSION (1ms) │
│ - Reciprocal Rank Fusion (RRF, k=60) │
│ - Weighted: BM25=0.4, Vector=0.5, Graph=0.1 │
│ - Apply importance + decay bonus │
│ - Deduplicate across retrieval sources │
│ - Output: Top-50 ranked candidates │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 4: RE-RANKING (optional, v0.2, 50ms) │
│ - Use Gemini 2.5 Flash to score top-20 │
│ - Cross-encoder style: "Rate relevance 1-5" │
│ - Only applied for ambiguous queries │
│ - Gate: skip if top result score > 0.80 │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ STAGE 5: CONTEXT BUDGETING (1ms) │
│ - Sort by final score, apply limit │
│ - Truncate content to budget per result │
│ - Ensure total context fits in window budget │
│ - Output: Final ranked list │
└──────────────────────┬──────────────────────────────┘
│
▼
Results JSON
7.2 Hybrid Search: RRF Formula
// ReciprocalRankFusion combines multiple ranked lists into one.
// k=60 is the standard parameter from the RRF paper.
func ReciprocalRankFusion(
bm25Results map[string]float64, // memoryID -> normalized BM25 score
vectorResults map[string]float64, // memoryID -> cosine similarity
graphResults map[string]float64, // memoryID -> graph boost score
bm25Weight, vectorWeight, graphWeight float64,
k int,
) []ScoredID {
scores := make(map[string]float64)
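// Score-based variant of RRF: each source adds weight / (k + floor((1 - score) * k)),
// so a score of 1.0 behaves like rank 0 and a score near 0 behaves like rank k.
// Scores are expected to be normalized into (0, 1].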
for id, score := range bm25Results {
if score > 0 {
scores[id] += bm25Weight / float64(k + int((1.0-score)*float64(k)))
}
}
for id, score := range vectorResults {
if score > 0 {
scores[id] += vectorWeight / float64(k + int((1.0-score)*float64(k)))
}
}
for id, boost := range graphResults {
scores[id] += graphWeight * boost * 0.01
}
// Convert to sorted slice
results := make([]ScoredID, 0, len(scores))
for id, score := range scores {
results = append(results, ScoredID{ID: id, Score: score})
}
sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
return results
}
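For comparison, the textbook form of weighted RRF works on rank positions rather than scores: each list contributes `weight / (k + rank)`, with ranks starting at 1. The sketch below is a minimal illustration of that form, not the function above; `rankedList` and `rankRRF` are hypothetical names introduced for the example, and `ScoredID` is redeclared only so the block compiles on its own.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredID mirrors the result type used above; redeclared so the sketch compiles alone.
type ScoredID struct {
	ID    string
	Score float64
}

// rankedList is a hypothetical input type: IDs ordered best-first, plus a list weight.
type rankedList struct {
	ids    []string
	weight float64
}

// rankRRF applies weighted Reciprocal Rank Fusion:
// score(d) = sum over lists of weight / (k + rank), with rank starting at 1.
func rankRRF(lists []rankedList, k int) []ScoredID {
	scores := make(map[string]float64)
	for _, l := range lists {
		for i, id := range l.ids {
			scores[id] += l.weight / float64(k+i+1)
		}
	}
	out := make([]ScoredID, 0, len(scores))
	for id, s := range scores {
		out = append(out, ScoredID{ID: id, Score: s})
	}
	// Break ties by ID so the output is deterministic despite map iteration order.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	fused := rankRRF([]rankedList{
		{ids: []string{"a", "b", "c"}, weight: 0.4}, // BM25 order
		{ids: []string{"b", "a", "d"}, weight: 0.5}, // vector order
	}, 60)
	for _, r := range fused {
		fmt.Printf("%s %.5f\n", r.ID, r.Score)
	}
}
```

With these inputs `b` ranks first: it is top of the higher-weighted vector list, which outweighs `a` being top of the BM25 list.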
7.3 BM25 Implementation (FTS5-specific)¶
// SearchFTS performs BM25-ranked full-text search via SQLite FTS5.
func (d *DB) SearchFTS(query string, limit int) ([]*types.Memory, error) {
// FTS5 query: escape special characters, support prefix queries with *
cleanQuery := sanitizeFTS5Query(query)
rows, err := d.db.Query(`
SELECT m.id, m.type, m.content, m.summary, m.importance, m.access_count,
m.create_time, m.access_time, m.decay_rate, m.tags, m.keywords, m.metadata,
bm25(memories_fts, 0.0, 10.0, 5.0) as bm25_score
FROM memories_fts f
JOIN memories m ON f.rowid = m.rowid
WHERE memories_fts MATCH ?
ORDER BY bm25_score
LIMIT ?
`, cleanQuery, limit)
if err != nil {
// FTS5 rejects some syntax; fall back to LIKE with importance sort
return d.searchFallback(query, limit)
}
defer rows.Close()
return scanMemoriesWithScore(rows)
}
// sanitizeFTS5Query strips FTS5 operator characters (it does not escape them)
// and adds prefix matching to each remaining term.
func sanitizeFTS5Query(q string) string {
// Remove characters that FTS5 treats as operators: ^ * " ( )
q = strings.NewReplacer(
"^", " ", "*", " ", "\"", " ", "(", " ", ")", " ",
).Replace(q)
// Split into terms, add prefix wildcard to each
terms := strings.Fields(q)
for i, t := range terms {
if len(t) > 2 && !strings.HasSuffix(t, "*") {
terms[i] = t + "*" // Prefix matching
}
}
return strings.Join(terms, " ")
}
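Stripping characters leaves bareword operators such as `AND`, `OR` and `NOT`, and punctuation such as `-`, to reach FTS5 unchanged. One alternative, sketched below, is to wrap every term in double quotes so FTS5 treats it as a string, and append `*` for prefix matching. This assumes the FTS5 string and prefix-query syntax described in the SQLite documentation; `quoteFTS5Terms` is a hypothetical helper, not part of the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteFTS5Terms is a hypothetical alternative to sanitizeFTS5Query.
// Each whitespace-separated term becomes an FTS5 string ("term"), so
// operator words and punctuation are matched as text, not parsed as syntax.
func quoteFTS5Terms(q string) string {
	var out []string
	for _, t := range strings.Fields(q) {
		// Drop embedded double quotes: a bare quote would end the string early.
		t = strings.ReplaceAll(t, `"`, "")
		if t == "" {
			continue
		}
		quoted := `"` + t + `"`
		if len([]rune(t)) > 2 {
			quoted += "*" // prefix match, same length threshold as the original
		}
		out = append(out, quoted)
	}
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(quoteFTS5Terms(`foo-bar AND "x"`))
}
```

The example prints `"foo-bar"* "AND"* "x"`: the operator word is quoted, and the one-character term gets no prefix wildcard.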
7.4 Graph-Aware Retrieval (1-Hop Expansion)¶
// expandGraph enriches top candidates with their graph neighbors.
func (e *Engine) expandGraph(topResults []ScoredID) []ScoredID {
expanded := make(map[string]*graphCand)
for _, result := range topResults[:min(5, len(topResults))] {
// Get outgoing relations (this memory CAUSES, INFORMS, etc.)
rels, err := e.db.GetRelations(result.ID)
if err != nil {
continue
}
for _, rel := range rels {
if _, exists := expanded[rel.TargetID]; !exists {
expanded[rel.TargetID] = &graphCand{
score: result.Score * rel.Strength * 0.3,
sourceID: result.ID,
relType: rel.Type,
}
}
}
}
// Convert to results
var out []ScoredID
for id, gc := range expanded {
out = append(out, ScoredID{
ID: id,
Score: gc.score,
})
}
return out
}
7.5 Re-Ranking Strategy (v0.2)¶
// RerankWithLLM uses a fast LLM to re-rank the top candidates.
// Only invoked when the score gap between #1 and #2 is < 0.15 (ambiguous query).
func (e *Engine) RerankWithLLM(
ctx context.Context,
query string,
candidates []*types.Memory,
) ([]*types.Memory, error) {
if len(candidates) <= 1 {
return candidates, nil
}
// Ambiguity gate (elided here): candidates carry no scores in this signature,
// so the check is assumed to use the pipeline's fused scores and to skip
// re-ranking when the gap between #1 and #2 exceeds 0.15.
// Build prompt: "Rate each passage's relevance to the query on a scale of 1-5"
prompt := buildRerankPrompt(query, candidates)
response, err := e.llmCaller.Call(ctx, prompt, ModelGeminiFlash2)
if err != nil {
return candidates, nil // Graceful degradation: return original order
}
scores := parseRerankScores(response, len(candidates))
sortByScore(candidates, scores)
return candidates, nil
}
func buildRerankPrompt(query string, candidates []*types.Memory) string {
var sb strings.Builder
sb.WriteString("Rate each passage's relevance to the query on a scale of 1-5.\n\n")
sb.WriteString(fmt.Sprintf("Query: %s\n\n", query))
for i, m := range candidates {
sb.WriteString(fmt.Sprintf("[%d] %s\n\n", i+1, truncate(m.Summary, 200)))
}
sb.WriteString("Output format: [N]:score (e.g., [1]:5, [2]:3, [3]:1)")
return sb.String()
}
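`parseRerankScores` is called above but not shown. A minimal sketch, assuming the model follows the `[N]:score` output format requested by the prompt and that unrated or out-of-range entries should default to 0 (both are assumptions for illustration):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rerankPair matches "[N]:score" pairs, tolerating spaces around the colon.
var rerankPair = regexp.MustCompile(`\[(\d+)\]\s*:\s*([1-5])`)

// parseRerankScores returns one score per candidate, indexed from 0.
// Candidates the model did not rate keep a score of 0.
func parseRerankScores(response string, n int) []float64 {
	scores := make([]float64, n)
	for _, m := range rerankPair.FindAllStringSubmatch(response, -1) {
		idx, err := strconv.Atoi(m[1])
		if err != nil || idx < 1 || idx > n {
			continue // ignore malformed or out-of-range indices
		}
		score, _ := strconv.ParseFloat(m[2], 64) // the regexp guarantees a digit 1-5
		scores[idx-1] = score
	}
	return scores
}

func main() {
	fmt.Println(parseRerankScores("[1]:5, [2]:3, [3]:1, [9]:4", 3))
}
```

The `[9]:4` entry is dropped because only three candidates were submitted, so the example prints `[5 3 1]`.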
7.6 Context Window Budgeting¶
// ContextBudget manages how much memory context to inject into the LLM window.
type ContextBudget struct {
MaxTokens int // Total token budget for memory context
MaxResults int // Maximum number of results
TokensPerResult int // Average tokens to allocate per result
}
func DefaultContextBudget() ContextBudget {
return ContextBudget{
MaxTokens: 2048, // ~10-15% of a typical 16K context window
MaxResults: 8,
TokensPerResult: 200,
}
}
// Allocate distributes the token budget across results, truncating as needed.
func (cb ContextBudget) Allocate(results []ScoredID, memLookup func(string) *Memory) []ContextItem {
budget := cb.MaxTokens
var items []ContextItem
for _, r := range results {
if len(items) >= cb.MaxResults || budget <= 0 {
break
}
mem := memLookup(r.ID)
if mem == nil {
continue
}
// Allocate tokens: give more to high-score results
allocation := min(cb.TokensPerResult, budget)
if r.Score > 0.8 {
allocation = min(cb.TokensPerResult*2, budget)
}
content := truncateToTokens(mem.Content, allocation)
items = append(items, ContextItem{
Memory: mem,
Content: content,
Score: r.Score,
})
budget -= allocation
}
return items
}
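`truncateToTokens` is referenced by `Allocate` but not defined in this section. The sketch below assumes a rough heuristic of four characters per token; that ratio is an illustrative assumption, not a measured property of any tokenizer, and a real implementation may count tokens exactly.

```go
package main

import "fmt"

// approxCharsPerToken is an assumed average for English text, used only for this sketch.
const approxCharsPerToken = 4

// truncateToTokens cuts s to roughly maxTokens tokens.
// It slices on rune boundaries so multi-byte characters are never split.
func truncateToTokens(s string, maxTokens int) string {
	if maxTokens <= 0 {
		return ""
	}
	runes := []rune(s)
	limit := maxTokens * approxCharsPerToken
	if len(runes) <= limit {
		return s
	}
	return string(runes[:limit])
}

func main() {
	fmt.Println(truncateToTokens("abcdefghij", 2)) // keeps the first 8 runes
}
```

As a worked example of `Allocate` with the defaults: eight results at 200 tokens each use 1,600 of the 2,048-token budget. If every result scores above 0.8, the first five take 400 tokens each (2,000 total), the sixth receives the remaining 48, and the loop stops because the budget is exhausted.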
8. DYNAMIC MODEL ROUTING¶
8.1 Model Tier Definitions¶
TIER 3: Gemini 2.5 Flash Lite (fastest, cheapest, lowest quality)
Use: Initial retrieval candidate generation, keyword extraction
Latency: ~100ms
Cost: $0.01875 / 1M input tokens
Limits: 4000 RPM
TIER 2: Gemini 2.5 Flash (balanced speed/quality)
Use: Memory curation analysis, re-ranking, importance classification
Latency: ~300ms
Cost: $0.15 / 1M input tokens
Limits: 2000 RPM
TIER 1: Gemini 2.5 Pro (highest quality, slowest)
Use: Causal relationship extraction, pattern synthesis, contradiction detection
Latency: ~800ms
Cost: $1.25 / 1M input tokens
Limits: 200 RPM
TIER 0: Local (none) (no LLM call needed)
Use: BM25 search, embedding, cosine similarity, importance scoring
8.2 Model Selection Logic¶
// RouteModel determines which LLM tier to use for a given operation.
func RouteModel(op Operation, complexity ComplexityScore) ModelTier {
switch {
case op.CanBeLocal():
return TierLocal // Skip LLM entirely
case op == OpKeywordExtraction || op == OpTypeClassification:
return TierFlashLite // Simple classification tasks
case op == OpMemoryCuration || op == OpReRanking:
return TierFlash // Needs reasoning but limited context
case op == OpCausalExtraction || op == OpPatternSynthesis:
return TierPro // Deep reasoning on large context
case complexity == ComplexityHigh:
return TierPro // Fall back to best model for hard problems
default:
return TierFlash // Safe default
}
}
type ComplexityScore int
const (
ComplexityLow ComplexityScore = iota // Simple factual query
ComplexityMedium // Multi-part or ambiguous query
ComplexityHigh // Requires deep reasoning
)
func AssessComplexity(query string, resultCount int, scoreSpread float64) ComplexityScore {
score := ComplexityLow
if len(strings.Fields(query)) > 10 {
score = ComplexityMedium
}
if resultCount > 20 && scoreSpread < 0.1 {
score = ComplexityHigh // Many results with similar scores = ambiguous
}
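// Match cue words case-insensitively ("Why ..." and "why ..." are equivalent).
if q := strings.ToLower(query); strings.Contains(q, "why") || strings.Contains(q, "explain") {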
score = ComplexityHigh
}
return score
}
8.3 Fallback Strategy¶
// CallLLM attempts a model call with tiered fallback.
// If every tier fails, the returned error wraps the last tier's error.
func CallLLM(ctx context.Context, prompt string, preferredTier ModelTier) (string, error) {
	tiers := fallbackOrder(preferredTier)
	var lastErr error
	for _, tier := range tiers {
		result, err := callWithTier(ctx, prompt, tier)
		if err == nil {
			return result, nil
		}
		// Log the failure and continue to the next tier.
		lastErr = err
		log.Printf("LLM tier %v failed: %v, falling back", tier, err)
	}
	return "", fmt.Errorf("all LLM tiers exhausted: %w", lastErr)
}
func fallbackOrder(preferred ModelTier) []ModelTier {
switch preferred {
case TierLocal:
return []ModelTier{TierLocal}
case TierFlashLite:
return []ModelTier{TierFlashLite, TierFlash, TierPro}
case TierFlash:
return []ModelTier{TierFlash, TierFlashLite, TierPro}
case TierPro:
return []ModelTier{TierPro, TierFlash, TierFlashLite}
default:
return []ModelTier{TierFlashLite, TierFlash, TierPro}
}
}
8.4 When NOT To Call Any LLM¶
Operations that are always model-free:
- BM25 full-text search (FTS5 handles it)
- Vector similarity search (pure math)
- RRF score fusion (pure math)
- Importance scoring (heuristic, not LLM-based by default)
- Temporal decay (pure math)
- Embedding generation (embedding model, not LLM)

Operations that optionally use LLM:
- Memory type classification (heuristic first, LLM if unclear)
- Keyword extraction (TF-IDF first, LLM for quality boost)
- Re-ranking (only when score gap is ambiguous)
- Content summarization (extractive first, LLM for quality)

Operations that always use LLM (v0.3+):
- Causal relationship extraction
- Pattern synthesis from clusters
- Contradiction detection between memories
9. API DESIGN¶
9.1 Hermes MemoryProvider Contract¶
# plugins/hermes/provider.py
# This is the canonical implementation that all Hermes agents use.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import json, subprocess, os
@dataclass
class Memory:
    id: str
    content: str
    summary: str
    mem_type: str  # fact|pattern|decision|procedure|context
    importance: float
    score: float  # Relevance score from query
    metadata: dict = field(default_factory=dict)


class RetrieverProvider:
    """
    Hermes MemoryProvider implementation backed by the `rv` Go binary.
    Communicates via subprocess (stdout JSON), same contract as ByteRover's `brv`.
    """

    def __init__(self, binary_path: str = "rv", db_path: Optional[str] = None):
        self.binary = binary_path
        self.db_path = db_path
        self._verify_binary()

    def _verify_binary(self):
        """Fail fast if the `rv` binary is missing or `rv status` fails."""
        result = subprocess.run(
            [self.binary, "status"],
            capture_output=True, text=True, timeout=5
        )
        if result.returncode != 0:
            raise RuntimeError(f"rv binary not functional: {result.stderr}")

    def _run(self, *args: str, timeout: int = 10) -> dict:
        """Execute an `rv` command and return the parsed JSON `data` field."""
        cmd = [self.binary] + list(args)
        if self.db_path:
            cmd.extend(["--db-path", self.db_path])
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        if result.returncode != 0:
            raise RuntimeError(f"rv failed: {result.stderr}")
        data = json.loads(result.stdout)
        if not data.get("success", False):
            raise RuntimeError(f"rv error: {data.get('data', {}).get('error', 'unknown')}")
        return data["data"]

    # ── MemoryProvider ABC methods ──────────────────────────────────

    def prefetch(self, query: str, limit: int = 5) -> list[Memory]:
        """
        Called BEFORE each LLM turn. Returns context to inject.
        Uses hybrid search (BM25 + vector + graph).
        """
        data = self._run("query", query, "--limit", str(limit))
        return [self._to_memory(r["memory"], r["score"]) for r in data["results"]]

    def sync_turn(self, messages: list[dict]) -> None:
        """
        Called AFTER each LLM turn. Extracts and stores memories.
        Messages is a list of {"role": "...", "content": "..."} dicts.
        """
        # Curate only assistant responses longer than 100 characters
        assistant_texts = [
            m["content"] for m in messages
            if m.get("role") == "assistant" and len(m.get("content", "")) > 100
        ]
        for text in assistant_texts:
            self._run("curate", text)

    def query(self, query: str, limit: int = 10) -> list[Memory]:
        """Explicit memory search (used by agent tools)."""
        return self.prefetch(query, limit)

    def curate(self, content: str, mem_type: str = "", tags: Optional[list[str]] = None) -> dict:
        """Explicit memory storage."""
        args = ["curate", content]
        if mem_type:
            args.extend(["--type", mem_type])
        if tags:
            args.extend(["--tags", ",".join(tags)])
        return self._run(*args)

    def status(self) -> dict:
        """System statistics."""
        return self._run("status")

    # ── Internal helpers ────────────────────────────────────────────

    def _to_memory(self, mem_data: dict, score: float) -> Memory:
        return Memory(
            id=mem_data["id"],
            content=mem_data["content"],
            summary=mem_data.get("summary", ""),
            mem_type=mem_data.get("type", "fact"),
            importance=mem_data.get("importance", 0.5),
            score=score,
            metadata=mem_data.get("metadata", {}),
        )
9.2 CLI Interface Specification¶
Command format:
Standard output envelope (all commands):
Error output envelope:
{
"command": "query",
"success": false,
"data": {
"error": "descriptive error message",
"status": "error"
}
}
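On the Go side, both envelopes can be produced from one struct. The sketch below is illustrative only: `Envelope`, `okEnvelope`, `errorEnvelope` and `encode` are hypothetical names, and the success shape is assumed to mirror the error envelope with `success` set to `true`, which is what `_run` in section 9.1 checks before returning `data`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope is a hypothetical struct matching the JSON envelope fields shown above.
type Envelope struct {
	Command string `json:"command"`
	Success bool   `json:"success"`
	Data    any    `json:"data"`
}

// okEnvelope wraps a command's payload in a success envelope (assumed shape).
func okEnvelope(command string, data any) Envelope {
	return Envelope{Command: command, Success: true, Data: data}
}

// errorEnvelope builds the error envelope shown above from a message.
func errorEnvelope(command, msg string) Envelope {
	return Envelope{
		Command: command,
		Success: false,
		Data:    map[string]string{"error": msg, "status": "error"},
	}
}

// encode marshals an envelope to a single JSON line; it returns "" on failure.
func encode(e Envelope) string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(okEnvelope("status", map[string]int{"total_memories": 3})))
	fmt.Println(encode(errorEnvelope("query", "descriptive error message")))
}
```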
Commands:
| Command | Arguments | Options | Output (data field) |
|---|---|---|---|
| `bower query <text>` | Query text | `--limit N` (default 10, max 50), `--fast` (BM25 only), `--format json\|text` | SearchResponse |
| `bower curate <text>` | Content text | `--type fact\|pattern\|decision\|procedure\|context`, `--tags tag1,tag2` | CurateResponse |
| `bower status` | None | None | StatusResponse |
| `bower prune [--threshold 0.05]` | None | `--threshold N` (default 0.05), `--dry-run` | `{pruned: N, remaining: M}` |
| `bower consolidate` | None | `--similarity 0.90` | `{merged: N, patterns_created: M}` |
| `bower warm` | None | `--all`, `--recent N` | `{newly_cached: N, total: M}` |
| `bower bench <type>` | `latency\|recall\|resources` | `--corpus-size N`, `--iterations N` | Benchmark results |
| `bower serve` | None | `--port N` (default 8787), `--prune-interval 24h` | HTTP daemon startup |
| `bower mcp` | None | None | MCP STDIO server |
| `bower migrate from-brv` | Source path | `--path /path/to/brv/store` | `{migrated: N, errors: []}` |
| `bower version` | None | None | `{version: "0.1.0", commit: "abc123"}` |
9.3 HTTP API (v0.2+)¶
POST /api/v1/query
Request: {"query": "...", "limit": 10, "fast": false}
Response: SearchResponse
POST /api/v1/curate
Request: {"content": "...", "type": "fact", "tags": ["..."], "metadata": {}}
Response: CurateResponse
GET /api/v1/status
Response: StatusResponse
POST /api/v1/prune
Request: {"threshold": 0.05, "dry_run": false}
Response: {"pruned": N, "remaining": M}
GET /health
Response: {"status": "ok", "version": "0.1.0"}
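As an illustration of the smallest endpoint, the sketch below implements `GET /health` with the standard library. The handler name and the `probe` helper are hypothetical; only the route and the response fields come from the specification above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// healthHandler returns a handler for GET /health (hypothetical name).
func healthHandler(version string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		// Encode writes the JSON object followed by a newline.
		json.NewEncoder(w).Encode(map[string]string{"status": "ok", "version": version})
	}
}

// probe calls the handler in-process and returns the status code and trimmed body.
func probe(method string) (int, string) {
	rec := httptest.NewRecorder()
	healthHandler("0.1.0")(rec, httptest.NewRequest(method, "/health", nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	code, body := probe(http.MethodGet)
	fmt.Println(code, body)
}
```

In a daemon this handler would be registered with `http.HandleFunc("/health", healthHandler(version))` before calling `http.ListenAndServe`.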
9.4 MCP Server (v0.2+)¶
Tools exposed via MCP STDIO transport:
| Tool Name | Description | Parameters |
|---|---|---|
| `memory_search` | Search memories | `query` (string), `limit` (int, optional, default 10) |
| `memory_store` | Store a memory | `content` (string), `type` (string, optional), `tags` (array, optional) |
| `memory_status` | Get system stats | None |
| `memory_prune` | Prune decayed memories | `threshold` (float, optional, default 0.05) |
| `memory_consolidate` | Merge similar memories | None |
STDIO message format: JSON-RPC 2.0, same as all MCP servers.
10. PERFORMANCE TARGETS¶
10.1 Latency Budget Per Operation¶
OPERATION TARGET p50 p95 p99
──────────────────────────────────────────────────────────────────────
CLI startup (cold) <15ms 10ms 15ms 20ms
CLI startup (warm, after first) <10ms 6ms 10ms 12ms
bower query (cached embedding) <50ms 25ms 45ms 75ms
bower query (API embedding, cache miss) <100ms 60ms 85ms 120ms
bower query --fast (BM25 only) <15ms 6ms 12ms 18ms
bower curate (new memory) <100ms 55ms 90ms 150ms
bower curate (duplicate, merge) <120ms 65ms 100ms 180ms
bower status <5ms 2ms 4ms 8ms
bower prune (100K corpus) <500ms 350ms 500ms 800ms
bower consolidate (100K, 90% similarity) <2s 1.2s 1.8s 3s
EMBEDDING (single)
Gemini API (network warm) ~50ms 45ms 65ms 100ms
ONNX all-MiniLM-L6-v2 (384d) ~5ms 3ms 6ms 8ms
ONNX bge-base-en-v1.5 (768d) ~15ms 12ms 18ms 22ms
VECTOR SEARCH (brute-force, 8 shards)
1K vectors <1ms 0.5ms 1ms 1.5ms
10K vectors <5ms 2ms 4ms 6ms
50K vectors <15ms 8ms 12ms 18ms
100K vectors <30ms 15ms 25ms 35ms
500K vectors (int8) <50ms 35ms 48ms 60ms
BM25 SEARCH (FTS5)
Any corpus size <5ms 2ms 4ms 6ms
RRF FUSION <1ms 0.3ms 0.5ms 1ms
GRAPH EXPANSION (1-hop, top-5) <5ms 2ms 4ms 6ms
LLM RE-RANKING (Gemini 2.5 Flash) ~400ms 350ms 500ms 800ms
10.2 Memory Usage Targets¶
COMPONENT <10K corpus <100K corpus <500K corpus
─────────────────────────────────────────────────────────────────────────
Go runtime + heap ~20 MB ~30 MB ~50 MB
SQLite page cache (configured) ~32 MB ~32 MB ~32 MB
Vector index (float32) ~27 MB ~270 MB ~1.3 GB
Vector index (int8, v0.2) ~7 MB ~68 MB ~340 MB
Embedding cache (in-memory LRU) ~30 MB ~30 MB ~30 MB
Graph adjacency (estimated) ~5 MB ~50 MB ~250 MB
─────────────────────────────────────────────────────────────────────────
TOTAL RAM (float32) ~114 MB ~412 MB ~1.7 GB
TOTAL RAM (int8, v0.2) ~94 MB ~210 MB ~700 MB
10.3 Disk Usage Targets¶
COMPONENT PER-MEMORY <10K corpus <100K corpus
──────────────────────────────────────────────────────────────────────────
memories table (row) ~500 B avg
embeddings table (768d float32) ~3.1 KB
embeddings table (384d float32) ~1.6 KB
FTS5 index ~30% overhead
relations (per edge) ~80 B
embedding_cache (per entry) ~3.1 KB
audit_log ~100 B per event
──────────────────────────────────────────────────────────────────────────
TOTAL (.db file, 10K, 768d) ~40 MB
TOTAL (.db file, 100K, 768d) ~350 MB
TOTAL (.db file, 500K, 768d) ~1.7 GB
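The per-embedding figure follows from the encoding: a float32 component takes 4 bytes, so a 768-dimension vector is 768 × 4 = 3,072 bytes before row overhead. The sketch below shows one way to do the float32-to-bytes conversion that `vector_utils.go` is described as providing; the little-endian layout and the function names are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector packs a float32 slice into a BLOB, 4 bytes per component.
// Little-endian byte order is an assumption of this sketch.
func encodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// decodeVector reverses encodeVector; trailing bytes that do not form
// a full 4-byte component are ignored.
func decodeVector(b []byte) []float32 {
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v
}

func main() {
	blob := encodeVector(make([]float32, 768))
	fmt.Println(len(blob)) // 768 components * 4 bytes
}
```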
10.4 Concurrency Model¶
ARCHITECTURE: Single-writer, multi-reader
- SQLite: 1 write connection (MaxOpenConns=1), unlimited readers via WAL
- Vector index: sync.RWMutex (many parallel reads, exclusive write)
- Embedding cache: sync.RWMutex (many parallel reads, exclusive write)
- CLI: One process per invocation (no intra-process concurrency needed)
- HTTP daemon: goroutine per request, serialized by SQLite write lock
GOROUTINE POOL:
- Vector search shards: GOMAXPROCS (typically 4-16)
- Embedding API calls: semaphore of 10 goroutines max
- Background tasks (prune, warm): 1 goroutine each
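The "semaphore of 10 goroutines" for embedding calls can be expressed with a buffered channel. The sketch below is a minimal illustration under that assumption; `embedAll` and the stand-in `embed` function are hypothetical, and error handling is omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// embedAll runs embed over texts with at most maxInFlight concurrent calls.
// Results are written by index, so output order matches input order.
func embedAll(texts []string, maxInFlight int, embed func(string) []float32) [][]float32 {
	out := make([][]float32, len(texts))
	sem := make(chan struct{}, maxInFlight) // buffered channel used as a counting semaphore
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight calls are already running
		go func(i int, t string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the call finishes
			out[i] = embed(t)
		}(i, t)
	}
	wg.Wait()
	return out
}

func main() {
	// Stand-in embedder: a one-component vector holding the text length.
	fake := func(s string) []float32 { return []float32{float32(len(s))} }
	fmt.Println(embedAll([]string{"a", "bb", "ccc"}, 10, fake))
}
```

Each goroutine writes to a distinct slice index, so no mutex is needed around `out`.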
11. PROJECT STRUCTURE¶
11.1 Monorepo Layout¶
retriever/
├── cmd/
│ └── rv/ # CLI binary (single binary for all modes)
│ ├── main.go # Entry point, command dispatch
│ ├── main_test.go
│ ├── query.go # bower query handler
│ ├── curate.go # bower curate handler
│ ├── status.go # bower status handler
│ ├── serve.go # bower serve HTTP daemon (v0.2)
│ ├── mcp.go # bower mcp STDIO server (v0.2)
│ ├── prune.go # bower prune command
│ ├── consolidate.go # bower consolidate command (v0.3)
│ ├── warm.go # bower warm command (v0.2)
│ ├── bench.go # bower bench commands (v0.2)
│ ├── migrate.go # bower migrate from-brv (v0.5)
│ └── version.go # bower version
│
├── pkg/
│ ├── types/ # Core data types (zero internal deps)
│ │ ├── types.go # Memory, SearchResult, CurateRequest, etc.
│ │ └── types_test.go
│ │
│ ├── config/ # Configuration management
│ │ ├── config.go # Config struct, Load/Save, defaults
│ │ └── config_test.go
│ │
│ ├── storage/ # SQLite persistence layer
│ │ ├── db.go # DB struct, Open, Close, migration
│ │ ├── memories.go # Memory CRUD operations
│ │ ├── embeddings.go # Embedding BLOB storage/retrieval
│ │ ├── relations.go # Graph edge CRUD
│ │ ├── fts.go # FTS5 search, query sanitization
│ │ ├── audit.go # Audit log operations (v0.2)
│ │ ├── migration.go # Schema versioning and migrations
│ │ ├── vector_utils.go # float32<->bytes conversion, cosine
│ │ └── storage_test.go
│ │
│ ├── embedding/ # Embedding generation pipeline
│ │ ├── embedder.go # Embedder interface + Cache interface
│ │ ├── gemini.go # Gemini API embedder
│ │ ├── openai.go # OpenAI API embedder (v0.2)
│ │ ├── voyage.go # Voyage API embedder (v0.2)
│ │ ├── onnx.go # Local ONNX embedder (v0.2)
│ │ ├── cache.go # InMemoryCache (LRU)
│ │ ├── cache_persist.go # SQLite-backed persistent cache (v0.2)
│ │ ├── provider.go # Auto-selection logic
│ │ └── embedding_test.go
│ │
│ ├── search/ # Search engine
│ │ ├── engine.go # Engine struct, Search, FastSearch
│ │ ├── bm25.go # FTS5 BM25 search wrapper
│ │ ├── vector_index.go # In-memory vector index
│ │ ├── vector_int8.go # int8-quantized index (v0.2)
│ │ ├── vector_hnsw.go # HNSW index (v0.4)
│ │ ├── hybrid.go # RRF fusion, multi-source merge
│ │ ├── rerank.go # LLM re-ranking (v0.2)
│ │ ├── graph_expansion.go # 1-hop neighbor retrieval
│ │ ├── context_budget.go # Context window allocation
│ │ ├── query_analysis.go # Intent extraction, type classification
│ │ └── search_test.go
│ │
│ ├── memory/ # High-level memory service
│ │ ├── memory.go # Type re-exports (backward compat)
│ │ ├── service.go # Service: Curate, Query, Status
│ │ ├── importance.go # Importance scoring algorithm
│ │ ├── decay.go # Temporal decay model
│ │ ├── merge.go # Duplicate detection + merge logic
│ │ ├── consolidation.go # Semantic compression (v0.3)
│ │ ├── conflict.go # Contradiction resolution (v0.3)
│ │ ├── types_classifier.go # Heuristic memory type classification
│ │ ├── keywords.go # Keyword extraction
│ │ └── service_test.go
│ │
│ ├── graph/ # Graph operations (v0.2+)
│ │ ├── causal.go # Causal chain extraction (v0.3)
│ │ ├── traversal.go # Multi-hop path traversal
│ │ ├── pattern.go # Pattern detection across graphs (v0.3)
│ │ └── graph_test.go
│ │
│ ├── model/ # LLM model routing (v0.2+)
│ │ ├── router.go # Model tier selection
│ │ ├── caller.go # LLM call abstraction
│ │ ├── fallback.go # Tiered fallback strategy
│ │ └── router_test.go
│ │
│ ├── mcp/ # MCP server (v0.2)
│ │ ├── server.go # STDIO MCP server
│ │ ├── tools.go # Tool definitions
│ │ ├── handlers.go # Tool handlers
│ │ └── mcp_test.go
│ │
│ └── http/ # HTTP server (v0.2)
│ ├── server.go # HTTP daemon setup
│ ├── handlers.go # Route handlers
│ ├── middleware.go # Logging, recovery, CORS
│ └── http_test.go
│
├── plugins/
│ └── hermes/ # Hermes MemoryProvider plugin
│ ├── __init__.py
│ ├── provider.py # RetrieverProvider implementing MemoryProvider ABC
│ ├── pyproject.toml # Python package metadata
│ ├── plugin.yaml # Hermes plugin manifest
│ └── tests/
│ ├── __init__.py
│ ├── test_provider.py
│ └── conftest.py
│
├── benchmarks/ # Performance benchmark suite
│ ├── latency_test.go # End-to-end latency benchmarks
│ ├── recall_test.go # Retrieval quality benchmarks
│ ├── resources_test.go # Memory/disk usage benchmarks
│ ├── concurrency_test.go # Concurrent access benchmarks
│ └── fixtures/ # Test corpora
│ ├── small_1k.json
│ ├── medium_10k.json
│ └── large_100k.json (generated)
│
├── scripts/ # Build and development scripts
│ ├── install.sh # One-line installer: curl ... | sh
│ ├── download_models.sh # Download ONNX models for offline use
│ ├── generate_corpus.go # Synthetic corpus generator
│ └── release.sh # Cross-compile + create GitHub release
│
├── docs/ # Documentation
│ ├── README.md # Project overview, quickstart
│ ├── INSTALL.md # Installation instructions
│ ├── CLI.md # CLI reference
│ ├── HERMES.md # Hermes integration guide
│ ├── MCP.md # MCP server setup (v0.2)
│ └── CONTRIBUTING.md # Development setup, PR process
│
├── go.mod # Go module definition
├── go.sum # Go dependency checksums
├── Makefile # Build targets
├── ARCHITECTURE.md # This file
├── .gitignore
└── LICENSE # MIT
11.2 Package Dependency Rules¶
Dependency direction (top-down, NO cycles):
cmd/rv
└── pkg/memory
├── pkg/search
│ ├── pkg/storage
│ │ └── pkg/types <-- leaf package, no internal deps
│ ├── pkg/embedding
│ │ └── pkg/types
│ └── pkg/graph
│ ├── pkg/storage
│ └── pkg/types
├── pkg/config
│ └── pkg/types
└── pkg/model
└── pkg/types
Rule: pkg/types MUST NOT import any other retriever package.
Rule: pkg/config MUST NOT import pkg/storage or pkg/embedding.
Rule: packages import only what they directly need (no transitive convenience imports).
11.3 Build System¶
Makefile targets:
.PHONY: build test bench clean install dist lint vet smoke-test
build: go build -ldflags="-s -w" -o build/rv ./cmd/rv
build-fast: go build -o build/rv ./cmd/rv # No strip, for dev iteration
install: go install ./cmd/rv # Install to $GOPATH/bin
test: go test -race -count=1 ./... # All tests with race detector
bench: go test -bench=. -benchmem ./... # All benchmarks
clean: rm -rf build/ && go clean -cache
dist: # Cross-compile for release
GOOS=linux GOARCH=amd64 go build -ldflags="-s -w" -o build/rv-linux-amd64 ./cmd/rv
GOOS=linux GOARCH=arm64 go build -ldflags="-s -w" -o build/rv-linux-arm64 ./cmd/rv
GOOS=darwin GOARCH=amd64 go build -ldflags="-s -w" -o build/rv-darwin-amd64 ./cmd/rv
GOOS=darwin GOARCH=arm64 go build -ldflags="-s -w" -o build/rv-darwin-arm64 ./cmd/rv
lint: golangci-lint run ./...
vet: go vet ./...
fmt: go fmt ./...
deps: go mod download && go mod tidy
smoke-test: build && build/rv status
coverage: go test -coverprofile=coverage.out ./... && go tool cover -html=coverage.out
CI pipeline (GitHub Actions):
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.26' }
      - run: go test -race -count=1 ./...
      - run: go test -coverprofile=coverage.out ./...
      # Fail when total coverage is below 80%. awk compares numerically;
      # `test -ge` only accepts integers and breaks on values like 82.4.
      - run: go tool cover -func=coverage.out | awk '/^total:/ { exit (($3 + 0) < 80) }'
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.26' }
      - run: go test -bench=. -benchtime=1s ./...
      - name: Check latency
        run: |
          go build -o build/rv ./cmd/rv
          build/rv bench latency --corpus-size 1000 --iterations 50
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: golangci/golangci-lint-action@v6
11.4 Go Module¶
module github.com/retriever/memory
go 1.26.3
require (
github.com/mattn/go-sqlite3 v1.14.44 // SQLite driver (CGo)
)
Current dependencies (v1.0):
Future dependencies (when features land):
v0.2: TBD HNSW library or chromem-go integration // For >500K vector corpora
v0.3: github.com/schollz/progressbar // Progress bars for warm/consolidate
APPENDIX A: Type Definitions (Canonical Reference)¶
// pkg/types/types.go
package types
import "time"
type MemType string
const (
MemFact MemType = "fact" // Isolated piece of information
MemPattern MemType = "pattern" // Recurring observation or convention
MemDecision MemType = "decision" // Choice made with rationale
MemProcedure MemType = "procedure" // How to accomplish something
MemContext MemType = "context" // Situational background
)
type Memory struct {
ID string `json:"id"`
Type MemType `json:"type"`
Content string `json:"content"`
Summary string `json:"summary"`
Embedding []float32 `json:"embedding,omitempty"`
Importance float64 `json:"importance"`
AccessCount int `json:"access_count"`
CreateTime time.Time `json:"create_time"`
AccessTime time.Time `json:"access_time"`
DecayRate float64 `json:"decay_rate"`
SourceConvID string `json:"source_conv_id,omitempty"`
SupersedesID string `json:"supersedes_id,omitempty"`
Confidence float64 `json:"confidence"`
Tags []string `json:"tags"`
Keywords []string `json:"keywords"`
Relations []Relation `json:"relations,omitempty"`
Metadata map[string]string `json:"metadata,omitempty"`
}
type Relation struct {
TargetID string `json:"target_id"`
Type string `json:"type"` // causes|informs|contradicts|supersedes|example_of|prerequisite_for|led_to|related_to
Strength float64 `json:"strength"` // [0.0, 1.0]
}
type SearchResult struct {
Memory Memory `json:"memory"`
Score float64 `json:"score"`
BM25 float64 `json:"bm25,omitempty"`
Semantic float64 `json:"semantic,omitempty"`
Graph float64 `json:"graph,omitempty"`
}
type SearchResponse struct {
Query string `json:"query"`
Results []SearchResult `json:"results"`
TotalFound int `json:"total_found"`
TookMs float64 `json:"took_ms"`
ModelUsed string `json:"model_used,omitempty"`
}
type CurateRequest struct {
Content string `json:"content"`
Type MemType `json:"type,omitempty"`
Tags []string `json:"tags,omitempty"`
Keywords []string `json:"keywords,omitempty"`
Metadata map[string]string `json:"metadata,omitempty"`
}
type CurateResponse struct {
ID string `json:"id"`
Summary string `json:"summary"`
Type MemType `json:"type"`
TookMs float64 `json:"took_ms"`
IsUpdate bool `json:"is_update"`
}
type StatusResponse struct {
TotalMemories int `json:"total_memories"`
TotalSizeBytes int64 `json:"total_size_bytes"`
ByType map[MemType]int `json:"by_type"`
AvgImportance float64 `json:"avg_importance"`
IndexHealthy bool `json:"index_healthy"`
DBPath string `json:"db_path"`
VectorIndexTier string `json:"vector_index_tier,omitempty"`
}
APPENDIX B: Embedder Interface (Canonical Reference)¶
// pkg/embedding/embedder.go
package embedding
import "context"
type Embedder interface {
Embed(ctx context.Context, text string) ([]float32, error)
EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
Dimensions() int
ModelName() string
IsLocal() bool // Returns true for ONNX, false for API-based
}
type EmbeddingCache interface {
Get(hash string) ([]float32, bool)
Set(hash string, embedding []float32)
Size() int
Persist() error // Flush to SQLite if persistent (v0.2)
}
APPENDIX C: Version Compatibility Matrix¶
| Component | Minimum Version | Recommended | Notes |
|---|---|---|---|
| Go | 1.26.0 | 1.26.3 | Uses new range-over-func and iter patterns |
| SQLite | 3.35.0 | 3.44.0+ | FTS5 triggers, WAL mode, RETURNING clause |
| Python (Hermes plugin) | 3.10 | 3.12+ | Uses match/case, subprocess with capture_output |
| Gemini API | v1beta | v1beta | text-embedding-004 model |
| ONNX Runtime | 1.17.0 | 1.19.0+ | For local embedding fallback |
This is a living document. Every design choice has a rationale. When the rationale changes, update the document. When benchmarks contradict our assumptions, update the design. The architecture serves the latency budget -- not the other way around.