
bowerbird — architecture

Status: ✅ Implemented (8,629 lines across Go + Python). Phases 1-3 complete.
Date: 2026-06-02
Purpose: This document describes the architecture AS BUILT. The implementation matches this specification. See README.md for usage and quick start.


1. LANGUAGE CHOICE

1.1 Decision: Go (core) + Python (Hermes plugin)

Principle: Each language does what it does best.

Component                                            Language      Rationale
──────────────────────────────────────────────────────────────────────────────────
Core engine (storage, search, embedding, lifecycle)  Go 1.26       Single binary, 5-15ms startup, goroutines, zero-runtime deployment
Hermes MemoryProvider plugin                         Python 3.11+  Hermes is Python; native abc.ABC subclass, no FFI
Optional TypeScript CLI wrapper                      TypeScript    Only if Node.js ecosystem integration is needed; Go CLI is primary
Vector SIMD acceleration (optional)                  C via CGo     Only if Go scalar vector math becomes the bottleneck at >500K vectors

1.2 Why Go Is The Correct Core Language

Startup time:    5-15ms    (vs 200-500ms Node, 50-100ms Python)
Concurrency:     Goroutines with work-stealing scheduler -- trivial fan-out
Deployment:      Single statically-linked binary, `curl ... | sh` installable
Memory:          No GC pauses >1ms with GOGC tuning
Ecosystem:       mature sqlite3 driver (mattn/go-sqlite3), CGo for SIMD escape hatch
Dev velocity:    2x faster than Rust for mutation-heavy memory lifecycle code

1.3 When To Introduce A Second Language

Hard gates, not preferences:

  1. Python: Only for plugins/hermes/. Never in the hot path. Python talks to Go via subprocess.run(["rv", "query", text]) -- the exact same contract ByteRover uses.
  2. C/Rust via CGo: Only when p95 vector search latency exceeds 50ms at corpus >500K. The Embedder and VectorIndex interfaces are already swappable. Do not introduce CGo before this threshold is crossed.
  3. TypeScript: Only if a TypeScript SDK is needed for npm ecosystem consumers. The Go CLI handles all operational use.

2. CORE ARCHITECTURE

2.1 Process Model: Embedded Library with CLI Frontend

Retriever is not a daemon. It is a library with a CLI frontend. This is the fundamental architectural constraint that differentiates it from Qdrant, ChromaDB, and other server-first memory systems.

                    ┌──────────────────────┐
                    │   bower (Go binary)     │
                    │                      │
  Hermes ──subprocess──►  cmd/rv/main.go  │
                    │       │              │
  Claude Code ──MCP──►  cmd/rv/mcp.go    │
                    │       │              │
  Web dashboard ──HTTP─►  cmd/rv/serve.go │
                    │       │              │
                    │   ◄───pkg/───────────┤
                    │   memory.Service     │
                    │   search.Engine      │
                    │   storage.DB         │
                    │   embedding.Embedder │
                    └──────────────────────┘

Three modes of operation:

Mode          Invocation          Use Case
──────────────────────────────────────────────────────────────────────────────────
One-shot CLI  bower query "text"  Hermes subprocess, shell pipelines, scripts
STDIO MCP     bower mcp           Claude Code, Cursor, any MCP host
HTTP daemon   bower serve         Web dashboards, multi-client, persistent

Startup sequence (one-shot CLI):

T+0ms:   Process start (kernel ELF load)
T+2ms:   main() parses args, loads config from ~/.retriever/config.json
T+5ms:   storage.Open() -- sqlite3 with WAL, ~32MB cache
T+8ms:   Load vector index from SQLite embeddings table into RAM (~293 MB at 100K x 768D float32)
T+12ms:  GeminiEmbedder initialized (validates API key, no network call yet)
T+15ms:  Ready to serve query

Why not a daemon: Daemons require process supervision, port management, and add failure modes (is it running? which port? stale PID file?). ByteRover tried this and it was a pain point. The one-shot model with lazy initialization is simpler, more reliable, and still hits the sub-50ms latency target.

When to use bower serve (HTTP daemon): Only when you need concurrent multi-client access or want to avoid the ~15ms startup cost on every query. The daemon loads the vector index once and keeps it in memory, so later queries skip that startup overhead.

2.2 Communication with Hermes Agent

Hermes communicates with Retriever via the exact same subprocess contract that ByteRover (brv) uses. This is deliberate: drop-in replacement, zero Hermes code changes.

Hermes MemoryProvider contract:

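# plugins/hermes/provider.py
import json
import subprocess

# MemoryProvider, Memory, and Message come from the Hermes plugin API.

class RetrieverProvider(MemoryProvider):
    """
    Implements Hermes MemoryProvider ABC.
    Communicates with bower binary via subprocess.
    """

    def prefetch(self, query: str) -> list[Memory]:
        """Called before each LLM turn. Returns context to inject."""
        try:
            result = subprocess.run(
                ["rv", "query", query, "--limit", "5"],
                capture_output=True, text=True, timeout=10
            )
            data = json.loads(result.stdout)
        except (OSError, subprocess.TimeoutExpired, json.JSONDecodeError):
            return []  # Binary missing, timed out, or printed invalid JSON
        if result.returncode != 0 or not data.get("success"):
            return []
        return [self._to_memory(r) for r in data["data"]["results"]]

    def sync_turn(self, messages: list[Message]) -> None:
        """Called after each LLM turn. Extracts and stores memories."""
        text = self._extract_curatable_content(messages)
        if text:
            try:
                subprocess.run(
                    ["rv", "curate", text],
                    capture_output=True, text=True, timeout=10
                )
            except (OSError, subprocess.TimeoutExpired):
                pass  # Curation is best-effort; never fail the turn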

Key integration points:

  1. prefetch(query) -> list[Memory]: Called BEFORE the LLM sees the next user message. This is where predictive prefetching lives (v0.3).
  2. sync_turn(messages) -> None: Called AFTER the LLM responds. This is where memory extraction and curation happen.
  3. JSON over stdout: All CLI commands output a uniform JSON envelope: {"command": "query", "success": true, "data": {...}}
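
A consumer written in Go could decode that envelope along the following lines. This is a minimal sketch: only the "command", "success", and "data" keys and the "results" array come from the text above, while the QueryResult fields (id, content, score) and the DecodeQuery helper are hypothetical names chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope mirrors the uniform CLI output shape. Keeping Data as raw JSON is
// an assumption that lets each command decode its own payload type.
type Envelope struct {
	Command string          `json:"command"`
	Success bool            `json:"success"`
	Data    json.RawMessage `json:"data"`
}

// QueryData is a hypothetical payload for the "query" command. The "results"
// key matches what the Hermes provider reads; the result fields are assumed.
type QueryData struct {
	Results []QueryResult `json:"results"`
}

type QueryResult struct {
	ID      string  `json:"id"`
	Content string  `json:"content"`
	Score   float64 `json:"score"`
}

// DecodeQuery parses an envelope and, if it reports success, its query payload.
func DecodeQuery(raw []byte) (QueryData, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return QueryData{}, fmt.Errorf("decode envelope: %w", err)
	}
	if !env.Success {
		return QueryData{}, fmt.Errorf("command %q reported failure", env.Command)
	}
	var data QueryData
	if err := json.Unmarshal(env.Data, &data); err != nil {
		return QueryData{}, fmt.Errorf("decode data: %w", err)
	}
	return data, nil
}

func main() {
	raw := []byte(`{"command":"query","success":true,"data":{"results":[{"id":"a1","content":"uses WAL","score":0.9}]}}`)
	data, err := DecodeQuery(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(data.Results), data.Results[0].ID) // prints: 1 a1
}
```

Checking "success" before touching "data" keeps a failed command from being mistaken for an empty result set.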

2.3 Multi-Language Bridging Strategy

The bridge is JSON over stdout/stdin. No gRPC, no Unix sockets, no FFI.

Python (Hermes)          Go (rv binary)           TypeScript (SDK)
     │                        │                        │
     │ subprocess.run()       │                        │
     ├── argv, stdin ────────►│                        │
     │◄── stdout JSON ────────┤                        │
     │                        │                        │
     │                        ├── MCP STDIO ───────────►│
     │                        │◄── JSON-RPC ────────────┤

Why not FFI (CGo -> Python, or PyO3 -> Rust):

  - FFI adds build complexity (Python headers, shared library loading).
  - Subprocess isolation means a crash in rv cannot corrupt Hermes' memory space.
  - JSON serialization is <2ms for typical payload sizes.
  - The contract is versioned: bower --version reports SemVer, Hermes can gate features.


3. STORAGE ARCHITECTURE

3.1 Database Selection: SQLite 3.44+ with WAL, FTS5, and BLOB vector storage

Single file: ~/.retriever/memory.db

SQLite PRAGMA configuration (applied at connection open):

PRAGMA journal_mode=WAL;          -- Concurrent reads during write
PRAGMA synchronous=NORMAL;        -- Safe with WAL, 2x write speed
PRAGMA cache_size=-32000;         -- 32MB page cache
PRAGMA busy_timeout=5000;         -- 5s wait on lock (single writer is fine)
PRAGMA foreign_keys=ON;           -- Enforce referential integrity
PRAGMA mmap_size=268435456;       -- 256MB memory-mapped I/O
PRAGMA temp_store=MEMORY;         -- Temp tables in RAM

3.2 Schema Design

Table: memories (core memory records)

CREATE TABLE memories (
    id              TEXT PRIMARY KEY,              -- hex-encoded SHA-256[:16]
    type            TEXT NOT NULL DEFAULT 'fact',  -- fact|pattern|decision|procedure|context
    content         TEXT NOT NULL,                 -- Full memory text
    summary         TEXT NOT NULL DEFAULT '',      -- First-sentence extractive summary
    importance      REAL NOT NULL DEFAULT 0.5,     -- [0.0, 1.0] computed importance
    access_count    INTEGER NOT NULL DEFAULT 0,    -- Number of times retrieved
    create_time     INTEGER NOT NULL,              -- Unix milliseconds
    access_time     INTEGER NOT NULL,              -- Unix milliseconds, last retrieval
    decay_rate      REAL NOT NULL DEFAULT 0.01,    -- Per-memory decay factor
    source_conv_id  TEXT,                          -- Conversation that created this memory (v0.2)
    supersedes_id   TEXT,                          -- ID of memory this one replaces (v0.2)
    confidence      REAL NOT NULL DEFAULT 1.0,     -- [0.0, 1.0] source confidence (v0.2)
    tags            TEXT NOT NULL DEFAULT '[]',    -- JSON array of strings
    keywords        TEXT NOT NULL DEFAULT '[]',    -- JSON array of extracted keywords
    metadata        TEXT NOT NULL DEFAULT '{}'     -- JSON object for extensibility
);

CREATE INDEX idx_memories_type ON memories(type);
CREATE INDEX idx_memories_access_time ON memories(access_time);
CREATE INDEX idx_memories_importance ON memories(importance DESC);
CREATE INDEX idx_memories_create_time ON memories(create_time);

Virtual Table: memories_fts (BM25 full-text search via FTS5)

CREATE VIRTUAL TABLE memories_fts USING fts5(
    summary,
    content,
    tags,
    keywords,
    content='memories',
    content_rowid='rowid',
    tokenize='porter unicode61 remove_diacritics 2'
);

-- Triggers keep FTS5 synchronized with memories table
CREATE TRIGGER memories_fts_insert AFTER INSERT ON memories BEGIN
    INSERT INTO memories_fts(rowid, summary, content, tags, keywords)
    VALUES (new.rowid, new.summary, new.content, new.tags, new.keywords);
END;

CREATE TRIGGER memories_fts_delete AFTER DELETE ON memories BEGIN
    INSERT INTO memories_fts(memories_fts, rowid, summary, content, tags, keywords)
    VALUES ('delete', old.rowid, old.summary, old.content, old.tags, old.keywords);
END;

CREATE TRIGGER memories_fts_update AFTER UPDATE ON memories BEGIN
    INSERT INTO memories_fts(memories_fts, rowid, summary, content, tags, keywords)
    VALUES ('delete', old.rowid, old.summary, old.content, old.tags, old.keywords);
    INSERT INTO memories_fts(rowid, summary, content, tags, keywords)
    VALUES (new.rowid, new.summary, new.content, new.tags, new.keywords);
END;

Table: embeddings (vector storage as BLOBs)

CREATE TABLE embeddings (
    memory_id   TEXT PRIMARY KEY REFERENCES memories(id) ON DELETE CASCADE,
    embedding   BLOB NOT NULL,              -- float32[] as little-endian bytes (4 bytes per element)
    model       TEXT NOT NULL DEFAULT '',   -- e.g. "text-embedding-004"
    dimension   INTEGER NOT NULL DEFAULT 768,
    created_at  INTEGER NOT NULL DEFAULT (CAST(unixepoch('subsec') * 1000 AS INTEGER))  -- Unix milliseconds; CAST avoids storing a REAL
);

CREATE INDEX idx_embeddings_model ON embeddings(model);

BLOB encoding format:

  - Each float32 is stored as 4 bytes, little-endian.
  - For 768 dimensions: 768 * 4 = 3072 bytes per row.
  - At 100K memories: 100000 * 3072 = 307,200,000 bytes (~293 MB) on disk, and the same in RAM when loaded.
  - With int8 quantization: 100000 * 768 * 1 = 76,800,000 bytes (~73 MB).
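
The encoding described above can be sketched in a few lines of Go. EncodeVector and DecodeVector are hypothetical helper names, not functions confirmed by this document; the sketch only assumes the stated layout (little-endian float32, 4 bytes per element).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// EncodeVector serializes a float32 slice as little-endian bytes,
// 4 bytes per element, matching the BLOB layout described above.
func EncodeVector(vec []float32) []byte {
	buf := make([]byte, 4*len(vec))
	for i, v := range vec {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(v))
	}
	return buf
}

// DecodeVector reverses EncodeVector. It rejects blobs whose length is not
// a multiple of 4, which would indicate a truncated or corrupt row.
func DecodeVector(blob []byte) ([]float32, error) {
	if len(blob)%4 != 0 {
		return nil, fmt.Errorf("blob length %d is not a multiple of 4", len(blob))
	}
	vec := make([]float32, len(blob)/4)
	for i := range vec {
		vec[i] = math.Float32frombits(binary.LittleEndian.Uint32(blob[4*i:]))
	}
	return vec, nil
}

func main() {
	vec := make([]float32, 768)
	vec[0], vec[767] = 0.25, -1.5
	blob := EncodeVector(vec)
	back, err := DecodeVector(blob)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(blob), back[0], back[767]) // prints: 3072 0.25 -1.5
}
```

Comparing the decoded length against the row's dimension column would be a natural extra integrity check when loading the index.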

Table: relations (typed, weighted graph edges)

CREATE TABLE relations (
    source_id   TEXT NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
    target_id   TEXT NOT NULL REFERENCES memories(id) ON DELETE CASCADE,
    type        TEXT NOT NULL,              -- causes|informs|contradicts|supersedes|example_of|prerequisite_for|led_to|related_to
    strength    REAL NOT NULL DEFAULT 1.0,  -- [0.0, 1.0]
    created_at  INTEGER NOT NULL DEFAULT (CAST(unixepoch('subsec') * 1000 AS INTEGER)),
    PRIMARY KEY (source_id, target_id, type)
);

CREATE INDEX idx_relations_source ON relations(source_id);
CREATE INDEX idx_relations_target ON relations(target_id);
CREATE INDEX idx_relations_type ON relations(type);

Table: embedding_cache (persistent embedding cache, survives restarts)

CREATE TABLE embedding_cache (
    content_hash TEXT PRIMARY KEY,          -- SHA-256 hex digest of input text
    embedding    BLOB NOT NULL,             -- float32[] as little-endian bytes
    model        TEXT NOT NULL,             -- which model produced this
    dimension    INTEGER NOT NULL,
    created_at   INTEGER NOT NULL,
    hit_count    INTEGER NOT NULL DEFAULT 0
);

CREATE INDEX idx_embedding_cache_model ON embedding_cache(model);

Table: audit_log (memory provenance, v0.2)

CREATE TABLE audit_log (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    memory_id   TEXT NOT NULL,              -- No FK: a cascade would delete the "pruned" entry along with the memory
    action      TEXT NOT NULL,              -- created|updated|merged|pruned|accessed|contradicted
    timestamp   INTEGER NOT NULL,
    details     TEXT NOT NULL DEFAULT '{}'  -- JSON: old values, reason, etc.
);

CREATE INDEX idx_audit_memory ON audit_log(memory_id);
CREATE INDEX idx_audit_timestamp ON audit_log(timestamp);

Table: schema_version (migration tracking)

CREATE TABLE schema_version (
    version     INTEGER PRIMARY KEY,
    applied_at  INTEGER NOT NULL,
    description TEXT NOT NULL
);

3.3 Index Strategy

Query Pattern                    Index Used                                          Complexity
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
FTS5 keyword search              memories_fts (virtual table, porter + unicode61)    O(log N) with bm25 scoring
Vector similarity (brute force)  In-memory []float32 slices, no SQL index            O(N * D) where N=memories, D=dims
Get memory by ID                 memories.id PRIMARY KEY                             O(log N) B-tree lookup
List by type                     idx_memories_type                                   O(log N)
Prune by decay                   idx_memories_access_time + idx_memories_importance  O(N) scan, indexed sort
Graph neighbors                  idx_relations_source + idx_relations_target         O(log N) per hop
Cache lookup                     embedding_cache.content_hash PRIMARY KEY            O(log N) B-tree lookup

3.4 Data Layout On Disk

~/.retriever/
├── memory.db          -- SQLite database (all tables)
├── memory.db-wal      -- Write-Ahead Log (auto-managed by SQLite)
├── memory.db-shm      -- Shared memory for WAL index
└── config.json        -- User configuration

Expected sizes (empirically estimated):

Corpus Size    DB File   WAL (typical)  RAM (float32 vectors)  RAM (int8 vectors)
──────────────────────────────────────────────────────────────────────────────────
1K memories    ~5 MB     <1 MB          ~3 MB                  ~0.7 MB
10K memories   ~40 MB    ~2 MB          ~29 MB                 ~7 MB
100K memories  ~350 MB   ~5 MB          ~293 MB                ~73 MB
500K memories  ~1.7 GB   ~10 MB         ~1.4 GB                ~366 MB
1M memories    ~3.5 GB   ~20 MB         ~2.9 GB                ~732 MB

4. EMBEDDING ARCHITECTURE

4.1 Model Selection Matrix

                    QUALITY ───────────────────────────────►
                    MTEB Retrieval score (higher is better)

Model                      Score   Dims   Cost        Latency    Offline   Best For
──────────────────────────────────────────────────────────────────────────────────
Gemini text-embedding-004   80.3%   768   Free          ~50ms     No       Default: good enough, free, fast
OpenAI 3-large             89.3%  3072   $0.13/1M tk   ~50ms     No       Maximum quality when budget allows
OpenAI 3-small             85.1%   512   $0.02/1M tk   ~30ms     No       Budget API with good quality
Voyage-3-large             90.1%  2048   Paid          ~60ms     No       Best quality, highest cost
Voyage-3-lite              87.2%   512   Paid          ~30ms     No       Voyage budget option
all-MiniLM-L6-v2 (ONNX)   ~75%     384   Free          ~5ms      Yes      Local fallback, CI/CD, air-gapped
bge-small-en-v1.5 (ONNX)  ~78%     384   Free          ~8ms      Yes      Better local quality
bge-base-en-v1.5 (ONNX)   ~82%     768   Free          ~15ms     Yes      Best local quality
gte-small (ONNX)          ~79%     384   Free          ~6ms      Yes      General-purpose local

4.2 Embedding Provider Selection Logic

// Embedder selection waterfall -- evaluated at startup
func SelectEmbedder(cfg EmbeddingConfig) (Embedder, error) {
    // 1. Explicit provider in config
    switch cfg.Provider {
    case "gemini":
        return NewGeminiEmbedder(...)
    case "openai":
        return NewOpenAIEmbedder(...)
    case "voyage":
        return NewVoyageEmbedder(...)
    case "onnx":
        return NewONNXEmbedder(cfg.ONNXModelPath)
    case "local":
        return NewONNXEmbedder(autoDetectBestLocalModel())
    case "auto", "":
        // Fall through to auto-detection
    }

    // 2. Check for API keys in environment
    if key := os.Getenv("GEMINI_API_KEY"); key != "" {
        return NewGeminiEmbedder(GeminiEmbedderConfig{APIKey: key})
    }
    if key := os.Getenv("OPENAI_API_KEY"); key != "" {
        return NewOpenAIEmbedder(OpenAIEmbedderConfig{APIKey: key})
    }

    // 3. Fall back to local ONNX
    modelPath := autoDetectBestLocalModel()
    if modelPath != "" {
        return NewONNXEmbedder(modelPath)
    }

    return nil, fmt.Errorf("no embedding provider available: set GEMINI_API_KEY or install ONNX models")
}

func autoDetectBestLocalModel() string {
    // os.Stat does not expand "~", so resolve the home directory explicitly.
    home, err := os.UserHomeDir()
    if err != nil {
        return ""
    }
    // Model files in order of quality preference:
    names := []string{
        "bge-base-en-v1.5.onnx",
        "bge-small-en-v1.5.onnx",
        "all-MiniLM-L6-v2.onnx",
    }
    for _, name := range names {
        p := filepath.Join(home, ".retriever", "models", name)
        if _, err := os.Stat(p); err == nil {
            return p
        }
    }
    return ""
}

4.3 Batching Strategy

API embeddings (Gemini, OpenAI, Voyage):

  - Individual requests are parallelized with a concurrency semaphore of 10 goroutines.
  - Gemini's API does not support true batch embedding. Each text is a separate HTTP request. Parallelism is the only optimization.
  - OpenAI supports batches of up to 2048 texts per request. Use this when the Embedder is OpenAI.
  - Voyage supports batches of up to 128 texts per request.

// Embedding concurrency configuration
const (
    MaxConcurrentAPIRequests = 10   // Limit to avoid rate-limit hammering
    APIBatchSize             = 100  // Texts per batch call when API supports it
    BatchTimeout             = 5 * time.Second
)
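
The semaphore-bounded fan-out described above might look like the following. This is a sketch under assumptions: EmbedFunc, EmbedAll, and fakeEmbed are hypothetical names standing in for the real provider call, and the error policy (return the first error after all in-flight calls finish) is one choice among several.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// EmbedFunc is a stand-in for one provider call (one text per request).
type EmbedFunc func(ctx context.Context, text string) ([]float32, error)

// EmbedAll embeds texts with at most maxConcurrent calls in flight.
// Results keep the input order; the first error is returned after all
// goroutines have finished.
func EmbedAll(ctx context.Context, texts []string, maxConcurrent int, embed EmbedFunc) ([][]float32, error) {
	if maxConcurrent < 1 {
		maxConcurrent = 1
	}
	out := make([][]float32, len(texts))
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i, text := range texts {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxConcurrent calls are running
		go func(i int, text string) {
			defer wg.Done()
			defer func() { <-sem }()
			vec, err := embed(ctx, text)
			if err != nil {
				once.Do(func() { firstErr = fmt.Errorf("embed text %d: %w", i, err) })
				return
			}
			out[i] = vec // each goroutine writes a distinct index
		}(i, text)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return out, nil
}

// fakeEmbed is a test double: it returns the text length as a 1-dim vector.
func fakeEmbed(_ context.Context, text string) ([]float32, error) {
	return []float32{float32(len(text))}, nil
}

func main() {
	vecs, err := EmbedAll(context.Background(), []string{"a", "bb", "ccc"}, 10, fakeEmbed)
	if err != nil {
		panic(err)
	}
	fmt.Println(vecs) // prints: [[1] [2] [3]]
}
```

Passing a context with a deadline (for example one derived from BatchTimeout) would let a stalled provider call be cancelled, provided the embed function honors the context.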

Local ONNX embeddings:

  - ONNX Runtime processes one text at a time (no native batching without dynamic axis).
  - Single-threaded, <10ms per text for 384-dim models.
  - For bulk operations (curate with many memories), run sequentially with progress.

4.4 Caching Architecture

Three-tier cache hierarchy:

Tier 1: In-memory LRU (sync.Map + ring buffer)
    ├── Size: 10,000 entries (~30 MB for 768-dim)
    ├── Eviction: LRU with TTL of 24 hours
    ├── Latency: ~100ns (map lookup)
    └── Hit rate: >95% for stable content

Tier 2: SQLite persistent cache (embedding_cache table)
    ├── Size: Unlimited (disk-backed)
    ├── Eviction: Manual (rv cache prune --older-than 30d)
    ├── Latency: ~0.5ms (indexed B-tree lookup)
    └── Survives restarts

Tier 3: Pre-compute during idle
    ├── After `bower curate`, a background goroutine warms the cache
    ├── For all uncached memories, embed and store
    └── Configurable: bower warm --all or automatic

Cache key derivation:

func CacheKey(text string, model string) string {
    h := sha256.Sum256([]byte(model + ":" + text))
    return hex.EncodeToString(h[:])
}

Cache invalidation rules:

  - Model change: all entries for old model are invalidated (different embedding space).
  - Content change: the cache key changes automatically (content-hash-based).
  - Never expires based on time alone for API embeddings (they are immutable).
  - TTL of 24h for in-memory LRU tier only (to bound RAM).
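
The Tier 1 behavior (bounded size, LRU eviction, TTL) can be illustrated with a small sketch. Note the swap: this uses a plain map plus container/list behind a mutex, not the sync.Map and ring buffer named above. LRUCache, its methods, and the injectable clock are all hypothetical names introduced for the example.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"time"
)

type entry struct {
	key     string
	vec     []float32
	expires time.Time
}

// LRUCache is a fixed-capacity LRU with a per-entry TTL.
type LRUCache struct {
	mu       sync.Mutex
	capacity int
	ttl      time.Duration
	now      func() time.Time // injectable clock (assumed, for testability)
	order    *list.List       // front = most recently used
	items    map[string]*list.Element
}

// NewLRUCache builds a cache; a nil clock defaults to time.Now.
func NewLRUCache(capacity int, ttl time.Duration, now func() time.Time) *LRUCache {
	if capacity < 1 {
		capacity = 1
	}
	if now == nil {
		now = time.Now
	}
	return &LRUCache{capacity: capacity, ttl: ttl, now: now,
		order: list.New(), items: make(map[string]*list.Element)}
}

// Get returns the cached vector and refreshes its recency. Expired entries
// are dropped and reported as misses.
func (c *LRUCache) Get(key string) ([]float32, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if c.now().After(e.expires) {
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el)
	return e.vec, true
}

// Put inserts or refreshes an entry, evicting the least recently used one
// when the cache is full.
func (c *LRUCache) Put(key string, vec []float32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.vec, e.expires = vec, c.now().Add(c.ttl)
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, vec: vec, expires: c.now().Add(c.ttl)})
}

func main() {
	c := NewLRUCache(2, 24*time.Hour, nil)
	c.Put("a", []float32{1})
	c.Put("b", []float32{2})
	c.Get("a")               // "a" becomes most recent
	c.Put("c", []float32{3}) // evicts "b"
	_, okA := c.Get("a")
	_, okB := c.Get("b")
	fmt.Println(okA, okB) // prints: true false
}
```

On a Tier 1 miss, the caller would fall through to the embedding_cache table and, on a hit there, Put the vector back into this cache.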

4.5 Local vs API Tradeoff Decisions

Factor               API (Gemini)            Local (ONNX)
──────────────────────────────────────────────────────────────────────
Quality              80.3% MTEB              ~75-78% MTEB
Latency (single)     50ms (network)          5-8ms (CPU)
Latency (batch 100)  ~500ms (10 concurrent)  ~500-800ms (sequential)
Cost                 Free (1500 RPM limit)   Free (no limit)
Offline              No                      Yes
Setup                API key env var         ~200MB model download
Dimension            768                     384
Privacy              Text leaves machine     Everything local
Rate limit           1500 requests/minute    Unlimited

Decision matrix for automatic selection:

Is GEMINI_API_KEY set?
├── YES → Use Gemini (free, good quality, fast enough)
│   └── Is network unreachable?
│       └── YES → Log warning, use local ONNX fallback
└── NO → Is OPENAI_API_KEY set?
    ├── YES → Use OpenAI (best quality, paid)
    └── NO → Use local ONNX (always available, no keys needed)

5.1 Algorithm Selection: Phased Strategy

Phase 1: Brute-force cosine similarity (MVP, <100K vectors)
    Algorithm:   Exhaustive scan with goroutine parallelism
    Partition:   Split vector space into GOMAXPROCS shards
    SIMD:        Pure Go scalar (no CGo dependency)
    Memory:      Float32 vectors in contiguous []float32 slices
    Latency:     ~12ms for 100K x 768D (with 8 goroutines)

Phase 2: int8 quantization (v0.2, <500K vectors)
    Algorithm:   Same brute-force, but on quantized int8 vectors
    Quantization: Per-vector min/max scaling to [-128, 127]
    SIMD:        Go assembler or simsimd CGo bindings
    Memory:      4x reduction vs float32
    Latency:     ~8ms for 100K x 768D (smaller data, better cache)

Phase 3: HNSW index (v0.4, >500K vectors)
    Algorithm:   HNSW (Hierarchical Navigable Small World)
    Parameters:  M=16, efConstruction=200, efSearch=50
    Recall:      ~95% @ k=10 (vs brute-force baseline)
    Memory:      ~2x float32 vectors (graph edges + vectors)
    Latency:     <5ms regardless of corpus size
    Implementation: Custom pure-Go HNSW or chromem-go integration

5.2 Phase 1: Brute-Force Implementation

// VectorIndex is the in-memory vector search index.
type VectorIndex struct {
    mu       sync.RWMutex
    vectors  map[string][]float32  // memoryID -> embedding
    dims     int
    shards   int                    // Number of parallel shards
}

// Search finds the top-K most similar vectors to the query.
func (vi *VectorIndex) Search(query []float32, k int, minScore float32) []ScoredID {
    vi.mu.RLock()
    defer vi.mu.RUnlock()

    // Partition IDs across shards
    ids := make([]string, 0, len(vi.vectors))
    for id := range vi.vectors {
        ids = append(ids, id)
    }

    // Each shard computes top-K for its partition
    shardSize := (len(ids) + vi.shards - 1) / vi.shards
    results := make(chan []ScoredID, vi.shards)

    for s := 0; s < vi.shards; s++ {
        start := s * shardSize
        end := min(start+shardSize, len(ids))
        if start >= end {
            results <- nil
            continue
        }
        go func(partition []string) {
            shardResults := bruteForceTopK(vi.vectors, partition, query, k, minScore)
            results <- shardResults
        }(ids[start:end])
    }

    // Merge shard results
    all := make([]ScoredID, 0, k*vi.shards)
    for s := 0; s < vi.shards; s++ {
        shardResults := <-results
        all = append(all, shardResults...)
    }

    // Global top-K
    sort.Slice(all, func(i, j int) bool { return all[i].Score > all[j].Score })
    if len(all) > k {
        all = all[:k]
    }
    return all
}

func bruteForceTopK(
    vectors map[string][]float32,
    ids []string,
    query []float32,
    k int,
    minScore float32,
) []ScoredID {
    // Min-heap of size K
    heap := &boundedHeap{k: k}
    heap.items = make([]ScoredID, 0, k)

    for _, id := range ids {
        vec := vectors[id]
        score := cosineSimilarity(query, vec)
        if score >= minScore {
            heap.Push(ScoredID{ID: id, Score: score})
        }
    }
    return heap.items
}
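
The listing above relies on cosineSimilarity and boundedHeap without showing them. One plausible shape for both, built on container/heap, is sketched below. These definitions are assumptions made for illustration, not the implementation's actual code; only the names and the way they are called are taken from the listing.

```go
package main

import (
	"container/heap"
	"fmt"
	"math"
)

type ScoredID struct {
	ID    string
	Score float32
}

// cosineSimilarity returns dot(a, b) / (|a| * |b|), or 0 for mismatched
// lengths and zero vectors.
func cosineSimilarity(a, b []float32) float32 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / math.Sqrt(na*nb))
}

// minHeap orders ScoredIDs by ascending score, so the root is the weakest
// candidate currently kept.
type minHeap []ScoredID

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].Score < h[j].Score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(ScoredID)) }
func (h *minHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// boundedHeap keeps at most k items with the highest scores.
type boundedHeap struct {
	k     int
	items minHeap
}

// Push adds s if there is room, or replaces the current minimum when s
// scores higher. Each call costs O(log k).
func (b *boundedHeap) Push(s ScoredID) {
	if b.k <= 0 {
		return
	}
	if len(b.items) < b.k {
		heap.Push(&b.items, s)
		return
	}
	if s.Score > b.items[0].Score {
		b.items[0] = s
		heap.Fix(&b.items, 0)
	}
}

func main() {
	query := []float32{1, 0}
	vectors := map[string][]float32{
		"same": {2, 0}, "diag": {1, 1}, "orth": {0, 1}, "opp": {-1, 0},
	}
	top := &boundedHeap{k: 2}
	for id, v := range vectors {
		top.Push(ScoredID{ID: id, Score: cosineSimilarity(query, v)})
	}
	// The root holds the weaker of the two survivors.
	fmt.Println(top.items[0].ID, top.items[1].ID) // prints: diag same
}
```

Because the heap stores items in heap order rather than sorted order, the final sort in Search is still needed before truncating to k.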

5.3 Phase 2: int8 Quantization (v0.2)

// QuantizedVector stores an int8-quantized embedding.
type QuantizedVector struct {
    Min   float32   // Per-vector minimum value
    Max   float32   // Per-vector maximum value
    Data  []int8    // Quantized dimensions, length = dims
}

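// Quantize converts a float32 embedding to int8 by mapping the vector's
// [min, max] range onto [-128, 127].
func Quantize(vec []float32) QuantizedVector {
    if len(vec) == 0 {
        return QuantizedVector{}
    }
    minVal, maxVal := vec[0], vec[0]
    for _, v := range vec[1:] {
        if v < minVal { minVal = v }
        if v > maxVal { maxVal = v }
    }
    // A constant vector has zero range; scale 0 maps every element to -128
    // instead of dividing by zero.
    var scale float32
    if maxVal > minVal {
        scale = 255.0 / (maxVal - minVal)
    }
    data := make([]int8, len(vec))
    for i, v := range vec {
        // Round to nearest and clamp so float error cannot overflow int8.
        code := math.Round(float64((v-minVal)*scale)) - 128
        data[i] = int8(math.Max(-128, math.Min(127, code)))
    }
    return QuantizedVector{Min: minVal, Max: maxVal, Data: data}
}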

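// CosineSimilarityInt8 computes approximate cosine similarity on quantized vectors.
// Achieves ~0.99 correlation with float32 computation.
// The int8 codes are compared directly, so the result tracks the float32 cosine
// only when each vector's value range is roughly symmetric around zero (Min ≈ -Max).
func CosineSimilarityInt8(a, b QuantizedVector) float32 {
    if len(a.Data) != len(b.Data) {
        return 0
    }
    var dot, normA, normB int64
    for i := range a.Data {
        dot += int64(a.Data[i]) * int64(b.Data[i])
        normA += int64(a.Data[i]) * int64(a.Data[i])
        normB += int64(b.Data[i]) * int64(b.Data[i])
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    // Cosine divides by the product of the norms, i.e. sqrt(normA * normB).
    return float32(float64(dot) / math.Sqrt(float64(normA)*float64(normB)))
}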
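
To see how much precision the min/max mapping loses, the quantization can be inverted and compared against the input. In the sketch below, Dequantize is a hypothetical helper that does not appear in this document, and Quantize is restated (with rounding and a guard for constant vectors) only so the block runs on its own.

```go
package main

import (
	"fmt"
	"math"
)

type QuantizedVector struct {
	Min, Max float32
	Data     []int8
}

// Quantize maps the vector's [min, max] range onto int8 codes in [-128, 127].
func Quantize(vec []float32) QuantizedVector {
	if len(vec) == 0 {
		return QuantizedVector{}
	}
	minVal, maxVal := vec[0], vec[0]
	for _, v := range vec[1:] {
		if v < minVal {
			minVal = v
		}
		if v > maxVal {
			maxVal = v
		}
	}
	scale := 0.0 // stays 0 for constant vectors, so every code becomes -128
	if maxVal > minVal {
		scale = 255 / float64(maxVal-minVal)
	}
	data := make([]int8, len(vec))
	for i, v := range vec {
		code := math.Round(float64(v-minVal)*scale) - 128
		data[i] = int8(math.Max(-128, math.Min(127, code)))
	}
	return QuantizedVector{Min: minVal, Max: maxVal, Data: data}
}

// Dequantize inverts Quantize: x ≈ Min + (code + 128) * (Max - Min) / 255.
func Dequantize(q QuantizedVector) []float32 {
	step := float64(q.Max-q.Min) / 255
	out := make([]float32, len(q.Data))
	for i, c := range q.Data {
		out[i] = float32(float64(q.Min) + (float64(c)+128)*step)
	}
	return out
}

func main() {
	vec := []float32{-1, -0.5, 0, 0.25, 1}
	q := Quantize(vec)
	back := Dequantize(q)
	var worst float64
	for i := range vec {
		worst = math.Max(worst, math.Abs(float64(back[i]-vec[i])))
	}
	fmt.Println(q.Data[0], q.Data[len(q.Data)-1]) // prints: -128 127
	fmt.Printf("max round-trip error: %.5f\n", worst)
}
```

With rounding to the nearest code, the round-trip error per element is bounded by about half a quantization step, (Max - Min) / 510.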

5.4 Latency Targets By Corpus Size

Corpus Size   Phase  Algorithm                       p50 Latency  p95 Latency  RAM
──────────────────────────────────────────────────────────────────────────────────
1K vectors    1      Brute-force float32             <1ms         <2ms         ~3 MB
10K vectors   1      Brute-force float32             ~2ms         <5ms         ~29 MB
50K vectors   1      Brute-force float32             ~8ms         <15ms        ~146 MB
100K vectors  1      Brute-force float32 (8 shards)  ~12ms        <25ms        ~293 MB
100K vectors  2      Brute-force int8 (8 shards)     ~6ms         <12ms        ~73 MB
500K vectors  2      Brute-force int8 (8 shards)     ~30ms        <50ms        ~366 MB
500K vectors  3      HNSW float32                    <4ms         <8ms         ~540 MB
1M+ vectors   3      HNSW + sharding                 <8ms         <15ms        ~2.7 GB

5.5 When To Trigger Phase Transition

func (vi *VectorIndex) shouldUpgrade() IndexTier {
    count := vi.Count()
    switch {
    case count < 100_000:
        return TierBruteForceFloat32
    case count < 500_000:
        return TierBruteForceInt8
    default:
        return TierHNSW
    }
}

// Automatic upgrade on Insert once a threshold is reached.
// The comparisons use >= so they agree with the boundaries in shouldUpgrade.
func (vi *VectorIndex) Insert(id string, vec []float32) error {
    vi.mu.Lock()
    defer vi.mu.Unlock()
    vi.vectors[id] = vec
    if len(vi.vectors) >= 100_000 && vi.tier == TierBruteForceFloat32 {
        vi.upgradeToInt8() // async, non-blocking
    }
    if len(vi.vectors) >= 500_000 && vi.tier == TierBruteForceInt8 {
        vi.upgradeToHNSW() // async, non-blocking
    }
    return nil
}

6. MEMORY LIFECYCLE

6.1 Complete Lifecycle State Machine

                    ┌─────────┐
                    │  Empty  │
                    └────┬────┘
                         │ bower curate "text"
                    ┌─────────┐
                    │ Created │──→ importance = computeImportance(content)
                    └────┬────┘    decay_rate = 0.01
                         │         confidence = 1.0 (or LLM-provided)
              ┌──────────┼──────────┐
              │          │          │
              ▼          ▼          ▼
         ┌────────┐ ┌────────┐ ┌─────────┐
         │ Active │ │Merged  │ │Superseded│
         │        │ │(duplicate│(explicit │
         │normal  │ │  found) │  replace)│
         └───┬────┘ └────┬───┘ └────┬────┘
             │           │          │
    ┌────────┼───────┐   │          │
    │        │       │   │          │
    ▼        ▼       ▼   ▼          ▼
┌───────┐ ┌──────┐ ┌──────┐  ┌──────────┐
│Accessed│ │Decayed│ │Pruned│  │Compressed │
│(boost) │ │(low  │ │(<thr-│  │(merged    │
│        │ │ eff) │ │ eshold)│  │into pattern)
└───┬────┘ └──┬───┘ └──┬───┘  └──────────┘
    │         │        │
    └─────────┘        │
    (loops back)       │
                   ┌───▼───┐
                   │Deleted │
                   └───────┘

6.2 Importance Scoring Algorithm

// ComputeImportance calculates the initial importance score for new content.
// Combines multiple signals into a [0.0, 1.0] score.
func ComputeImportance(content string, memType MemType, metadata map[string]string) float64 {
    var score float64

    // Signal 1: Content length (longer = more information, up to a point)
    lengthScore := math.Min(float64(len(content))/500.0, 1.0) * 0.10

    // Signal 2: Named entity density (proper nouns, dates, numbers indicate facts)
    entityCount := countNamedEntities(content)
    entityScore := math.Min(float64(entityCount)/10.0, 1.0) * 0.15

    // Signal 3: Decision/pattern keywords (these are high-value memory types)
    keywordScore := 0.0
    if memType == MemDecision { keywordScore = 0.3 }
    if memType == MemPattern { keywordScore = 0.25 }
    if memType == MemProcedure { keywordScore = 0.2 }

    // Signal 4: Explicit priority hint from metadata
    priorityScore := 0.0
    if p, ok := metadata["priority"]; ok {
        switch p {
        case "critical": priorityScore = 0.3
        case "high": priorityScore = 0.2
        case "low": priorityScore = -0.1
        }
    }

    // Signal 5: Source confidence (LLM extraction confidence, if available)
    confidenceScore := 0.0
    if conf, ok := metadata["confidence"]; ok {
        if c, err := strconv.ParseFloat(conf, 64); err == nil {
            confidenceScore = c * 0.2
        }
    }

    score = lengthScore + entityScore + keywordScore + priorityScore + confidenceScore
    return clamp(score, 0.05, 1.0) // Minimum importance to avoid immediate pruning
}
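
A worked example makes the weighting concrete. The sketch below restates the formula with plain numeric inputs; the signals struct and the importance function are hypothetical, EntityCount stands in for the countNamedEntities call (whose implementation is not shown here), and clamp is one possible definition of the helper the function relies on.

```go
package main

import (
	"fmt"
	"math"
)

// signals holds the inputs of the scoring formula as plain numbers.
type signals struct {
	ContentLen  int     // len(content) in bytes
	EntityCount int     // stand-in for countNamedEntities(content)
	TypeBonus   float64 // 0.3 decision, 0.25 pattern, 0.2 procedure, else 0
	Priority    float64 // 0.3 critical, 0.2 high, -0.1 low, else 0
	Confidence  float64 // 0 when no confidence is provided
}

// clamp limits x to the closed interval [lo, hi].
func clamp(x, lo, hi float64) float64 { return math.Max(lo, math.Min(hi, x)) }

// importance applies the same weights as ComputeImportance.
func importance(s signals) float64 {
	length := math.Min(float64(s.ContentLen)/500.0, 1.0) * 0.10
	entity := math.Min(float64(s.EntityCount)/10.0, 1.0) * 0.15
	return clamp(length+entity+s.TypeBonus+s.Priority+s.Confidence*0.2, 0.05, 1.0)
}

func main() {
	// Decision, 250 bytes, 4 entities, priority "high", confidence 0.8:
	// 0.05 + 0.06 + 0.30 + 0.20 + 0.16 = 0.77
	decision := signals{ContentLen: 250, EntityCount: 4, TypeBonus: 0.3, Priority: 0.2, Confidence: 0.8}
	fmt.Printf("%.2f\n", importance(decision)) // prints: 0.77

	// Short low-priority fact: 0.008 - 0.1 is negative, so the floor applies.
	fact := signals{ContentLen: 40, Priority: -0.1}
	fmt.Printf("%.2f\n", importance(fact)) // prints: 0.05

	// Every signal at its maximum sums to 1.05, which the clamp caps.
	full := signals{ContentLen: 2000, EntityCount: 20, TypeBonus: 0.3, Priority: 0.3, Confidence: 1}
	fmt.Printf("%.2f\n", importance(full)) // prints: 1.00
}
```

The last case shows why the clamp is needed: the five weights can add up to 1.05.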

6.3 Temporal Decay Model

// EffectiveImportance returns the importance score after temporal decay.
// Implements Ebbinghaus-inspired forgetting curve.
func (m *Memory) EffectiveImportance(now time.Time) float64 {
    daysSinceAccess := now.Sub(m.AccessTime).Hours() / 24.0
    freshnessBonus := 1.0 / (1.0 + m.DecayRate*daysSinceAccess)

    // Age-based decay (slow, gentle — older memories fade unless accessed)
    daysSinceCreation := now.Sub(m.CreateTime).Hours() / 24.0
    ageFactor := 1.0 / (1.0 + 0.001*daysSinceCreation) // Very slow: 50% at ~2.7 years

    return m.Importance * freshnessBonus * ageFactor
}

// RecordAccess strengthens a memory when it's retrieved.
func (m *Memory) RecordAccess(now time.Time) {
    m.AccessCount++
    m.AccessTime = now
    m.Importance = math.Min(1.0, m.Importance + 0.01)  // Slight boost
    m.DecayRate = math.Max(0.001, m.DecayRate * 0.95)  // Slow decay further
}

Decay curve properties:

  - Decay rate 0.01: drops to 50% after ~100 days without access.
  - Decay rate 0.05: drops to 50% after ~20 days without access.
  - Each access slows decay by 5% (multiplied by 0.95).
  - Each access boosts importance by 0.01 (capped at 1.0).
  - Minimum decay rate is 0.001 (never fully static).
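
These properties follow directly from the freshness term 1/(1 + r*d): setting it to 0.5 gives d = 1/r. The sketch below checks the numbers with plain floats in place of time.Time values. The function names are hypothetical, and the half-life figures cover the freshness term only, since the age factor multiplies in separately.

```go
package main

import "fmt"

// effective mirrors EffectiveImportance using day counts instead of timestamps.
func effective(importance, decayRate, daysSinceAccess, daysSinceCreation float64) float64 {
	freshness := 1.0 / (1.0 + decayRate*daysSinceAccess)
	age := 1.0 / (1.0 + 0.001*daysSinceCreation)
	return importance * freshness * age
}

// halfLifeDays solves 1/(1 + r*d) = 0.5 for d, which gives d = 1/r.
func halfLifeDays(decayRate float64) float64 { return 1.0 / decayRate }

// decayRateAfter applies the 0.95 multiplier once per access, with the
// 0.001 floor used by RecordAccess.
func decayRateAfter(initial float64, accesses int) float64 {
	r := initial
	for i := 0; i < accesses; i++ {
		r *= 0.95
		if r < 0.001 {
			r = 0.001
		}
	}
	return r
}

func main() {
	fmt.Printf("half-life at rate 0.01: %.0f days\n", halfLifeDays(0.01)) // 100 days
	fmt.Printf("half-life at rate 0.05: %.0f days\n", halfLifeDays(0.05)) // 20 days

	// A memory created 100 days ago and never accessed since also pays the
	// age factor: 0.5 * (1 / 1.1).
	fmt.Printf("%.3f\n", effective(1.0, 0.01, 100, 100)) // prints: 0.455

	// Starting from 0.01, the floor is reached on the 45th access.
	fmt.Println(decayRateAfter(0.01, 44) > 0.001, decayRateAfter(0.01, 45) == 0.001) // prints: true true
}
```

In practice a memory that is never accessed decays slightly faster than the listed half-lives suggest, because creation age and time since access grow together.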

6.4 Memory Update (Merge) Rules

// MergeDecision determines whether a new memory should be a create, update, or merge.
type MergeDecision int
const (
    MergeCreate   MergeDecision = iota  // Create new memory
    MergeUpdate                         // Update existing memory (same ID)
    MergeAppend                         // Append to existing content
    MergeSupersede                      // Replace existing, link as superseded
)

func DecideMerge(existing *Memory, newContent string, similarity float64) MergeDecision {
    switch {
    case similarity < 0.70:
        return MergeCreate     // Different enough: new memory
    case similarity < 0.85:
        return MergeAppend     // Related: append to existing
    case similarity < 0.95:
        return MergeUpdate     // Very similar: replace content
    default:
        return MergeSupersede  // Nearly identical: supersede with link
    }
}

6.5 Conflict Resolution

When two memories contradict each other (detected via LLM in v0.3):

func ResolveContradiction(existing *Memory, contradictory *Memory) Resolution {
    // 1. Trust recency: newer information is more likely correct
    if contradictory.CreateTime.After(existing.CreateTime.Add(7 * 24 * time.Hour)) {
        // Newer by more than a week → supersede old
        return Resolution{Supersede: existing, Keep: contradictory}
    }

    // 2. Trust confidence: higher confidence source wins
    if contradictory.Confidence > existing.Confidence + 0.2 {
        return Resolution{Supersede: existing, Keep: contradictory}
    }

    // 3. Flag for human review
    return Resolution{
        FlagForReview: true,
        Memories:      []*Memory{existing, contradictory},
        Reason:        "Conflicting information with similar confidence",
    }
}

6.6 Pruning Policy

// PruneDecayed removes memories below the effective importance threshold.
// Runs on `bower prune` or scheduled via `bower serve --prune-interval 24h`.
func (s *Service) PruneDecayed(threshold float64) (int, error) {
    now := time.Now()
    memories, err := s.db.ListAll()
    if err != nil {
        return 0, fmt.Errorf("list memories for pruning: %w", err)
    }

    pruned := 0
    for _, mem := range memories {
        // Compute once so the audit entry records the value that was compared.
        eff := mem.EffectiveImportance(now)
        if eff < threshold {
            // Safety: never prune decisions or procedures
            if mem.Type == MemDecision || mem.Type == MemProcedure {
                continue
            }
            // Add audit log entry before deleting
            s.db.LogAudit(mem.ID, "pruned", map[string]any{
                "effective_importance": eff,
                "threshold":            threshold,
            })
            s.db.Delete(mem.ID)
            s.vectorIndex.Remove(mem.ID)
            pruned++
        }
    }
    return pruned, nil
}

6.7 Memory Consolidation (Compression, v0.3)

// Consolidate merges groups of very similar memories into generalized patterns.
func (s *Service) Consolidate(ctx context.Context) (int, error) {
    // 1. Cluster memories by embedding similarity (cosine > 0.90)
    clusters := s.clusterBySimilarity(0.90)

    // 2. For each cluster of 3+ similar memories, try to extract a pattern
    merged := 0
    for _, cluster := range clusters {
        if len(cluster) < 3 {
            continue
        }
        // Use LLM (fast model) to extract common pattern
        pattern, err := s.extractPattern(ctx, cluster)
        if err != nil {
            continue
        }
        // Store the pattern, link original memories as examples
        s.createPattern(pattern, cluster)
        merged++
    }
    return merged, nil
}

7. RETRIEVAL PIPELINE

7.1 Multi-Stage Retrieval Architecture

Query String
┌─────────────────────────────────────────────────────┐
│ STAGE 1: QUERY ANALYSIS (1ms)                       │
│   - Extract intent keywords                         │
│   - Classify query type: fact/decision/procedure    │
│   - Detect temporal signals ("last week", "recent") │
│   - Generate embedding for query                    │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ STAGE 2: PARALLEL CANDIDATE RETRIEVAL (target: 30ms)│
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │  BM25    │  │  Vector  │  │  Graph   │           │
│  │  (FTS5)  │  │  (cosine)│  │  (1-hop) │           │
│  │  200     │  │  200     │  │  50      │           │
│  │  cands   │  │  cands   │  │  cands   │           │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘           │
│       │    ~3ms     │   ~15ms    │   ~2ms           │
└───────┼─────────────┼────────────┼──────────────────┘
        │             │            │
        └─────────────┼────────────┘
┌─────────────────────────────────────────────────────┐
│ STAGE 3: FUSION (1ms)                               │
│   - Reciprocal Rank Fusion (RRF, k=60)              │
│   - Weighted: BM25=0.4, Vector=0.5, Graph=0.1       │
│   - Apply importance + decay bonus                  │
│   - Deduplicate across retrieval sources            │
│   - Output: Top-50 ranked candidates                │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ STAGE 4: RE-RANKING (optional, v0.2, ~400ms)        │
│   - Use Gemini 2.5 Flash to score top-20            │
│   - Cross-encoder style: "Rate relevance 1-5"       │
│   - Only applied for ambiguous queries              │
│   - Gate: skip if top result score > 0.80           │
└──────────────────────┬──────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ STAGE 5: CONTEXT BUDGETING (1ms)                    │
│   - Sort by final score, apply limit                │
│   - Truncate content to budget per result           │
│   - Ensure total context fits in window budget      │
│   - Output: Final ranked list                       │
└──────────────────────┬──────────────────────────────┘
                  Results JSON

7.2 Hybrid Search: RRF Formula

// ReciprocalRankFusion combines multiple ranked lists into one.
// k=60 is the standard parameter from the RRF paper.
func ReciprocalRankFusion(
    bm25Results map[string]float64,    // memoryID -> normalized BM25 score
    vectorResults map[string]float64,  // memoryID -> cosine similarity
    graphResults map[string]float64,   // memoryID -> graph boost score
    bm25Weight, vectorWeight, graphWeight float64,
    k int,
) []ScoredID {
    scores := make(map[string]float64)

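    // Score-based RRF variant: each score in (0, 1] is mapped to a pseudo-rank
    // int((1 - score) * k), and the source contributes weight / (k + pseudo-rank).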
    for id, score := range bm25Results {
        if score > 0 {
            scores[id] += bm25Weight / float64(k + int((1.0-score)*float64(k)))
        }
    }
    for id, score := range vectorResults {
        if score > 0 {
            scores[id] += vectorWeight / float64(k + int((1.0-score)*float64(k)))
        }
    }
    for id, boost := range graphResults {
        scores[id] += graphWeight * boost * 0.01
    }

    // Convert to sorted slice
    results := make([]ScoredID, 0, len(scores))
    for id, score := range scores {
        results = append(results, ScoredID{ID: id, Score: score})
    }
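    sort.Slice(results, func(i, j int) bool {
        if results[i].Score != results[j].Score {
            return results[i].Score > results[j].Score
        }
        return results[i].ID < results[j].ID // Map iteration is random; break ties by ID
    })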
    return results
}
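
For comparison, textbook RRF works on rank positions rather than normalized scores. The sketch below is not the function above: `RankRRF` is a hypothetical name, each input list is assumed to be sorted best-first, and `weights` must have one entry per list.

```go
package main

import (
	"fmt"
	"sort"
)

// ScoredID mirrors the result type used in this section.
type ScoredID struct {
	ID    string
	Score float64
}

// RankRRF fuses ranked ID lists using weight / (k + rank), with rank starting at 1.
// len(weights) must equal len(lists).
func RankRRF(lists [][]string, weights []float64, k int) []ScoredID {
	scores := make(map[string]float64)
	for i, list := range lists {
		for pos, id := range list {
			scores[id] += weights[i] / float64(k+pos+1)
		}
	}
	out := make([]ScoredID, 0, len(scores))
	for id, s := range scores {
		out = append(out, ScoredID{ID: id, Score: s})
	}
	// Sort by score, breaking ties by ID so the output is deterministic.
	sort.Slice(out, func(a, b int) bool {
		if out[a].Score != out[b].Score {
			return out[a].Score > out[b].Score
		}
		return out[a].ID < out[b].ID
	})
	return out
}

func main() {
	bm25 := []string{"m1", "m2", "m3"}
	vector := []string{"m2", "m1", "m4"}
	for _, r := range RankRRF([][]string{bm25, vector}, []float64{0.4, 0.5}, 60) {
		fmt.Printf("%s %.5f\n", r.ID, r.Score)
	}
}
```

Ranks make the fusion insensitive to how each source scales its scores, which is the usual argument for RRF; the score-based variant above keeps some of that magnitude information instead.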

7.3 BM25 Implementation (FTS5-specific)

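// SearchFTS performs BM25-ranked full-text search via SQLite FTS5.
// FTS5's bm25() returns lower (more negative) values for better matches,
// so the ascending ORDER BY below yields best-first results.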
func (d *DB) SearchFTS(query string, limit int) ([]*types.Memory, error) {
    // FTS5 query: escape special characters, support prefix queries with *
    cleanQuery := sanitizeFTS5Query(query)

    rows, err := d.db.Query(`
        SELECT m.id, m.type, m.content, m.summary, m.importance, m.access_count,
               m.create_time, m.access_time, m.decay_rate, m.tags, m.keywords, m.metadata,
               bm25(memories_fts, 0.0, 10.0, 5.0) as bm25_score
        FROM memories_fts f
        JOIN memories m ON f.rowid = m.rowid
        WHERE memories_fts MATCH ?
        ORDER BY bm25_score
        LIMIT ?
    `, cleanQuery, limit)

    if err != nil {
        // FTS5 rejects some syntax; fall back to LIKE with importance sort
        return d.searchFallback(query, limit)
    }
    defer rows.Close()
    return scanMemoriesWithScore(rows)
}

// sanitizeFTS5Query escapes FTS5 special characters and adds prefix matching.
func sanitizeFTS5Query(q string) string {
    // Remove characters that FTS5 treats as operators: ^ * " ( )
    q = strings.NewReplacer(
        "^", " ", "*", " ", "\"", " ", "(", " ", ")", " ",
    ).Replace(q)
    // Split into terms and add a prefix wildcard to each term longer than two bytes.
    // Any "*" was stripped above, so no term can already end in one.
    terms := strings.Fields(q)
    for i, t := range terms {
        if len(t) > 2 {
            terms[i] = t + "*" // Prefix matching
        }
    }
    return strings.Join(terms, " ")
}
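
Stripping operators still leaves inputs such as `rate-limit` or a bare `AND` that FTS5 may reject, which is what the LIKE fallback covers. An alternative, sketched below under the assumption that prefix matching should be kept, is to wrap every term in double quotes so FTS5 treats it as a string. `quoteFTS5Terms` is a hypothetical helper, not part of the codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteFTS5Terms turns free text into an FTS5 query of quoted prefix terms.
// Embedded double quotes are doubled, which is how FTS5 escapes them inside a string.
func quoteFTS5Terms(q string) string {
	terms := strings.Fields(q)
	out := make([]string, 0, len(terms))
	for _, t := range terms {
		quoted := `"` + strings.ReplaceAll(t, `"`, `""`) + `"`
		if len([]rune(t)) > 2 {
			quoted += "*" // Prefix match for terms longer than two characters
		}
		out = append(out, quoted)
	}
	return strings.Join(out, " ")
}

func main() {
	fmt.Println(quoteFTS5Terms(`rate-limit AND db`))
	// prints: "rate-limit"* "AND"* "db"
}
```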

7.4 Graph-Aware Retrieval (1-Hop Expansion)

// expandGraph enriches top candidates with their graph neighbors.
func (e *Engine) expandGraph(topResults []ScoredID) []ScoredID {
    expanded := make(map[string]*graphCand)

    for _, result := range topResults[:min(5, len(topResults))] {
        // Get outgoing relations (this memory CAUSES, INFORMS, etc.)
        rels, err := e.db.GetRelations(result.ID)
        if err != nil {
            continue
        }
        for _, rel := range rels {
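            // Keep the first path to each neighbor. Assuming topResults is sorted
            // best-first, lower-ranked sources never overwrite a higher-ranked one.
            if _, exists := expanded[rel.TargetID]; !exists {
                expanded[rel.TargetID] = &graphCand{
                    score:    result.Score * rel.Strength * 0.3,
                    sourceID: result.ID,
                    relType:  rel.Type,
                }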
            }
        }
    }

    // Convert to results
    var out []ScoredID
    for id, gc := range expanded {
        out = append(out, ScoredID{
            ID:    id,
            Score: gc.score,
        })
    }
    return out
}

7.5 Re-Ranking Strategy (v0.2)

// RerankWithLLM uses a fast LLM to re-rank the top candidates.
// Only invoked when the score gap between #1 and #2 is < 0.15 (ambiguous query).
func (e *Engine) RerankWithLLM(
    ctx context.Context,
    query string,
    candidates []*types.Memory,
) ([]*types.Memory, error) {
    if len(candidates) <= 1 {
        return candidates, nil
    }

    // Ambiguity gate: this function does not receive the fused scores, so the
    // caller is expected to skip re-ranking when the gap between #1 and #2 is >= 0.15.

    // Build prompt: "Rate each passage's relevance to the query on a scale of 1-5"
    prompt := buildRerankPrompt(query, candidates)
    response, err := e.llmCaller.Call(ctx, prompt, ModelGeminiFlash2)
    if err != nil {
        return candidates, nil // Graceful degradation: return original order
    }

    scores := parseRerankScores(response, len(candidates))
    sortByScore(candidates, scores)
    return candidates, nil
}
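
The ambiguity gate described in the comment on RerankWithLLM could live in the caller, where the fused scores are available. A minimal sketch, assuming scores are sorted descending and normalized to [0, 1]; `shouldRerank` is a hypothetical helper name.

```go
package main

import "fmt"

// shouldRerank reports whether the top two scores are close enough to be ambiguous.
// scores must be sorted descending; maxGap is the 0.15 threshold from the comment above.
func shouldRerank(scores []float64, maxGap float64) bool {
	if len(scores) < 2 {
		return false // Nothing to reorder
	}
	return scores[0]-scores[1] < maxGap
}

func main() {
	fmt.Println(shouldRerank([]float64{0.82, 0.79}, 0.15)) // true: close scores, re-rank
	fmt.Println(shouldRerank([]float64{0.90, 0.40}, 0.15)) // false: clear winner, skip
}
```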

func buildRerankPrompt(query string, candidates []*types.Memory) string {
    var sb strings.Builder
    sb.WriteString("Rate each passage's relevance to the query on a scale of 1-5.\n\n")
    sb.WriteString(fmt.Sprintf("Query: %s\n\n", query))
    for i, m := range candidates {
        sb.WriteString(fmt.Sprintf("[%d] %s\n\n", i+1, truncate(m.Summary, 200)))
    }
    sb.WriteString("Output format: [N]:score (e.g., [1]:5, [2]:3, [3]:1)")
    return sb.String()
}
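
The `[N]:score` output format requested by the prompt can be parsed with a small regular expression. The sketch below is one possible shape for that step; `parseScores` is a hypothetical stand-in and its return type is an assumption, not the signature of parseRerankScores.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rerankPair matches "[N]:score" with an optional space around the colon.
var rerankPair = regexp.MustCompile(`\[(\d+)\]\s*:\s*([1-5])`)

// parseScores extracts "[N]:score" pairs into a slice indexed by candidate position.
// Candidates the model omitted, and indexes outside 1..n, keep a score of 0.
func parseScores(response string, n int) []int {
	scores := make([]int, n)
	for _, m := range rerankPair.FindAllStringSubmatch(response, -1) {
		idx, err := strconv.Atoi(m[1])
		if err != nil || idx < 1 || idx > n {
			continue
		}
		score, _ := strconv.Atoi(m[2]) // The regex guarantees a single digit 1-5
		scores[idx-1] = score
	}
	return scores
}

func main() {
	fmt.Println(parseScores("[1]:5, [2]:3, [3]: 1", 3)) // prints: [5 3 1]
}
```

Treating missing entries as 0 keeps a partial LLM answer usable: unscored candidates simply sort last.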

7.6 Context Window Budgeting

// ContextBudget manages how much memory context to inject into the LLM window.
type ContextBudget struct {
    MaxTokens       int // Total token budget for memory context
    MaxResults      int // Maximum number of results
    TokensPerResult int // Average tokens to allocate per result
}

func DefaultContextBudget() ContextBudget {
    return ContextBudget{
        MaxTokens:       2048, // ~10-15% of a typical 16K context window
        MaxResults:      8,
        TokensPerResult: 200,
    }
}

// Allocate distributes the token budget across results, truncating as needed.
func (cb ContextBudget) Allocate(results []ScoredID, memLookup func(string) *Memory) []ContextItem {
    budget := cb.MaxTokens
    var items []ContextItem

    for _, r := range results {
        if len(items) >= cb.MaxResults || budget <= 0 {
            break
        }
        mem := memLookup(r.ID)
        if mem == nil {
            continue
        }

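        // Allocate tokens: give more to high-score results.
        // The 0.8 cutoff assumes r.Score is normalized to [0, 1]; with the default
        // weights and k=60, the two RRF terms in 7.2 sum to at most about 0.015.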
        allocation := min(cb.TokensPerResult, budget)
        if r.Score > 0.8 {
            allocation = min(cb.TokensPerResult*2, budget)
        }

        content := truncateToTokens(mem.Content, allocation)
        items = append(items, ContextItem{
            Memory:  mem,
            Content: content,
            Score:   r.Score,
        })
        budget -= allocation
    }

    return items
}
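
Allocate depends on a token-aware truncation helper. The sketch below approximates tokens by whitespace-separated words, which is an assumption made for illustration; a real tokenizer would count differently. `approxTruncate` is a hypothetical name, not the truncateToTokens used above.

```go
package main

import (
	"fmt"
	"strings"
)

// approxTruncate keeps at most maxTokens whitespace-separated words and marks the cut.
// Word count is only a rough proxy for model tokens.
func approxTruncate(s string, maxTokens int) string {
	if maxTokens <= 0 {
		return ""
	}
	words := strings.Fields(s)
	if len(words) <= maxTokens {
		return s
	}
	return strings.Join(words[:maxTokens], " ") + "..."
}

func main() {
	fmt.Println(approxTruncate("one two three four five", 3)) // prints: one two three...
}
```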

8. DYNAMIC MODEL ROUTING

8.1 Model Tier Definitions

TIER 3: Gemini 2.5 Flash Lite  (fastest, cheapest, lowest quality)
    Use:      Initial retrieval candidate generation, keyword extraction
    Latency:  ~100ms
    Cost:     $0.01875 / 1M input tokens
    Limits:   4000 RPM

TIER 2: Gemini 2.5 Flash        (balanced speed/quality)
    Use:      Memory curation analysis, re-ranking, importance classification
    Latency:  ~300ms
    Cost:     $0.15 / 1M input tokens
    Limits:   2000 RPM

TIER 1: Gemini 2.5 Pro          (highest quality, slowest)
    Use:      Causal relationship extraction, pattern synthesis, contradiction detection
    Latency:  ~800ms
    Cost:     $1.25 / 1M input tokens
    Limits:   200 RPM

TIER 0: Local (none)            (no LLM call needed)
    Use:      BM25 search, embedding, cosine similarity, importance scoring

8.2 Model Selection Logic

// RouteModel determines which LLM tier to use for a given operation.
func RouteModel(op Operation, complexity ComplexityScore) ModelTier {
    switch {
    case op.CanBeLocal():
        return TierLocal  // Skip LLM entirely

    case op == OpKeywordExtraction || op == OpTypeClassification:
        return TierFlashLite  // Simple classification tasks

    case op == OpMemoryCuration || op == OpReRanking:
        return TierFlash  // Needs reasoning but limited context

    case op == OpCausalExtraction || op == OpPatternSynthesis:
        return TierPro  // Deep reasoning on large context

    case complexity == ComplexityHigh:
        return TierPro  // Fall back to best model for hard problems

    default:
        return TierFlash  // Safe default
    }
}

type ComplexityScore int
const (
    ComplexityLow    ComplexityScore = iota  // Simple factual query
    ComplexityMedium                         // Multi-part or ambiguous query
    ComplexityHigh                           // Requires deep reasoning
)

func AssessComplexity(query string, resultCount int, scoreSpread float64) ComplexityScore {
    score := ComplexityLow
    if len(strings.Fields(query)) > 10 {
        score = ComplexityMedium
    }
    if resultCount > 20 && scoreSpread < 0.1 {
        score = ComplexityHigh  // Many results with similar scores = ambiguous
    }
    lower := strings.ToLower(query) // Match "Why" and "Explain" as well
    if strings.Contains(lower, "why") || strings.Contains(lower, "explain") {
        score = ComplexityHigh
    }
    return score
}

8.3 Fallback Strategy

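// CallLLM attempts a model call with tiered fallback.
func CallLLM(ctx context.Context, prompt string, preferredTier ModelTier) (string, error) {
    var lastErr error
    for _, tier := range fallbackOrder(preferredTier) {
        // Stop early if the caller's context is already cancelled or expired
        if err := ctx.Err(); err != nil {
            return "", err
        }
        result, err := callWithTier(ctx, prompt, tier)
        if err == nil {
            return result, nil
        }
        // Log fallback, continue to next tier
        log.Printf("LLM tier %v failed: %v, falling back", tier, err)
        lastErr = err
    }

    // fallbackOrder always returns at least one tier, so lastErr is non-nil here
    return "", fmt.Errorf("all LLM tiers exhausted: %w", lastErr)
}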

func fallbackOrder(preferred ModelTier) []ModelTier {
    switch preferred {
    case TierLocal:
        return []ModelTier{TierLocal}
    case TierFlashLite:
        return []ModelTier{TierFlashLite, TierFlash, TierPro}
    case TierFlash:
        return []ModelTier{TierFlash, TierFlashLite, TierPro}
    case TierPro:
        return []ModelTier{TierPro, TierFlash, TierFlashLite}
    default:
        return []ModelTier{TierFlashLite, TierFlash, TierPro}
    }
}

8.4 When NOT To Call Any LLM

Operations that are always model-free:
- BM25 full-text search (FTS5 handles it)
- Vector similarity search (pure math)
- RRF score fusion (pure math)
- Importance scoring (heuristic, not LLM-based by default)
- Temporal decay (pure math)
- Embedding generation (embedding model, not LLM)

Operations that optionally use LLM:
- Memory type classification (heuristic first, LLM if unclear)
- Keyword extraction (TF-IDF first, LLM for quality boost)
- Re-ranking (only when score gap is ambiguous)
- Content summarization (extractive first, LLM for quality)

Operations that always use LLM (v0.3+):
- Causal relationship extraction
- Pattern synthesis from clusters
- Contradiction detection between memories
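
The "heuristic first, LLM if unclear" pattern for the optional operations can be sketched as a two-step classifier. The keyword rules below are invented for illustration and are not the rules in types_classifier.go; `classifyHeuristic` and `classify` are hypothetical names, and the "fact" default mirrors the default used by the Hermes plugin.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyHeuristic returns a memory type and true only when a cheap rule is confident.
// The rules here are illustrative placeholders.
func classifyHeuristic(text string) (string, bool) {
	lower := strings.ToLower(text)
	switch {
	case strings.Contains(lower, "we decided") || strings.Contains(lower, "decision:"):
		return "decision", true
	case strings.HasPrefix(lower, "step 1") || strings.Contains(lower, "how to "):
		return "procedure", true
	}
	return "", false
}

// classify tries the heuristic, then the optional LLM callback, then defaults to "fact".
func classify(text string, llm func(string) (string, error)) string {
	if t, ok := classifyHeuristic(text); ok {
		return t // No LLM call needed
	}
	if llm != nil {
		if t, err := llm(text); err == nil {
			return t
		}
	}
	return "fact"
}

func main() {
	fmt.Println(classify("We decided to use SQLite", nil)) // prints: decision
	fmt.Println(classify("The sky is blue", nil))          // prints: fact
}
```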


9. API DESIGN

9.1 Hermes MemoryProvider Contract

# plugins/hermes/provider.py
# This is the canonical implementation that all Hermes agents use.

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import json, subprocess, os

@dataclass
class Memory:
    id: str
    content: str
    summary: str
    mem_type: str        # fact|pattern|decision|procedure|context
    importance: float
    score: float          # Relevance score from query
    metadata: dict = field(default_factory=dict)

class RetrieverProvider:
    """
    Hermes MemoryProvider implementation backed by the `rv` Go binary.
    Communicates via subprocess (stdout JSON), same contract as ByteRover's `brv`.
    """

    def __init__(self, binary_path: str = "rv", db_path: Optional[str] = None):
        self.binary = binary_path
        self.db_path = db_path
        self._verify_binary()

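    def _verify_binary(self):
        """Fail fast if the `rv` binary is missing, hangs, or exits non-zero."""
        try:
            result = subprocess.run(
                [self.binary, "status"],
                capture_output=True, text=True, timeout=5
            )
        except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
            # A missing binary raises FileNotFoundError rather than returning a code
            raise RuntimeError(f"rv binary not usable: {exc}") from exc
        if result.returncode != 0:
            raise RuntimeError(f"rv binary not functional: {result.stderr}")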

    def _run(self, *args: str, timeout: int = 10) -> dict:
        """Execute bower command and return parsed JSON response."""
        cmd = [self.binary] + list(args)
        if self.db_path:
            cmd.extend(["--db-path", self.db_path])

        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )

        if result.returncode != 0:
            raise RuntimeError(f"rv failed: {result.stderr}")

        data = json.loads(result.stdout)
        if not data.get("success", False):
            raise RuntimeError(f"rv error: {data.get('data', {}).get('error', 'unknown')}")

        return data["data"]

    # ── MemoryProvider ABC methods ──────────────────────────────────

    def prefetch(self, query: str, limit: int = 5) -> list[Memory]:
        """
        Called BEFORE each LLM turn. Returns context to inject.
        Uses hybrid search (BM25 + vector + graph).
        """
        data = self._run("query", query, "--limit", str(limit))
        return [self._to_memory(r["memory"], r["score"]) for r in data["results"]]

    def sync_turn(self, messages: list[dict]) -> None:
        """
        Called AFTER each LLM turn. Extracts and stores memories.
        Messages is a list of {"role": "...", "content": "..."} dicts.
        """
        # Extract the assistant's response for curation
        assistant_texts = [
            m["content"] for m in messages
            if m.get("role") == "assistant" and len(m.get("content", "")) > 100
        ]
        for text in assistant_texts:
            self._run("curate", text)

    def query(self, query: str, limit: int = 10) -> list[Memory]:
        """Explicit memory search (used by agent tools)."""
        return self.prefetch(query, limit)

    def curate(self, content: str, mem_type: str = "", tags: Optional[list[str]] = None) -> dict:
        """Explicit memory storage."""
        args = ["curate", content]
        if mem_type:
            args.extend(["--type", mem_type])
        if tags:
            args.extend(["--tags", ",".join(tags)])
        return self._run(*args)

    def status(self) -> dict:
        """System statistics."""
        return self._run("status")

    # ── Internal helpers ────────────────────────────────────────────

    def _to_memory(self, mem_data: dict, score: float) -> Memory:
        return Memory(
            id=mem_data["id"],
            content=mem_data["content"],
            summary=mem_data.get("summary", ""),
            mem_type=mem_data.get("type", "fact"),
            importance=mem_data.get("importance", 0.5),
            score=score,
            metadata=mem_data.get("metadata", {}),
        )

9.2 CLI Interface Specification

Command format:

bower <command> [arguments] [options]

Standard output envelope (all commands):

{
  "command": "query",
  "success": true,
  "data": { ... },
  "timestamp": "2026-06-02T22:00:00Z"
}

Error output envelope:

{
  "command": "query",
  "success": false,
  "data": {
    "error": "descriptive error message",
    "status": "error"
  }
}
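
A consumer can decode both envelopes with one struct by leaving `data` raw until the command is known. A minimal sketch, assuming only the fields shown above; `Envelope` and `decodeEnvelope` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Envelope mirrors the JSON envelope shown above; Data stays raw until the command is known.
type Envelope struct {
	Command   string          `json:"command"`
	Success   bool            `json:"success"`
	Data      json.RawMessage `json:"data"`
	Timestamp string          `json:"timestamp,omitempty"`
}

// decodeEnvelope returns the raw data payload, or an error built from data.error.
func decodeEnvelope(raw []byte) (json.RawMessage, error) {
	var env Envelope
	if err := json.Unmarshal(raw, &env); err != nil {
		return nil, fmt.Errorf("invalid envelope: %w", err)
	}
	if !env.Success {
		var e struct {
			Error string `json:"error"`
		}
		if err := json.Unmarshal(env.Data, &e); err != nil || e.Error == "" {
			return nil, errors.New("unknown error")
		}
		return nil, errors.New(e.Error)
	}
	return env.Data, nil
}

func main() {
	_, err := decodeEnvelope([]byte(`{"command":"query","success":false,"data":{"error":"descriptive error message","status":"error"}}`))
	fmt.Println(err) // prints: descriptive error message
}
```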

Commands:

| Command | Arguments | Options | Output (data field) |
| --- | --- | --- | --- |
| bower query <text> | Query text | --limit N (default 10, max 50), --fast (BM25 only), --format json\|text | SearchResponse |
| bower curate <text> | Content text | --type fact\|pattern\|decision\|procedure\|context, --tags tag1,tag2 | CurateResponse |
| bower status | None | None | StatusResponse |
| bower prune [--threshold 0.05] | None | --threshold N (default 0.05), --dry-run | {pruned: N, remaining: M} |
| bower consolidate | None | --similarity 0.90 | {merged: N, patterns_created: M} |
| bower warm | None | --all, --recent N | {newly_cached: N, total: M} |
| bower bench <type> | latency\|recall\|resources | --corpus-size N, --iterations N | Benchmark results |
| bower serve | None | --port N (default 8787), --prune-interval 24h | HTTP daemon startup |
| bower mcp | None | None | MCP STDIO server |
| bower migrate from-brv | Source path | --path /path/to/brv/store | {migrated: N, errors: []} |
| bower version | None | None | {version: "0.1.0", commit: "abc123"} |

9.3 HTTP API (v0.2+)

POST /api/v1/query
    Request:  {"query": "...", "limit": 10, "fast": false}
    Response: SearchResponse

POST /api/v1/curate
    Request:  {"content": "...", "type": "fact", "tags": ["..."], "metadata": {}}
    Response: CurateResponse

GET /api/v1/status
    Response: StatusResponse

POST /api/v1/prune
    Request:  {"threshold": 0.05, "dry_run": false}
    Response: {"pruned": N, "remaining": M}

GET /health
    Response: {"status": "ok", "version": "0.1.0"}

9.4 MCP Server (v0.2+)

Tools exposed via MCP STDIO transport:

| Tool Name | Description | Parameters |
| --- | --- | --- |
| memory_search | Search memories | query (string), limit (int, optional, default 10) |
| memory_store | Store a memory | content (string), type (string, optional), tags (array, optional) |
| memory_status | Get system stats | None |
| memory_prune | Prune decayed memories | threshold (float, optional, default 0.05) |
| memory_consolidate | Merge similar memories | None |

STDIO message format: JSON-RPC 2.0, same as all MCP servers.
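
For reference, an MCP client invokes one of these tools with a JSON-RPC 2.0 `tools/call` request written as a single line on stdin. The sketch below builds such a request for `memory_search`; the Go type names and `newToolCall` are hypothetical, and the argument values are examples only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a JSON-RPC 2.0 request carrying an MCP tool invocation.
type rpcRequest struct {
	JSONRPC string   `json:"jsonrpc"`
	ID      int      `json:"id"`
	Method  string   `json:"method"`
	Params  toolCall `json:"params"`
}

// toolCall names the tool and passes its parameters as a JSON object.
type toolCall struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// newToolCall serializes a tools/call request for the given tool and arguments.
func newToolCall(id int, tool string, args map[string]any) ([]byte, error) {
	return json.Marshal(rpcRequest{
		JSONRPC: "2.0",
		ID:      id,
		Method:  "tools/call",
		Params:  toolCall{Name: tool, Arguments: args},
	})
}

func main() {
	b, err := newToolCall(1, "memory_search", map[string]any{"query": "deploy steps", "limit": 5})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```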


10. PERFORMANCE TARGETS

10.1 Latency Budget Per Operation

OPERATION                                 TARGET   p50       p95       p99
────────────────────────────────────────────────────────────────────────────
CLI startup (cold)                        <15ms    10ms      15ms      20ms
CLI startup (warm, after first)           <10ms    6ms       10ms      12ms
bower query (cached embedding)            <50ms    25ms      45ms      75ms
bower query (API embedding, cache miss)   <100ms   60ms      85ms      120ms
bower query --fast (BM25 only)            <15ms    6ms       12ms      18ms
bower curate (new memory)                 <100ms   55ms      90ms      150ms
bower curate (duplicate, merge)           <120ms   65ms      100ms     180ms
bower status                              <5ms     2ms       4ms       8ms
bower prune (100K corpus)                 <500ms   350ms     500ms     800ms
bower consolidate (100K, 90% similarity)  <2s      1.2s      1.8s      3s

EMBEDDING (single)
Gemini API (network warm)                 ~50ms    45ms      65ms      100ms
ONNX all-MiniLM-L6-v2 (384d)              ~5ms     3ms       6ms       8ms
ONNX bge-base-en-v1.5 (768d)              ~15ms    12ms      18ms      22ms

VECTOR SEARCH (brute-force, 8 shards)
1K vectors                                <1ms     0.5ms     1ms       1.5ms
10K vectors                               <5ms     2ms       4ms       6ms
50K vectors                               <15ms    8ms       12ms      18ms
100K vectors                              <30ms    15ms      25ms      35ms
500K vectors (int8)                       <50ms    35ms      48ms      60ms

BM25 SEARCH (FTS5)
Any corpus size                           <5ms     2ms       4ms       6ms

RRF FUSION                                <1ms     0.3ms     0.5ms     1ms

GRAPH EXPANSION (1-hop, top-5)            <5ms     2ms       4ms       6ms

LLM RE-RANKING (Gemini 2.5 Flash)         ~400ms   350ms     500ms     800ms

10.2 Memory Usage Targets

COMPONENT                          <10K corpus   <100K corpus   <500K corpus
─────────────────────────────────────────────────────────────────────────
Go runtime + heap                   ~20 MB        ~30 MB         ~50 MB
SQLite page cache (configured)      ~32 MB        ~32 MB         ~32 MB
Vector index (float32)              ~27 MB        ~270 MB        ~1.3 GB
Vector index (int8, v0.2)           ~7 MB         ~68 MB         ~340 MB
Embedding cache (in-memory LRU)     ~30 MB        ~30 MB         ~30 MB
Graph adjacency (estimated)         ~5 MB         ~50 MB         ~250 MB
─────────────────────────────────────────────────────────────────────────
TOTAL RAM (float32)                 ~114 MB       ~412 MB        ~1.7 GB
TOTAL RAM (int8, v0.2)              ~94 MB        ~210 MB        ~700 MB

10.3 Disk Usage Targets

COMPONENT                        PER-MEMORY        <10K corpus   <100K corpus   <500K corpus
────────────────────────────────────────────────────────────────────────────────────────────
memories table (row)             ~500 B avg
embeddings table (768d float32)  ~3.1 KB
embeddings table (384d float32)  ~1.6 KB
FTS5 index                       ~30% overhead
relations (per edge)             ~80 B
embedding_cache (per entry)      ~3.1 KB
audit_log                        ~100 B per event
────────────────────────────────────────────────────────────────────────────────────────────
TOTAL (.db file, 768d)                             ~40 MB        ~350 MB        ~1.7 GB

10.4 Concurrency Model

ARCHITECTURE: Single-writer, multi-reader
    - SQLite: 1 write connection (MaxOpenConns=1), unlimited readers via WAL
    - Vector index: sync.RWMutex (many parallel reads, exclusive write)
    - Embedding cache: sync.RWMutex (many parallel reads, exclusive write)
    - CLI: One process per invocation (no intra-process concurrency needed)
    - HTTP daemon: goroutine per request; reads run concurrently, writes are serialized by the SQLite write lock

GOROUTINE POOL:
    - Vector search shards: GOMAXPROCS (typically 4-16)
    - Embedding API calls: semaphore of 10 goroutines max
    - Background tasks (prune, warm): 1 goroutine each

11. PROJECT STRUCTURE

11.1 Monorepo Layout

retriever/
├── cmd/
│   └── rv/                            # CLI binary (single binary for all modes)
│       ├── main.go                     # Entry point, command dispatch
│       ├── main_test.go
│       ├── query.go                    # bower query handler
│       ├── curate.go                   # bower curate handler
│       ├── status.go                   # bower status handler
│       ├── serve.go                    # bower serve HTTP daemon (v0.2)
│       ├── mcp.go                      # bower mcp STDIO server (v0.2)
│       ├── prune.go                    # bower prune command
│       ├── consolidate.go              # bower consolidate command (v0.3)
│       ├── warm.go                     # bower warm command (v0.2)
│       ├── bench.go                    # bower bench commands (v0.2)
│       ├── migrate.go                  # bower migrate from-brv (v0.5)
│       └── version.go                  # bower version
├── pkg/
│   ├── types/                          # Core data types (zero internal deps)
│   │   ├── types.go                    # Memory, SearchResult, CurateRequest, etc.
│   │   └── types_test.go
│   │
│   ├── config/                         # Configuration management
│   │   ├── config.go                   # Config struct, Load/Save, defaults
│   │   └── config_test.go
│   │
│   ├── storage/                        # SQLite persistence layer
│   │   ├── db.go                       # DB struct, Open, Close, migration
│   │   ├── memories.go                 # Memory CRUD operations
│   │   ├── embeddings.go               # Embedding BLOB storage/retrieval
│   │   ├── relations.go                # Graph edge CRUD
│   │   ├── fts.go                      # FTS5 search, query sanitization
│   │   ├── audit.go                    # Audit log operations (v0.2)
│   │   ├── migration.go                # Schema versioning and migrations
│   │   ├── vector_utils.go             # float32<->bytes conversion, cosine
│   │   └── storage_test.go
│   │
│   ├── embedding/                      # Embedding generation pipeline
│   │   ├── embedder.go                 # Embedder interface + Cache interface
│   │   ├── gemini.go                   # Gemini API embedder
│   │   ├── openai.go                   # OpenAI API embedder (v0.2)
│   │   ├── voyage.go                   # Voyage API embedder (v0.2)
│   │   ├── onnx.go                     # Local ONNX embedder (v0.2)
│   │   ├── cache.go                    # InMemoryCache (LRU)
│   │   ├── cache_persist.go            # SQLite-backed persistent cache (v0.2)
│   │   ├── provider.go                 # Auto-selection logic
│   │   └── embedding_test.go
│   │
│   ├── search/                         # Search engine
│   │   ├── engine.go                   # Engine struct, Search, FastSearch
│   │   ├── bm25.go                     # FTS5 BM25 search wrapper
│   │   ├── vector_index.go             # In-memory vector index
│   │   ├── vector_int8.go              # int8-quantized index (v0.2)
│   │   ├── vector_hnsw.go              # HNSW index (v0.4)
│   │   ├── hybrid.go                   # RRF fusion, multi-source merge
│   │   ├── rerank.go                   # LLM re-ranking (v0.2)
│   │   ├── graph_expansion.go          # 1-hop neighbor retrieval
│   │   ├── context_budget.go           # Context window allocation
│   │   ├── query_analysis.go           # Intent extraction, type classification
│   │   └── search_test.go
│   │
│   ├── memory/                         # High-level memory service
│   │   ├── memory.go                   # Type re-exports (backward compat)
│   │   ├── service.go                  # Service: Curate, Query, Status
│   │   ├── importance.go               # Importance scoring algorithm
│   │   ├── decay.go                    # Temporal decay model
│   │   ├── merge.go                    # Duplicate detection + merge logic
│   │   ├── consolidation.go            # Semantic compression (v0.3)
│   │   ├── conflict.go                 # Contradiction resolution (v0.3)
│   │   ├── types_classifier.go         # Heuristic memory type classification
│   │   ├── keywords.go                 # Keyword extraction
│   │   └── service_test.go
│   │
│   ├── graph/                          # Graph operations (v0.2+)
│   │   ├── causal.go                   # Causal chain extraction (v0.3)
│   │   ├── traversal.go                # Multi-hop path traversal
│   │   ├── pattern.go                  # Pattern detection across graphs (v0.3)
│   │   └── graph_test.go
│   │
│   ├── model/                          # LLM model routing (v0.2+)
│   │   ├── router.go                   # Model tier selection
│   │   ├── caller.go                   # LLM call abstraction
│   │   ├── fallback.go                 # Tiered fallback strategy
│   │   └── router_test.go
│   │
│   ├── mcp/                            # MCP server (v0.2)
│   │   ├── server.go                   # STDIO MCP server
│   │   ├── tools.go                    # Tool definitions
│   │   ├── handlers.go                 # Tool handlers
│   │   └── mcp_test.go
│   │
│   └── http/                           # HTTP server (v0.2)
│       ├── server.go                   # HTTP daemon setup
│       ├── handlers.go                 # Route handlers
│       ├── middleware.go               # Logging, recovery, CORS
│       └── http_test.go
├── plugins/
│   └── hermes/                         # Hermes MemoryProvider plugin
│       ├── __init__.py
│       ├── provider.py                 # RetrieverProvider implementing MemoryProvider ABC
│       ├── pyproject.toml              # Python package metadata
│       ├── plugin.yaml                 # Hermes plugin manifest
│       └── tests/
│           ├── __init__.py
│           ├── test_provider.py
│           └── conftest.py
├── benchmarks/                         # Performance benchmark suite
│   ├── latency_test.go                 # End-to-end latency benchmarks
│   ├── recall_test.go                  # Retrieval quality benchmarks
│   ├── resources_test.go               # Memory/disk usage benchmarks
│   ├── concurrency_test.go             # Concurrent access benchmarks
│   └── fixtures/                       # Test corpora
│       ├── small_1k.json
│       ├── medium_10k.json
│       └── large_100k.json (generated)
├── scripts/                            # Build and development scripts
│   ├── install.sh                      # One-line installer: curl ... | sh
│   ├── download_models.sh              # Download ONNX models for offline use
│   ├── generate_corpus.go              # Synthetic corpus generator
│   └── release.sh                      # Cross-compile + create GitHub release
├── docs/                               # Documentation
│   ├── README.md                       # Project overview, quickstart
│   ├── INSTALL.md                      # Installation instructions
│   ├── CLI.md                          # CLI reference
│   ├── HERMES.md                       # Hermes integration guide
│   ├── MCP.md                          # MCP server setup (v0.2)
│   └── CONTRIBUTING.md                 # Development setup, PR process
├── go.mod                              # Go module definition
├── go.sum                              # Go dependency checksums
├── Makefile                            # Build targets
├── ARCHITECTURE.md                     # This file
├── .gitignore
└── LICENSE                             # MIT

11.2 Package Dependency Rules

Dependency direction (top-down, NO cycles):

cmd/rv
  └── pkg/memory
      ├── pkg/search
      │   ├── pkg/storage
      │   │   └── pkg/types        <-- leaf package, no internal deps
      │   ├── pkg/embedding
      │   │   └── pkg/types
      │   └── pkg/graph
      │       ├── pkg/storage
      │       └── pkg/types
      ├── pkg/config
      │   └── pkg/types
      └── pkg/model
          └── pkg/types

Rule: pkg/types MUST NOT import any other retriever package.
Rule: pkg/config MUST NOT import pkg/storage or pkg/embedding.
Rule: packages import only what they directly need (no transitive convenience imports).
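
The rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming the import edges are transcribed by hand from the tree (a real check would more likely derive them from `go list` output). The names `deps`, `forbidden`, `hasCycle`, and `violations` are hypothetical and do not come from the codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// deps transcribes the tree in 11.2: each package maps to the internal
// packages it imports directly. Hand-maintained here for illustration only.
var deps = map[string][]string{
	"cmd/rv":        {"pkg/memory"},
	"pkg/memory":    {"pkg/search", "pkg/config", "pkg/model"},
	"pkg/search":    {"pkg/storage", "pkg/embedding", "pkg/graph"},
	"pkg/storage":   {"pkg/types"},
	"pkg/embedding": {"pkg/types"},
	"pkg/graph":     {"pkg/storage", "pkg/types"},
	"pkg/config":    {"pkg/types"},
	"pkg/model":     {"pkg/types"},
	"pkg/types":     {},
}

// forbidden lists the explicit "MUST NOT import" rules for pkg/config.
var forbidden = map[string][]string{
	"pkg/config": {"pkg/storage", "pkg/embedding"},
}

// hasCycle reports whether the import graph contains a cycle, using a
// depth-first search with white/grey/black node states.
func hasCycle(g map[string][]string) bool {
	const (
		white = iota // not visited yet
		grey         // on the current DFS path
		black        // fully explored
	)
	state := make(map[string]int)
	var visit func(string) bool
	visit = func(n string) bool {
		switch state[n] {
		case grey:
			return true // back edge: n is already on the path
		case black:
			return false
		}
		state[n] = grey
		for _, m := range g[n] {
			if visit(m) {
				return true
			}
		}
		state[n] = black
		return false
	}
	for n := range g {
		if visit(n) {
			return true
		}
	}
	return false
}

// violations returns every rule breach as "importer -> imported", sorted.
func violations(g map[string][]string) []string {
	var out []string
	// pkg/types is the leaf package: any internal import is a breach.
	for _, m := range g["pkg/types"] {
		out = append(out, "pkg/types -> "+m)
	}
	for pkg, banned := range forbidden {
		for _, m := range g[pkg] {
			for _, b := range banned {
				if m == b {
					out = append(out, pkg+" -> "+m)
				}
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println("cycle:", hasCycle(deps))
	fmt.Println("violations:", violations(deps))
}
```

For the graph as drawn in 11.2, this reports no cycle and an empty violation list. Adding an edge such as `pkg/types -> pkg/storage` would trip both checks.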

11.3 Build System

Makefile targets:

.PHONY: build build-fast install test bench clean dist lint vet fmt deps smoke-test coverage

build:        go build -ldflags="-s -w" -o build/rv ./cmd/rv
build-fast:   go build -o build/rv ./cmd/rv        # No strip, for dev iteration
install:      go install ./cmd/rv                   # Install to $GOPATH/bin
test:         go test -race -count=1 ./...          # All tests with race detector
bench:        go test -bench=. -benchmem ./...      # All benchmarks
clean:        rm -rf build/ && go clean -cache
dist:                                               # Cross-compile for release; go-sqlite3 needs CGO_ENABLED=1 and a C cross-compiler for each target
    GOOS=linux   GOARCH=amd64 go build -ldflags="-s -w" -o build/rv-linux-amd64   ./cmd/rv
    GOOS=linux   GOARCH=arm64 go build -ldflags="-s -w" -o build/rv-linux-arm64   ./cmd/rv
    GOOS=darwin  GOARCH=amd64 go build -ldflags="-s -w" -o build/rv-darwin-amd64  ./cmd/rv
    GOOS=darwin  GOARCH=arm64 go build -ldflags="-s -w" -o build/rv-darwin-arm64  ./cmd/rv
lint:         golangci-lint run ./...
vet:          go vet ./...
fmt:          go fmt ./...
deps:         go mod download && go mod tidy
smoke-test:   $(MAKE) build && build/rv status      # Build, then check that the binary runs
coverage:     go test -coverprofile=coverage.out ./... && go tool cover -html=coverage.out

CI pipeline (GitHub Actions):

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.26' }
      - run: go test -race -count=1 ./...
      - run: go test -coverprofile=coverage.out ./...
      - run: go tool cover -func=coverage.out | awk '/^total:/ { sub(/%/, "", $3); ok = ($3 + 0 >= 80) } END { exit !ok }'

  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.26' }
      - run: go test -bench=. -benchtime=1s ./...
      - name: Check latency
        run: |
          go build -o build/rv ./cmd/rv
          build/rv bench latency --corpus-size 1000 --iterations 50

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with: { go-version: '1.26' }
      - uses: golangci/golangci-lint-action@v6
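
The coverage step in the `test` job gates on the `total:` line printed by `go tool cover -func`. That value is a decimal percentage (for example `83.4%`), so it has to be compared as a floating-point number rather than an integer. Below is a sketch of the same gate in Go, assuming the usual `total: (statements) NN.N%` line shape. `parseTotalCoverage` is a hypothetical helper, and the sample report text is made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseTotalCoverage scans `go tool cover -func` output and returns the
// percentage found on the line that starts with "total:".
func parseTotalCoverage(report string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[0] == "total:" {
			// The last field looks like "83.4%"; drop the percent sign.
			pct := strings.TrimSuffix(fields[len(fields)-1], "%")
			return strconv.ParseFloat(pct, 64)
		}
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("no total line in coverage report")
}

func main() {
	// Sample report; the file and function names are invented.
	report := "example.com/m/a.go:10:\tOpen\t\t75.0%\n" +
		"total:\t\t\t(statements)\t83.4%\n"
	pct, err := parseTotalCoverage(report)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("coverage %.1f%%, gate passed: %v\n", pct, pct >= 80)
}
```

Returning an error when no `total:` line is found keeps an empty or truncated report from passing the gate silently.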

11.4 Go Module

module github.com/retriever/memory

go 1.26.3

require (
    github.com/mattn/go-sqlite3 v1.14.44  // SQLite driver (CGo)
)

Current dependencies (v1.0):

github.com/mattn/go-sqlite3 v1.14.44   // SQLite driver (CGo, with FTS5)

Future dependencies (when features land):

v0.2: TBD HNSW library or chromem-go integration  // For >500K vector corpora
v0.3: github.com/schollz/progressbar              // Progress bars for warm/consolidate

APPENDIX A: Type Definitions (Canonical Reference)

// pkg/types/types.go

package types

import "time"

type MemType string
const (
    MemFact      MemType = "fact"       // Isolated piece of information
    MemPattern   MemType = "pattern"    // Recurring observation or convention
    MemDecision  MemType = "decision"   // Choice made with rationale
    MemProcedure MemType = "procedure"  // How to accomplish something
    MemContext   MemType = "context"    // Situational background
)

type Memory struct {
    ID             string            `json:"id"`
    Type           MemType           `json:"type"`
    Content        string            `json:"content"`
    Summary        string            `json:"summary"`
    Embedding      []float32         `json:"embedding,omitempty"`
    Importance     float64           `json:"importance"`
    AccessCount    int               `json:"access_count"`
    CreateTime     time.Time         `json:"create_time"`
    AccessTime     time.Time         `json:"access_time"`
    DecayRate      float64           `json:"decay_rate"`
    SourceConvID   string            `json:"source_conv_id,omitempty"`
    SupersedesID   string            `json:"supersedes_id,omitempty"`
    Confidence     float64           `json:"confidence"`
    Tags           []string          `json:"tags"`
    Keywords       []string          `json:"keywords"`
    Relations      []Relation        `json:"relations,omitempty"`
    Metadata       map[string]string `json:"metadata,omitempty"`
}

type Relation struct {
    TargetID string  `json:"target_id"`
    Type     string  `json:"type"`     // causes|informs|contradicts|supersedes|example_of|prerequisite_for|led_to|related_to
    Strength float64 `json:"strength"` // [0.0, 1.0]
}

type SearchResult struct {
    Memory   Memory  `json:"memory"`
    Score    float64 `json:"score"`
    BM25     float64 `json:"bm25,omitempty"`
    Semantic float64 `json:"semantic,omitempty"`
    Graph    float64 `json:"graph,omitempty"`
}

type SearchResponse struct {
    Query      string         `json:"query"`
    Results    []SearchResult `json:"results"`
    TotalFound int            `json:"total_found"`
    TookMs     float64        `json:"took_ms"`
    ModelUsed  string         `json:"model_used,omitempty"`
}

type CurateRequest struct {
    Content  string            `json:"content"`
    Type     MemType           `json:"type,omitempty"`
    Tags     []string          `json:"tags,omitempty"`
    Keywords []string          `json:"keywords,omitempty"`
    Metadata map[string]string `json:"metadata,omitempty"`
}

type CurateResponse struct {
    ID       string  `json:"id"`
    Summary  string  `json:"summary"`
    Type     MemType `json:"type"`
    TookMs   float64 `json:"took_ms"`
    IsUpdate bool    `json:"is_update"`
}

type StatusResponse struct {
    TotalMemories   int             `json:"total_memories"`
    TotalSizeBytes  int64           `json:"total_size_bytes"`
    ByType          map[MemType]int `json:"by_type"`
    AvgImportance   float64         `json:"avg_importance"`
    IndexHealthy    bool            `json:"index_healthy"`
    DBPath          string          `json:"db_path"`
    VectorIndexTier string          `json:"vector_index_tier,omitempty"`
}
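
The struct tags above determine the JSON wire format. As a small illustration of how `omitempty` shapes a request, the sketch below repeats `CurateRequest` and one `MemType` constant so it compiles on its own. The `encode` helper and the sample content are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed copies of the Appendix A definitions, repeated so the example
// is self-contained.
type MemType string

const MemDecision MemType = "decision"

type CurateRequest struct {
	Content  string            `json:"content"`
	Type     MemType           `json:"type,omitempty"`
	Tags     []string          `json:"tags,omitempty"`
	Keywords []string          `json:"keywords,omitempty"`
	Metadata map[string]string `json:"metadata,omitempty"`
}

// encode marshals a request and returns the JSON text, or the error message.
func encode(req CurateRequest) string {
	b, err := json.Marshal(req)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// Only the required field is set: every omitempty field is dropped.
	fmt.Println(encode(CurateRequest{Content: "Use WAL mode"}))
	// Optional fields appear once they are non-empty.
	fmt.Println(encode(CurateRequest{
		Content: "Use WAL mode",
		Type:    MemDecision,
		Tags:    []string{"sqlite"},
	}))
}
```

One consequence worth noting: `Memory.Tags` and `Memory.Keywords` carry no `omitempty`, so with `encoding/json` a nil slice in those fields encodes as `null` rather than being omitted or written as `[]`.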

APPENDIX B: Embedder Interface (Canonical Reference)

// pkg/embedding/embedder.go

package embedding

import "context"

type Embedder interface {
    Embed(ctx context.Context, text string) ([]float32, error)
    EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
    Dimensions() int
    ModelName() string
    IsLocal() bool       // Returns true for ONNX, false for API-based
}

type EmbeddingCache interface {
    Get(hash string) ([]float32, bool)
    Set(hash string, embedding []float32)
    Size() int
    Persist() error      // Flush to SQLite if persistent (v0.2)
}

APPENDIX C: Version Compatibility Matrix

| Component | Minimum Version | Recommended | Notes |
|---|---|---|---|
| Go | 1.26.0 | 1.26.3 | Uses range-over-func and iter patterns (language features since Go 1.23) |
| SQLite | 3.35.0 | 3.44.0+ | FTS5 triggers, WAL mode, RETURNING clause |
| Python (Hermes plugin) | 3.10 | 3.12+ | Uses match/case, subprocess with capture_output |
| Gemini API | v1beta | v1beta | text-embedding-004 model |
| ONNX Runtime | 1.17.0 | 1.19.0+ | For local embedding fallback |
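
A startup or diagnostics check against these minimums could look roughly like the sketch below. It assumes purely numeric dotted versions, so a value such as `v1beta` is rejected rather than compared. `compareVersions`, `component`, and `meetsMinimum` are hypothetical helpers, and the detected versions in `main` are invented.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted numeric versions such as "3.35.0" and
// returns -1, 0, or 1. Missing trailing components count as 0.
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		x, err := component(as, i)
		if err != nil {
			return 0, err
		}
		y, err := component(bs, i)
		if err != nil {
			return 0, err
		}
		switch {
		case x < y:
			return -1, nil
		case x > y:
			return 1, nil
		}
	}
	return 0, nil
}

// component parses the i-th part of a split version, or 0 if it is absent.
func component(parts []string, i int) (int, error) {
	if i >= len(parts) {
		return 0, nil
	}
	return strconv.Atoi(parts[i])
}

// meetsMinimum reports whether detected >= minimum.
func meetsMinimum(detected, minimum string) (bool, error) {
	c, err := compareVersions(detected, minimum)
	return c >= 0, err
}

func main() {
	// Minimums copied from the matrix; the detected values are made up.
	checks := []struct{ name, detected, minimum string }{
		{"SQLite", "3.44.2", "3.35.0"},
		{"ONNX Runtime", "1.16.3", "1.17.0"},
	}
	for _, c := range checks {
		ok, err := meetsMinimum(c.detected, c.minimum)
		if err != nil {
			fmt.Println(c.name, "error:", err)
			continue
		}
		fmt.Printf("%s %s (min %s): ok=%v\n", c.name, c.detected, c.minimum, ok)
	}
}
```

Components are compared as integers on purpose: as strings, Python `3.9` would sort after `3.10` and wrongly pass a 3.10 minimum.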

This is a living document. Every design choice has a rationale. When the rationale changes, update the document. When benchmarks contradict our assumptions, update the design. The architecture serves the latency budget -- not the other way around.