claudemem: Because grep Wasn't Cutting It
Local Semantic Code Search for Claude Code and AI Agents
TypeScript
Bun
Tree-sitter
LanceDB
OpenRouter
PageRank
Hybrid Search
MCP
The Problem with grep
I've been writing code for over twenty years. Last month I spent forty-five minutes searching for a function I wrote myself three weeks earlier.
Forty-five minutes. For code I wrote.
The function was called silentTokenRefresh. Of course it was. I'd typed "token" into grep. Got 847 results. None of them were the actual token handler. I'd scrolled right past it twice before giving up and asking a colleague who remembered the name.
I started timing these searches. Three hours hunting for "the thing that validates webhook signatures." Two hours finding where we actually persist user preferences. An entire afternoon tracing why a settings change wasn't propagating—turned out there were four different settings services, and I was looking at the wrong three.
grep doesn't care about my memory. It doesn't know that when I search for "auth" I probably mean the token refresh flow, not the 200 files that happen to contain the word "authentication" in a comment.
I was tired of feeling stupid in codebases I wrote myself.
The Numbers
| Metric | Result |
|---|---|
| Search accuracy (NDCG) | 175% vs baseline (voyage-code-3) |
| Embedding models supported | 15+ (cloud + local) |
| LLM summarizers benchmarked | 18 models across 6 evaluation methods |
| Languages supported | 8 (incl. TypeScript, Python, Go, Rust, C/C++, Java) |
| Privacy | 100% local - nothing leaves your machine |
| Index cost | ~$0.01 per 1M tokens (cloud) / $0 (local) |
| Distribution | npm, Homebrew, shell installer |
| License | MIT (fully open source) |
Semantic code search that finds what you mean, not what you type. 20-minute searches now take 3 seconds. 50+ repos. 100% local. My grep usage dropped to near zero.
The Architecture: Not Just Another Search Tool
claudemem builds a semantic graph of your codebase. Not a text index. Not a fuzzy matcher. An actual understanding of what calls what and why.
The indexer parses every file into an AST. Functions, classes, methods, calls—all of it becomes nodes in a graph. Then it runs PageRank. The same algorithm Google used to find important web pages, but pointed at your code.
This matters more than it sounds. When you search, results come back ranked by architectural importance. The core authentication handler ranks higher than the seventeen wrapper functions that call it. You find the heart of your system in seconds, not hours.
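The ranking idea can be sketched in a few lines. This is a minimal PageRank over a call graph, not claudemem's actual implementation; the graph shape, damping factor, and iteration count are illustrative assumptions.

```typescript
// Minimal PageRank sketch: nodes are symbols, edges point from caller
// to callee, so heavily-called symbols accumulate rank.
type CallGraph = Map<string, string[]>; // symbol -> symbols it calls

function pageRank(graph: CallGraph, damping = 0.85, iterations = 50): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map<string, number>(nodes.map((s) => [s, 1 / n] as [string, number]));

  for (let iter = 0; iter < iterations; iter++) {
    const next = new Map<string, number>(
      nodes.map((s) => [s, (1 - damping) / n] as [string, number])
    );
    for (const [caller, callees] of graph) {
      if (callees.length === 0) continue; // sketch: dangling mass is simply dropped
      const share = (damping * rank.get(caller)!) / callees.length;
      for (const callee of callees) {
        if (next.has(callee)) next.set(callee, next.get(callee)! + share);
      }
    }
    rank = next;
  }
  return rank;
}
```

Run it on three wrappers that all call one core handler and the handler ends up with the highest score, which is exactly the "find the heart of the system" behavior described above.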
Semantic search sits on top. When I search "refresh tokens silently," it doesn't just pattern-match those words. It understands token operations, refresh patterns, silent execution. It finds silentTokenRefresh even though my query didn't match the exact name.
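Under the hood this kind of search is vector similarity. claudemem delegates storage and nearest-neighbor lookup to LanceDB; the sketch below shows the core idea with a brute-force cosine ranking, using made-up chunk names.

```typescript
// Illustrative embedding search: rank code chunks by cosine similarity
// between the query embedding and each chunk's stored embedding.
type Chunk = { name: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

function search(query: number[], chunks: Chunk[], topK = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, topK);
}
```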
The caller/callee analysis closed the loop. Before touching any function, I see exactly what depends on it. Last month this caught a disaster—I was about to "clean up" a function that looked unused. claudemem showed ten callers. Nine were dead code we'd been maintaining for two years. The tenth was the payment processor.
We'd been maintaining code nobody called for two years. And we almost deleted code that would have broken payments.
This isn't a research project. It's the architecture I needed to stop wasting hours every week. Everything stays local in .claudemem/ in your project. Nothing goes to a server. Ever.
It changed how we work.
Embedding Model Benchmarks: We Tested Everything
Which embedding model is best for code search? We didn't guess. We measured.
Run claudemem benchmark to test models on your actual codebase. Here's what we found on real code search tasks:
| Model | Speed | NDCG (vs baseline) | Cost | Notes |
|---|---|---|---|---|
| voyage-code-3 | 4.5s | 175% | $0.007 | Best quality |
| gemini-embedding-001 | 2.9s | 170% | $0.007 | Great free option |
| voyage-3-large | 1.8s | 164% | $0.007 | Fast & accurate |
| voyage-3.5-lite | 1.2s | 163% | $0.001 | Best value (default) |
| voyage-3.5 | 1.2s | 150% | $0.002 | Fastest |
| mistral-embed | 16.6s | 150% | $0.006 | Slow |
| text-embedding-3-small | 3.0s | 141% | $0.001 | Decent |
| text-embedding-3-large | 3.1s | 141% | $0.005 | Not worth it |
| all-minilm-l6-v2 | 2.7s | 128% | $0.0001 | Cheapest (local) |
- Best Quality: voyage-code-3 (175% NDCG)
- Best Value: voyage-3.5-lite (163% NDCG, $0.001) - this is the default
- Fastest: voyage-3.5 (1.2s)
- Free/Local: all-minilm-l6-v2 via Ollama
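The NDCG percentages above are relative to a baseline. For readers unfamiliar with the metric, here is the standard formula it is built on: DCG rewards putting highly relevant results near the top, and NDCG normalizes by the best possible ordering. This is the textbook definition, not claudemem's benchmark harness.

```typescript
// Discounted cumulative gain over a list of graded relevances,
// in the order a model ranked them (index 0 = top result).
function dcg(relevances: number[]): number {
  return relevances.reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

// NDCG: the model's DCG divided by the DCG of the ideal ordering.
function ndcg(ranked: number[]): number {
  const ideal = [...ranked].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(ranked) / idealDcg;
}
```

A perfect ranking scores 1.0; anything that buries relevant results scores lower.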
LLM Summarizer Benchmarks: Which Model Describes Code Best?
claudemem generates natural language descriptions of code chunks. These descriptions power semantic search. Better descriptions = better search results.
We benchmarked 18 LLM models across 6 different evaluation methods:
Evaluation Methods
- Judge (Pointwise): LLM rates each description's quality 0-10
- Judge (Pairwise): LLM picks better description in head-to-head comparisons
- Contrastive Discrimination: Can the description identify its source code among alternatives?
- Retrieval: Does using the description improve search accuracy?
- Downstream Tasks: Does the description help with actual coding tasks?
- Self-Evaluation: Model rates its own outputs (calibration check)
| Model | Retrieval | Contrastive | Judge | Overall |
|---|---|---|---|---|
| gpt-5.1-codex-max | 23% | 83% | 78% | 57% |
| nova-premier-v1 | 27% | 79% | 51% | 56% |
| qwen3-235b-a22b-2507 | 13% | 92% | 79% | 55% |
| opus | 16% | 80% | 71% | 54% |
| deepseek-v3.2 | 13% | 82% | 74% | 52% |
| haiku | 7% | 82% | 69% | 49% |
- Fastest: haiku (3.7s avg latency)
- Best Quality: gpt-5.1-codex-max (57% overall)
- Best Value: deepseek-v3.2 (52% quality, low cost)
Symbol Graph: Beyond Search
Search is table stakes. The real power is the symbol graph.
claudemem tracks every reference between symbols. It computes PageRank scores based on how central each function/class is to your codebase. This enables:
Dead Code Detection
```
claudemem dead-code
```
Finds symbols with zero callers + low PageRank + not exported. Great for cleaning up unused code.
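The heuristic stated here is a simple conjunction, sketched below. The field names and threshold are illustrative assumptions, not claudemem's internal schema.

```typescript
// Hypothetical symbol record for the dead-code heuristic described above.
type SymbolInfo = {
  name: string;
  callers: string[];  // direct callers found in the symbol graph
  pageRank: number;   // architectural importance score
  exported: boolean;  // exported symbols may have external callers
};

// Dead-code candidate: nobody calls it, it is architecturally
// unimportant, and it is not part of the public surface.
function isDeadCodeCandidate(sym: SymbolInfo, rankThreshold = 0.001): boolean {
  return sym.callers.length === 0 && sym.pageRank < rankThreshold && !sym.exported;
}
```

The export check matters: an exported symbol with zero internal callers may still be someone else's entry point.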
Test Coverage Gaps
```
claudemem test-gaps
```
Finds high-PageRank symbols not called by any test file. Prioritize what to test next.
Change Impact Analysis
```
claudemem impact FileTracker
```
Shows all transitive callers, grouped by file. Understand the blast radius before refactoring.
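"Transitive callers" is a breadth-first walk over the reverse call graph. A minimal sketch, assuming a `symbol -> direct callers` map rather than claudemem's actual data structures:

```typescript
// Collect every symbol that directly or indirectly calls `target`
// by walking the reverse call graph breadth-first.
function transitiveCallers(
  callersOf: Map<string, string[]>, // symbol -> its direct callers
  target: string
): Set<string> {
  const seen = new Set<string>();
  const queue = [...(callersOf.get(target) ?? [])];
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (seen.has(current)) continue; // avoid cycles and repeats
    seen.add(current);
    queue.push(...(callersOf.get(current) ?? []));
  }
  return seen;
}
```

The `seen` set makes the walk safe on recursive call graphs, which real codebases always have.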
Symbol Navigation
```
claudemem symbol handleAuth    # find definition
claudemem callers handleAuth   # what calls this?
claudemem callees handleAuth   # what does this call?
claudemem context handleAuth   # all of the above
```
Self-Learning System (Experimental)
This is where it gets interesting.
Traditional ML validation assumes millions of samples and explicit labels. Our context is different: 50-500 sessions per project, no user ratings, data stays local.
claudemem's self-learning system uses implicit feedback signals:
| Signal Type | How Detected | Weight |
|---|---|---|
| Lexical Correction | User says "no", "wrong", "actually" | 0.30 |
| Strategy Pivot | Sudden change in tool usage after failure | 0.20 |
| Overwrite | User edits same file region agent modified | 0.35 |
| Reask | User repeats similar prompt | 0.15 |
The strongest signal: Code Survival Rate
```
code_survival_rate = lines_kept / lines_written_by_agent
```
If the user keeps the agent's code in their git commit, the agent did well. If they rewrite everything, it failed.
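The signal weights from the table and the survival-rate formula can be combined into a single score. The weights below come straight from the table; the summation itself is an illustrative assumption, not the shipped formula.

```typescript
// Weights for implicit negative-feedback signals, from the table above.
const SIGNAL_WEIGHTS = {
  lexicalCorrection: 0.3,  // user says "no", "wrong", "actually"
  strategyPivot: 0.2,      // sudden change in tool usage after failure
  overwrite: 0.35,         // user edits the region the agent modified
  reask: 0.15,             // user repeats a similar prompt
} as const;

type Signals = Partial<Record<keyof typeof SIGNAL_WEIGHTS, boolean>>;

// Sum the weights of whichever signals fired in a session.
function negativeFeedbackScore(signals: Signals): number {
  return (Object.keys(SIGNAL_WEIGHTS) as (keyof typeof SIGNAL_WEIGHTS)[])
    .reduce((sum, key) => sum + (signals[key] ? SIGNAL_WEIGHTS[key] : 0), 0);
}

// The survival-rate formula from above; a session with no agent-written
// lines is treated as neutral (1.0) by assumption.
function codeSurvivalRate(linesKept: number, linesWritten: number): number {
  return linesWritten === 0 ? 1 : linesKept / linesWritten;
}
```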
What Gets Generated
- Skills: Automatable sequences become slash commands
- Subagents: Error clusters become specialized agents
- Prompt Optimizations: Correction patterns become prompt additions
Safety Validation
Changes are tested against a Red Team before deployment:
- Red Team: Attacks with edge cases, malformed data, injections
- Blue Team: Defends with validation, sanitization, limits
- Safety Score: > 0.90 = Deploy, < 0.70 = Reject
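As a decision rule, the thresholds above look like this. The document only specifies the two extremes; routing the 0.70–0.90 band to manual review is my assumption.

```typescript
type Verdict = "deploy" | "review" | "reject";

// Gate a generated change on its safety score.
function safetyVerdict(score: number): Verdict {
  if (score > 0.9) return "deploy";
  if (score < 0.7) return "reject";
  return "review"; // assumed: scores in between go to manual review
}
```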
Using with Claude Code
Run claudemem as an MCP server:
```
claudemem --mcp
```
Then Claude Code can use these tools:
- `search_code` - semantic search (auto-indexes changes)
- `index_codebase` - manual full reindex
- `get_status` - check what's indexed
- `clear_index` - start fresh
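If you register MCP servers through a project-level `.mcp.json`, the entry can look like this (assuming `claudemem` is on your PATH; check the Claude Code MCP docs for your setup):

```json
{
  "mcpServers": {
    "claudemem": {
      "command": "claudemem",
      "args": ["--mcp"]
    }
  }
}
```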
Detective Agents
Install the code-analysis plugin for pre-built agents that use claudemem:
- developer-detective: Trace implementations, find usages
- architect-detective: Analyze architecture, find patterns
- tester-detective: Find test gaps, coverage analysis
- debugger-detective: Trace errors, find bug sources
Documentation Indexing
claudemem can automatically fetch and index documentation for your dependencies. Search across both your code AND the frameworks you use.
Sources (in priority order):
- Context7: 6000+ libraries, versioned API docs (free API key)
- llms.txt: Official AI-optimized docs (Vue, Nuxt, Langchain, etc.)
- DevDocs: Offline documentation for 100+ languages
```
claudemem docs fetch           # fetch docs for all detected dependencies
claudemem docs fetch react vue # fetch specific libraries
claudemem docs status          # show indexed docs
```
How We Compare
| Feature | claudemem | Context | Greptile | Amp |
|---|---|---|---|---|
| Cost | Free / MIT | Free (needs API) | $30/dev/mo | $1,000+ min |
| Privacy | 100% Local | Cloud default | Cloud | Cloud only |
| CLI Tool | Yes | No | No | No |
| Symbol Graph | Yes + PageRank | Yes | No | Yes |
| Adaptive Learning | Yes (EMA-based) | Yes | No | No |
| Embedding Models | Any (cloud/local) | Fixed | Fixed | Fixed |
| Built-in Benchmarks | Full suite | No | No | No |
When This Matters to You
You need claudemem if:
- grep isn't cutting it: You're searching for concepts, not exact strings
- You value privacy: Code never leaves your machine. Ever.
- You want to benchmark: Test embedding models on YOUR codebase
- You need code analysis: Dead code detection, test gaps, change impact
- You use Claude Code: MCP integration gives Claude semantic search superpowers
- You're cost-conscious: Free for local models, ~$0.01/1M tokens for cloud
Quick Start
```
# Install
npm install -g claude-codemem

# Setup
claudemem init

# Index your project
claudemem index

# Search
claudemem search "authentication flow"
claudemem search "where do we validate user input"
```
That's it. Changed some files? Just search again - it auto-reindexes modified files before searching.
Talk to Us
We built claudemem because we needed it. We're using it daily. We're improving it constantly.
If you hit issues. If you have ideas. If you want to contribute. The door's open.
What code search problem are you trying to solve?
Product: claudemem (MadAppGang internal tool)
Duration: 8 months (ongoing development)
Stack: TypeScript, Bun, Tree-sitter, LanceDB, OpenRouter, PageRank
License: MIT (fully open source)
Outcome: 175% NDCG improvement, 100% local privacy, active community