claudemem: Because grep Wasn't Cutting It
Local Semantic Code Search for Claude Code and AI Agents
TypeScript
Bun
Tree-sitter
LanceDB
OpenRouter
PageRank
Hybrid Search
MCP
The Problem with grep
I've been writing code for over twenty years. Last month I spent forty-five minutes searching for a function I wrote myself three weeks earlier.
Forty-five minutes. For code I wrote.
The function was called silentTokenRefresh. Of course it was. I'd typed "token" into grep. Got 847 results. None of them were the actual token handler. I'd scrolled right past it twice before giving up and asking a colleague who remembered the name.
I started timing these searches. Three hours hunting for "the thing that validates webhook signatures." Two hours finding where we actually persist user preferences. An entire afternoon tracing why a settings change wasn't propagating—turned out there were four different settings services, and I was looking at the wrong three.
grep doesn't care about my memory. It doesn't know that when I search for "auth" I probably mean the token refresh flow, not the 200 files that happen to contain the word "authentication" in a comment.
I was tired of feeling stupid in codebases I wrote myself.
The Numbers
| Metric | Result |
|---|---|
| Search accuracy (NDCG) | 175% vs baseline (voyage-code-3) |
| Embedding models supported | 15+ (cloud + local) |
| LLM summarizers benchmarked | 18 models across 6 evaluation methods |
| Languages supported | 8 (incl. TypeScript, Python, Go, Rust, C/C++, Java) |
| Privacy | 100% local - nothing leaves your machine |
| Index cost | ~$0.01 per 1M tokens (cloud) / $0 (local) |
| Distribution | npm, Homebrew, shell installer |
| License | MIT (fully open source) |
Semantic code search that finds what you mean, not what you type. 20-minute searches now take 3 seconds. 50+ repos. 100% local. My grep usage dropped to near zero.
The Architecture: Not Just Another Search Tool
claudemem builds a semantic graph of your codebase. Not a text index. Not a fuzzy matcher. An actual understanding of what calls what and why.
The indexer parses every file into an AST. Functions, classes, methods, calls—all of it becomes nodes in a graph. Then it runs PageRank. The same algorithm Google used to find important web pages, but pointed at your code.
This matters more than it sounds. When you search, results come back ranked by architectural importance. The core authentication handler ranks higher than the seventeen wrapper functions that call it. You find the heart of your system in seconds, not hours.
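The ranking idea can be sketched in a few lines. This is a minimal PageRank over a call graph, not claudemem's actual implementation; the graph shape, damping factor, and iteration count are illustrative assumptions.

```typescript
// Minimal PageRank sketch: nodes are symbols, edges point from caller
// to callee, so heavily-called symbols accumulate rank.
type CallGraph = Map<string, string[]>; // symbol -> symbols it calls

function pageRank(graph: CallGraph, damping = 0.85, iterations = 50): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map<string, number>(nodes.map((s) => [s, 1 / n] as [string, number]));

  for (let iter = 0; iter < iterations; iter++) {
    const next = new Map<string, number>(
      nodes.map((s) => [s, (1 - damping) / n] as [string, number])
    );
    for (const [caller, callees] of graph) {
      if (callees.length === 0) continue; // sketch: dangling mass is simply dropped
      const share = (damping * rank.get(caller)!) / callees.length;
      for (const callee of callees) {
        if (next.has(callee)) next.set(callee, next.get(callee)! + share);
      }
    }
    rank = next;
  }
  return rank;
}
```

Run it on three wrappers that all call one core handler and the handler ends up with the highest score, which is exactly the "find the heart of the system" behavior described above.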
Semantic search sits on top. When I search "refresh tokens silently," it doesn't just pattern-match those words. It understands token operations, refresh patterns, silent execution. It finds silentTokenRefresh even though my query didn't match the exact name.
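Under the hood this kind of search is vector similarity. claudemem delegates storage and nearest-neighbor lookup to LanceDB; the sketch below shows the core idea with a brute-force cosine ranking, using made-up chunk names.

```typescript
// Illustrative embedding search: rank code chunks by cosine similarity
// between the query embedding and each chunk's stored embedding.
type Chunk = { name: string; vector: number[] };

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

function search(query: number[], chunks: Chunk[], topK = 5): Chunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, topK);
}
```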
The caller/callee analysis closed the loop. Before touching any function, I see exactly what depends on it. Last month this caught a disaster—I was about to "clean up" a function that looked unused. claudemem showed ten callers. Nine were dead code we'd been maintaining for two years. The tenth was the payment processor.
We'd been maintaining code nobody called for two years. And we almost deleted code that would have broken payments.
This isn't a research project. It's the architecture I needed to stop wasting hours every week. Everything stays local in .claudemem/ in your project. Nothing goes to a server. Ever.
It changed how we work.
Embedding Model Benchmarks: We Tested Everything
Which embedding model is best for code search? We didn't guess. We measured.
Run claudemem benchmark to test models on your actual codebase. Here's what we found on real code search tasks:
| Model | Speed | NDCG (vs baseline) | Cost | Notes |
|---|---|---|---|---|
| voyage-code-3 | 4.5s | 175% | $0.007 | Best quality |
| gemini-embedding-001 | 2.9s | 170% | $0.007 | Great free option |
| voyage-3-large | 1.8s | 164% | $0.007 | Fast & accurate |
| voyage-3.5-lite | 1.2s | 163% | $0.001 | Best value (default) |
| voyage-3.5 | 1.2s | 150% | $0.002 | Fastest |
| mistral-embed | 16.6s | 150% | $0.006 | Slow |
| text-embedding-3-small | 3.0s | 141% | $0.001 | Decent |
| text-embedding-3-large | 3.1s | 141% | $0.005 | Not worth it |
| all-minilm-l6-v2 | 2.7s | 128% | $0.0001 | Cheapest (local) |
- Best Quality: voyage-code-3 (175% NDCG)
- Best Value: voyage-3.5-lite (163% NDCG, $0.001) - this is the default
- Fastest: voyage-3.5 (1.2s)
- Free/Local: all-minilm-l6-v2 via Ollama
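The NDCG percentages above are relative to a baseline. For readers unfamiliar with the metric, here is the standard formula it is built on: DCG rewards putting highly relevant results near the top, and NDCG normalizes by the best possible ordering. This is the textbook definition, not claudemem's benchmark harness.

```typescript
// Discounted cumulative gain over a list of graded relevances,
// in the order a model ranked them (index 0 = top result).
function dcg(relevances: number[]): number {
  return relevances.reduce((sum, rel, i) => sum + (2 ** rel - 1) / Math.log2(i + 2), 0);
}

// NDCG: the model's DCG divided by the DCG of the ideal ordering.
function ndcg(ranked: number[]): number {
  const ideal = [...ranked].sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(ranked) / idealDcg;
}
```

A perfect ranking scores 1.0; anything that buries relevant results scores lower.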
LLM Summarizer Benchmarks: Which Model Describes Code Best?
claudemem generates natural language descriptions of code chunks. These descriptions power semantic search. Better descriptions = better search results.
We benchmarked 18 LLM models across 6 different evaluation methods:
Evaluation Methods
- Judge (Pointwise): LLM rates each description's quality 0-10
- Judge (Pairwise): LLM picks better description in head-to-head comparisons
- Contrastive Discrimination: Can the description identify its source code among alternatives?
- Retrieval: Does using the description improve search accuracy?
- Downstream Tasks: Does the description help with actual coding tasks?
- Self-Evaluation: Model rates its own outputs (calibration check)
| Model | Retrieval | Contrastive | Judge | Overall |
|---|---|---|---|---|
| gpt-5.1-codex-max | 23% | 83% | 78% | 57% |
| nova-premier-v1 | 27% | 79% | 51% | 56% |
| qwen3-235b-a22b-2507 | 13% | 92% | 79% | 55% |
| opus | 16% | 80% | 71% | 54% |
| deepseek-v3.2 | 13% | 82% | 74% | 52% |
| haiku | 7% | 82% | 69% | 49% |
- Fastest: haiku (3.7s avg latency)
- Best Quality: gpt-5.1-codex-max (57% overall)
- Best Value: deepseek-v3.2 (52% quality, low cost)
Symbol Graph: Beyond Search
Search is table stakes. The real power is the symbol graph.
claudemem tracks every reference between symbols. It computes PageRank scores based on how central each function/class is to your codebase. This enables:
Dead Code Detection
```
claudemem dead-code
```
Finds symbols with zero callers + low PageRank + not exported. Great for cleaning up unused code.
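The heuristic stated here is a simple conjunction, sketched below. The field names and threshold are illustrative assumptions, not claudemem's internal schema.

```typescript
// Hypothetical symbol record for the dead-code heuristic described above.
type SymbolInfo = {
  name: string;
  callers: string[];  // direct callers found in the symbol graph
  pageRank: number;   // architectural importance score
  exported: boolean;  // exported symbols may have external callers
};

// Dead-code candidate: nobody calls it, it is architecturally
// unimportant, and it is not part of the public surface.
function isDeadCodeCandidate(sym: SymbolInfo, rankThreshold = 0.001): boolean {
  return sym.callers.length === 0 && sym.pageRank < rankThreshold && !sym.exported;
}
```

The export check matters: an exported symbol with zero internal callers may still be someone else's entry point.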
Test Coverage Gaps
```
claudemem test-gaps
```
Finds high-PageRank symbols not called by any test file. Prioritize what to test next.
Change Impact Analysis
```
claudemem impact FileTracker
```
Shows all transitive callers, grouped by file. Understand the blast radius before refactoring.
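"Transitive callers" is a breadth-first walk over the reverse call graph. A minimal sketch, assuming a `symbol -> direct callers` map rather than claudemem's actual data structures:

```typescript
// Collect every symbol that directly or indirectly calls `target`
// by walking the reverse call graph breadth-first.
function transitiveCallers(
  callersOf: Map<string, string[]>, // symbol -> its direct callers
  target: string
): Set<string> {
  const seen = new Set<string>();
  const queue = [...(callersOf.get(target) ?? [])];
  while (queue.length > 0) {
    const current = queue.shift()!;
    if (seen.has(current)) continue; // avoid cycles and repeats
    seen.add(current);
    queue.push(...(callersOf.get(current) ?? []));
  }
  return seen;
}
```

The `seen` set makes the walk safe on recursive call graphs, which real codebases always have.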
Symbol Navigation
```
claudemem symbol handleAuth    # find definition
claudemem callers handleAuth   # what calls this?
claudemem callees handleAuth   # what does this call?
claudemem context handleAuth   # all of the above
```
Self-Learning System (Experimental)
This is where it gets interesting.
Traditional ML validation assumes millions of samples and explicit labels. Our context is different: 50-500 sessions per project, no user ratings, data stays local.
claudemem's self-learning system uses implicit feedback signals:
| Signal Type | How Detected | Weight |
|---|---|---|
| Lexical Correction | User says "no", "wrong", "actually" | 0.30 |
| Strategy Pivot | Sudden change in tool usage after failure | 0.20 |
| Overwrite | User edits same file region agent modified | 0.35 |
| Reask | User repeats similar prompt | 0.15 |
The strongest signal: Code Survival Rate
```
code_survival_rate = lines_kept / lines_written_by_agent
```
If the user keeps the agent's code in their git commit, the agent did well. If they rewrite everything, it failed.
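The signal weights from the table and the survival-rate formula can be combined into a single score. The weights below come straight from the table; the summation itself is an illustrative assumption, not the shipped formula.

```typescript
// Weights for implicit negative-feedback signals, from the table above.
const SIGNAL_WEIGHTS = {
  lexicalCorrection: 0.3,  // user says "no", "wrong", "actually"
  strategyPivot: 0.2,      // sudden change in tool usage after failure
  overwrite: 0.35,         // user edits the region the agent modified
  reask: 0.15,             // user repeats a similar prompt
} as const;

type Signals = Partial<Record<keyof typeof SIGNAL_WEIGHTS, boolean>>;

// Sum the weights of whichever signals fired in a session.
function negativeFeedbackScore(signals: Signals): number {
  return (Object.keys(SIGNAL_WEIGHTS) as (keyof typeof SIGNAL_WEIGHTS)[])
    .reduce((sum, key) => sum + (signals[key] ? SIGNAL_WEIGHTS[key] : 0), 0);
}

// The survival-rate formula from above; a session with no agent-written
// lines is treated as neutral (1.0) by assumption.
function codeSurvivalRate(linesKept: number, linesWritten: number): number {
  return linesWritten === 0 ? 1 : linesKept / linesWritten;
}
```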
What Gets Generated
- Skills: Automatable sequences become slash commands
- Subagents: Error clusters become specialized agents
- Prompt Optimizations: Correction patterns become prompt additions
Safety Validation
Changes are tested against a Red Team before deployment:
- Red Team: Attacks with edge cases, malformed data, injections
- Blue Team: Defends with validation, sanitization, limits
- Safety Score: > 0.90 = Deploy, < 0.70 = Reject
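As a decision rule, the thresholds above look like this. The document only specifies the two extremes; routing the 0.70–0.90 band to manual review is my assumption.

```typescript
type Verdict = "deploy" | "review" | "reject";

// Gate a generated change on its safety score.
function safetyVerdict(score: number): Verdict {
  if (score > 0.9) return "deploy";
  if (score < 0.7) return "reject";
  return "review"; // assumed: scores in between go to manual review
}
```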
Using with Claude Code
Run claudemem as an MCP server:
```
claudemem --mcp
```
Then Claude Code can use these tools:
- `search_code` - semantic search (auto-indexes changes)
- `index_codebase` - manual full reindex
- `get_status` - check what's indexed
- `clear_index` - start fresh
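If you register MCP servers through a project-level `.mcp.json`, the entry can look like this (assuming `claudemem` is on your PATH; check the Claude Code MCP docs for your setup):

```json
{
  "mcpServers": {
    "claudemem": {
      "command": "claudemem",
      "args": ["--mcp"]
    }
  }
}
```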
Detective Agents
Install the code-analysis plugin for pre-built agents that use claudemem:
- developer-detective: Trace implementations, find usages
- architect-detective: Analyze architecture, find patterns
- tester-detective: Find test gaps, coverage analysis
- debugger-detective: Trace errors, find bug sources
Documentation Indexing
claudemem can automatically fetch and index documentation for your dependencies. Search across both your code AND the frameworks you use.
Sources (in priority order):
- Context7: 6000+ libraries, versioned API docs (free API key)
- llms.txt: Official AI-optimized docs (Vue, Nuxt, Langchain, etc.)
- DevDocs: Offline documentation for 100+ languages
```
claudemem docs fetch           # fetch docs for all detected dependencies
claudemem docs fetch react vue # fetch specific libraries
claudemem docs status          # show indexed docs
```
How We Compare
| Feature | claudemem | Context | Greptile | Amp |
|---|---|---|---|---|
| Cost | Free / MIT | Free (needs API) | $30/dev/mo | $1,000+ min |
| Privacy | 100% Local | Cloud default | Cloud | Cloud only |
| CLI Tool | Yes | No | No | No |
| Symbol Graph | Yes + PageRank | Yes | No | Yes |
| Adaptive Learning | Yes (EMA-based) | Yes | No | No |
| Embedding Models | Any (cloud/local) | Fixed | Fixed | Fixed |
| Built-in Benchmarks | Full suite | No | No | No |
When This Matters to You
You need claudemem if:
- grep isn't cutting it: You're searching for concepts, not exact strings
- You value privacy: Code never leaves your machine. Ever.
- You want to benchmark: Test embedding models on YOUR codebase
- You need code analysis: Dead code detection, test gaps, change impact
- You use Claude Code: MCP integration gives Claude semantic search superpowers
- You're cost-conscious: Free for local models, ~$0.01/1M tokens for cloud
Quick Start
```
# Install
npm install -g claude-codemem

# Setup
claudemem init

# Index your project
claudemem index

# Search
claudemem search "authentication flow"
claudemem search "where do we validate user input"
```
That's it. Changed some files? Just search again - it auto-reindexes modified files before searching.
Talk to Us
We built claudemem because we needed it. We're using it daily. We're improving it constantly.
If you hit issues. If you have ideas. If you want to contribute. The door's open.
What code search problem are you trying to solve?
Product: claudemem (MadAppGang internal tool)
Duration: 8 months (ongoing development)
Stack: TypeScript, Bun, Tree-sitter, LanceDB, OpenRouter, PageRank
License: MIT (fully open source)
Outcome: 175% NDCG improvement, 100% local privacy, active community