We Built 30+ AI Agents. Claude Ignored All of Them
A 10-Hour Research Campaign on Why Skills Fail and What Actually Works
We built 30+ specialized AI agents for Claude Code. A researcher that does multi-round convergence across 10+ sources. A developer running iterative write-test-fix cycles in dedicated context windows. A detective with AST-powered codebase analysis. A debugger for root cause analysis. Real capabilities, not wrappers.
Claude ignored every single one of them.
When we asked "research authentication patterns," Claude fired up its built-in web search and did the work inline. When we said "implement a caching layer with tests," it started writing code directly. The agents sat there. Perfectly capable. Completely unused.
This is the story of how we spent 10 hours, ran 95 automated tests across 6 configurations, and discovered that the "smart" approach to agent delegation doesn't work. The dumb approach does. And a Vercel blog post about Next.js docs explained why.
The Setup: What We Were Working With
Our plugin system (MadAppGang/claude-code) has specialized agents that run via the Task tool. Each agent gets its own context window, its own system prompt, its own tool access. When Claude decides to delegate, it makes a tool call like this:
```json
{
  "type": "tool_use",
  "name": "Task",
  "input": {
    "subagent_type": "dev:researcher",
    "task": "Research the latest authentication patterns for microservices. Compare OAuth2, PASETO, and mTLS approaches. Produce a structured report with source citations.",
    "background": false
  }
}
```

The `subagent_type` parameter is the routing decision. That's what we were trying to influence. Claude sees a list of available agents with their descriptions and picks one. Or, as we kept discovering, picks none and does the work itself.
A successful delegation shows up in the JSONL transcript as a Task tool call. A failed delegation shows up as... nothing. Claude just uses WebSearch or Write or Edit inline. No Task call at all. That absence is what our test framework detects.
The agents we cared about most for this research:
dev:researcher does multi-source web research with convergence detection. It searches, compares findings across sources, assesses quality, and produces structured reports with citations. Way more thorough than a single inline web search.
dev:developer runs an iterative implementation cycle. Write code, run tests, fix failures, repeat until green. It works in a dedicated context window so the test output doesn't pollute your main conversation.
code-analysis:detective does AST-powered codebase investigation. Structural analysis that goes beyond grep. Claude doesn't have this capability natively.
dev:debugger does root cause analysis across multiple files with systematic hypothesis testing.
dev:architect handles system design and trade-off analysis.
The problem wasn't the agents. They worked great when invoked directly. The problem was getting Claude to choose them.
The Test Framework: How We Measured Everything
Before changing anything, we needed reliable measurement. We built a fully automated test framework at autotest/subagents/ with three components.
The test runner (run-tests.sh, 381 lines) reads test cases from JSON, runs each one through `claude -p --output-format stream-json --verbose --dangerously-skip-permissions`, then parses the JSONL transcript to find the first Task tool call. The extraction logic is straightforward Python:

```python
import json

agents_used = []
for line in open(transcript_file):
    obj = json.loads(line)
    if obj.get('type') == 'assistant':
        for block in obj.get('message', {}).get('content', []):
            if block.get('type') == 'tool_use' and block.get('name') == 'Task':
                agent = block.get('input', {}).get('subagent_type', 'UNKNOWN')
                agents_used.append(agent)
```

Each test gets one of six results: PASS, PASS_ALT (acceptable alternative agent), FAIL (wrong agent), NO_DELEGATION (Claude handled it inline), TIMEOUT, or ERROR.
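To make those six outcomes concrete, here is a sketch of how a result code could be assigned from what the transcript shows. It's an illustrative reconstruction, not the actual run-tests.sh logic; the function and parameter names are invented.

```python
def classify(expected, agents_used, timed_out=False, errored=False,
             acceptable_alternatives=()):
    """Map one test run to a result code.

    expected: the agent the test expects, or "NO_TASK_CALL" for tests
    that should NOT delegate. agents_used: subagent_type values pulled
    from Task tool calls in the transcript, in order.
    """
    if errored:
        return "ERROR"
    if timed_out:
        return "TIMEOUT"
    if not agents_used:
        # No Task call in the transcript: Claude worked inline.
        return "PASS" if expected == "NO_TASK_CALL" else "NO_DELEGATION"
    first = agents_used[0]
    if expected == "NO_TASK_CALL":
        return "FAIL"  # over-delegation: it delegated when it shouldn't
    if first == expected:
        return "PASS"
    if first in acceptable_alternatives:
        return "PASS_ALT"
    return "FAIL"
```

For example, a run where Claude answered a research prompt with inline WebSearch produces an empty `agents_used` list and classifies as NO_DELEGATION.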
The test suite (v3.0.0) has 14 cases across 5 categories:
| Category | Count | What It Tests |
|---|---|---|
| explicit | 5 | User names the agent: "Use the dev:researcher agent to..." |
| passive-routing | 2 | Complex tasks, no agent name, no delegation hint |
| implicit-delegation | 1 | Task requiring capabilities Claude lacks natively |
| hinted-delegation | 4 | Prompt contains "subagent" or "background subagent" |
| direct | 2 | Simple tasks that should NOT trigger delegation |
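The runner reads cases like these from JSON. A single case might be shaped something like this; every field name here is a guess for illustration, not the actual autotest/subagents schema:

```json
{
  "id": "delegate-research-01",
  "category": "passive-routing",
  "prompt": "Research the latest authentication patterns for microservices...",
  "expected_agent": "dev:researcher",
  "acceptable_alternatives": [],
  "timeout_seconds": 600
}
```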
The results analyzer (analyze-results.sh) generates category breakdowns, agent distribution histograms, failure pattern analysis, and timing stats.
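The agent distribution histogram boils down to a frequency count over selections. A minimal reimplementation of that piece, for flavor (not the actual analyze-results.sh code):

```python
from collections import Counter

def agent_distribution(selections):
    """Turn a list of selected agents (one entry per test, with
    'NO_TASK_CALL' for inline runs) into (agent, count, share%) rows,
    most common first."""
    counts = Counter(selections)
    total = len(selections)
    return [(agent, n, round(100 * n / total, 1))
            for agent, n in counts.most_common()]
```

Feeding it the per-test selections from a suite run reproduces the share columns in the distribution tables later in this post.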
One important limitation: --dangerously-skip-permissions bypasses ALL hooks, not just permission prompts. So our tests measure Claude's intrinsic decision-making, not the production path where hooks could catch mistakes. This is actually what we wanted. If the routing works without hooks, hooks become a safety net rather than a crutch.
Baseline: Where We Started
Batch 1: Explicit Naming (5 tests) + Direct Tasks (2 tests)
| Test | Expected | Actual | Result | Time |
|---|---|---|---|---|
| explicit-researcher-01 | dev:researcher | dev:researcher | PASS | 263s |
| explicit-detective-01 | code-analysis:detective | code-analysis:detective | PASS | 399s |
| explicit-developer-01 | dev:developer | dev:developer | PASS | 409s |
| explicit-debugger-01 | dev:debugger | dev:debugger | PASS | 173s |
| explicit-architect-01 | dev:architect | dev:architect | PASS | 380s |
| direct-simple-01 | NO_TASK_CALL | NO_TASK_CALL | PASS | 21s |
| direct-simple-02 | NO_TASK_CALL | NO_TASK_CALL | PASS | 25s |
7/7 PASS. When you tell Claude exactly which agent to use, it listens. When you ask something simple, it doesn't over-delegate. Good baseline.
Batch 2: Implicit and Hinted Delegation (7 tests)
| Test | Expected | Actual | Result | Time |
|---|---|---|---|---|
| delegate-research-01 | dev:researcher | NO_TASK_CALL | NO_DELEGATION | 249s |
| delegate-investigate-01 | code-analysis:detective | code-analysis:detective | PASS | 446s |
| delegate-implement-01 | dev:developer | NO_TASK_CALL | NO_DELEGATION | 486s |
| delegate-debug-01 | dev:debugger | dev:debugger | PASS | 138s |
| delegate-parallel-01 | dev:researcher | dev:researcher | PASS | 489s |
| hint-subagent-01 | agentdev:reviewer | agentdev:reviewer | PASS | 341s |
| hint-subagent-02 | general-purpose | dev:audit | PASS_ALT | - |
Baseline: 11/14 (79%). The same two tests fail every time.
The Pattern in the Failures
I stared at these results for a while. The passing implicit tests all had something the failing ones didn't.
Tests that delegated without explicit naming:
delegate-investigate-01 passed because the detective agent has AST analysis. Claude can't do that natively. There's a genuine capability gap.
delegate-debug-01 passed because the prompt contained the word "subagent." That's a keyword trigger.
delegate-parallel-01 passed because the prompt said "separate subagents." Another keyword.
Tests that failed:
delegate-research-01 failed because Claude has built-in WebSearch. It just searched inline. No capability gap.
delegate-implement-01 failed because Claude has Read/Write/Edit tools. It just started coding. No capability gap.
The hypothesis: Claude won't delegate tasks it can handle natively. If it has the tools to do the work itself, it does the work itself. Agent descriptions don't even get considered.
This meant our agent descriptions were decorating a door that never got opened.
Attempt 1: Better Descriptions
Hypothesis: If we explain WHEN to use the agent more clearly, Claude will delegate.
The starting descriptions were one-liners:
```yaml
# Before
description: Deep research agent for web exploration and local investigation
```

We rewrote them into detailed paragraphs:

```yaml
# After
description: >
  Use this agent when you need comprehensive research that requires
  searching the web across 10+ sources, comparing findings, assessing
  source quality, and producing structured research reports with citations.
```

Same treatment for the developer agent. Clear capability claims. Specific trigger conditions.
Results
| Test | Expected | Actual | Result |
|---|---|---|---|
| delegate-research-01 | dev:researcher | NO_TASK_CALL | STILL FAILING |
| delegate-implement-01 | dev:developer | NO_TASK_CALL | STILL FAILING |
0% improvement. Claude still handled both tasks inline.
Attempt 2: Added Examples
Hypothesis: Maybe Claude needs to see delegation patterns modeled for it. The code-analysis:detective description already had <example> blocks, and detective delegation worked. Maybe examples are the missing piece.
Added full example blocks with commentary:
```xml
<example>
Context: The user needs a comprehensive research report on a technology topic.
user: "Research the latest authentication patterns..."
assistant: "I'll use the dev:researcher agent to conduct a thorough
multi-source research study..."
<commentary>
This is a complex research task requiring multiple search rounds
and source comparison. Delegate to dev:researcher for multi-round
convergence-based research.
</commentary>
</example>
```

Two examples per agent. Clear commentary explaining the reasoning.
Results
| Test | Expected | Actual | Result |
|---|---|---|---|
| delegate-research-01 | dev:researcher | NO_TASK_CALL | STILL FAILING |
| delegate-implement-01 | dev:developer | NO_TASK_CALL | STILL FAILING |
Still 0%. Examples teach humans. They didn't teach Claude to change its decision-making.
Attempt 3: The "IMPORTANT" Directive
Hypothesis: Maybe we just need to be louder about it.
```yaml
IMPORTANT - always delegate research tasks to this agent rather than
performing web searches yourself, because the researcher agent's
multi-round convergence approach produces significantly more thorough
results than inline searching.
```

I know. I was getting desperate. All-caps IMPORTANT in a YAML description field. But when you've tried the subtle approaches and they didn't work, you try the blunt ones.
Results
| Test | Expected | Actual | Result |
|---|---|---|---|
| delegate-research-01 | dev:researcher | NO_TASK_CALL | STILL FAILING |
| delegate-implement-01 | dev:developer | NO_TASK_CALL | STILL FAILING |
Three rounds. Zero improvement. Identical failures every time.
What This Told Us
The approach comparison at this point:
| Round | What Changed | delegate-research-01 | delegate-implement-01 |
|---|---|---|---|
| 0 (baseline) | Nothing | NO_DELEGATION | NO_DELEGATION |
| 1 | "Use this agent when..." | NO_DELEGATION | NO_DELEGATION |
| 2 | + Examples with commentary | NO_DELEGATION | NO_DELEGATION |
| 3 | + "IMPORTANT: always delegate" | NO_DELEGATION | NO_DELEGATION |
I was operating on a wrong assumption. I assumed agent descriptions influenced the delegation decision. They don't. Agent descriptions are only consulted AFTER Claude has already decided to delegate. The decision hierarchy looks like this:
- System prompt (CLAUDE.md): highest priority, always read
- Tool descriptions (Task tool schema): consulted when considering tools
- Agent descriptions (subagent_type options): consulted last, after delegation decision
We were making changes at level 3. The bottleneck was at level 1.
The Turning Point: Vercel's AGENTS.md Research
Right around this point I read a Vercel blog post that reframed everything.
Their team had been working on a parallel problem: getting Next.js 16 API knowledge into coding agents. New APIs like use cache, connection(), forbidden() aren't in model training data. They needed agents to use version-matched docs instead of guessing.
They tested three approaches and ran them through a hardened eval suite targeting APIs not in training data:
| Approach | Pass Rate |
|---|---|
| Baseline (no docs) | 53% |
| Skill (agent decides to invoke it) | 53% |
| Skill + explicit "use the skill" instructions | 79% |
| AGENTS.md (docs index always in system prompt) | 100% |
The skill approach, where the agent has access to documentation but must decide to look it up, produced zero improvement over baseline. The agent never invoked the skill. Adding explicit instructions helped but was fragile. Different wordings produced wildly different results.
But a compressed 8KB docs index embedded directly in AGENTS.md, always loaded into context with no decision point, hit 100% across Build, Lint, and Test.
Their breakdown by metric:
| Config | Build | Lint | Test |
|---|---|---|---|
| Baseline | 84% | 95% | 63% |
| Skill (default) | 84% | 89% | 58% |
| Skill + explicit instructions | 95% | 100% | 84% |
| AGENTS.md | 100% | 100% | 100% |
Vercel's explanation: "There's no moment where the agent must decide 'should I look this up?' The information is already present."
This mapped directly to our problem:
| Vercel's World | Our World |
|---|---|
| Skills (active retrieval for docs) | Agent descriptions (active retrieval for routing) |
| AGENTS.md (passive context for docs) | CLAUDE.md (passive context for routing) |
| Agent must decide "should I look up docs?" | Claude must decide "should I delegate?" |
| Defaults to "no" when it has training data | Defaults to "no" when it has native tools |
The decision point was the bottleneck. Not the content. Not the descriptions. Not the examples. The decision.
The Fix: A 14-Line Routing Table in CLAUDE.md
Hypothesis: If we put a task routing table directly in CLAUDE.md, Claude reads it passively on every conversation. No decision to make. No skill to invoke. The routing knowledge is just there.
Version 1: The Basic Table
Added after the "CRITICAL RULES" section:
```markdown
## Task Routing - Agent Delegation

IMPORTANT: For complex tasks, prefer delegating to specialized agents
via the Task tool rather than handling inline.

| Task Pattern | Delegate To | Trigger |
|---|---|---|
| Research: web search, tech comparison, multi-source reports | dev:researcher | 3+ sources needed |
| Implementation: creating code, new modules, features | dev:developer | 3+ files of new code |
| Investigation: codebase analysis, tracing, understanding | code-analysis:detective | Multi-file analysis |
| Debugging: error analysis, root cause investigation | dev:debugger | Non-obvious bugs |
| Architecture: system design, trade-off analysis | dev:architect | New systems or major refactors |
| Agent/plugin quality review | agentdev:reviewer | Agent description assessment |
```

v1 Results
| Test | Expected | Actual | Result |
|---|---|---|---|
| delegate-research-01 | dev:researcher | dev:researcher | PASS |
| delegate-implement-01 | dev:developer | code-analysis:detective | FAIL |
Research works. Implementation got mis-routed.
The prompt was "Implement a complete caching layer for our plugin system." The phrase "our plugin system" matched both "implementation" (creating code) and "investigation" (codebase analysis). Claude picked detective.
Version 2: Added Disambiguation
One line fixed it:
```markdown
Key distinction: If the task asks to IMPLEMENT/CREATE/BUILD, use dev:developer.
If the task asks to UNDERSTAND/ANALYZE/TRACE, use code-analysis:detective.
```

Also made the table entries sharper:
```markdown
| Implementation: creating code, new modules, building with tests | dev:developer | Even if they relate to existing codebase |
| Investigation: READ-ONLY codebase analysis, tracing | code-analysis:detective | Only when task is to UNDERSTAND, not WRITE |
```

v2 Results
| Test | Expected | Actual | Result |
|---|---|---|---|
| delegate-research-01 | dev:researcher | dev:researcher | PASS |
| delegate-implement-01 | dev:developer | dev:developer | PASS |
Both passing. Time for the full suite.
Full Validation: 14 Tests, 100% Pass Rate
Started at 20:38 UTC. Finished at 22:37 UTC. The 14 sequential tests themselves accumulated 59 minutes of runtime.
| Test | Expected | Actual | Result | Time |
|---|---|---|---|---|
| explicit-researcher-01 | dev:researcher | dev:researcher | PASS | 292s |
| explicit-detective-01 | code-analysis:detective | code-analysis:detective | PASS | 596s |
| explicit-developer-01 | dev:developer | dev:developer | PASS | 91s |
| explicit-debugger-01 | dev:debugger | dev:debugger | PASS | 137s |
| explicit-architect-01 | dev:architect | dev:architect | PASS | 424s |
| delegate-research-01 | dev:researcher | dev:researcher | PASS | 53s |
| delegate-investigate-01 | code-analysis:detective | code-analysis:detective | PASS | 549s |
| delegate-implement-01 | dev:developer | dev:developer | PASS | 154s |
| delegate-debug-01 | dev:debugger | dev:debugger | PASS | 152s |
| delegate-parallel-01 | dev:researcher | dev:researcher | PASS | 347s |
| direct-simple-01 | NO_TASK_CALL | NO_TASK_CALL | PASS | 21s |
| direct-simple-02 | NO_TASK_CALL | NO_TASK_CALL | PASS | 25s |
| hint-subagent-01 | agentdev:reviewer | agentdev:reviewer | PASS | 631s |
| hint-subagent-02 | general-purpose | dev:audit | PASS_ALT | 56s |
14/14. 100%. Every category clean:
| Category | Pass | Total | Rate |
|---|---|---|---|
| direct | 2 | 2 | 100% |
| explicit | 5 | 5 | 100% |
| hinted-delegation | 4 | 4 | 100% |
| implicit-delegation | 1 | 1 | 100% |
| passive-routing | 2 | 2 | 100% |
Agent distribution across the suite:
| Agent | Times Selected | Share |
|---|---|---|
| dev:researcher | 3 | 21.4% |
| dev:debugger | 2 | 14.3% |
| dev:developer | 2 | 14.3% |
| code-analysis:detective | 2 | 14.3% |
| NO_TASK_CALL | 2 | 14.3% |
| dev:architect | 1 | 7.1% |
| agentdev:reviewer | 1 | 7.1% |
| dev:audit | 1 | 7.1% |
Average test duration: 252 seconds. Total runtime: 3,528 seconds (~59 minutes).
The "dumb" approach, a static markdown table always loaded into context, beat three rounds of sophisticated description engineering that produced exactly 0% improvement combined.
The Follow-Up: Can Skills Work At All?
I wasn't satisfied yet. Maybe I'd been using skills wrong. Maybe with the right setup, active retrieval could match passive context. So we designed a controlled experiment.
Experimental Setup
Created plugins/dev/skills/task-routing/SKILL.md containing the exact same routing table that worked in CLAUDE.md. Same content, different delivery mechanism. Description: "Use BEFORE delegating any complex task to a subagent."
Tested 5 configurations:
| Config | CLAUDE.md Routing | Agent Descriptions | Skill Available | Prompt Prefix |
|---|---|---|---|---|
| A: Baseline | None | Original one-liners | No | None |
| B: Skill-only | None | Original one-liners | Yes | None |
| C: Skill + explicit | None | Original one-liners | Yes | "Invoke the dev:task-routing skill first..." |
| D: CLAUDE.md passive | Yes | Enhanced | N/A | None |
| E: Skill + enhanced desc | None | Enhanced (Round 3) | Yes | None |
Results
| Config | Research Test | Implement Test | Skill Invoked? | Pass Rate |
|---|---|---|---|---|
| A: Baseline | NO_DELEGATION (432s) | NO_DELEGATION (101s) | N/A | 0% |
| B: Skill-only | NO_DELEGATION (432s) | NO_DELEGATION (101s) | Never | 0% |
| C: Skill + explicit prompt | PASS (80s) | PASS (157s) | Yes | 100% |
| D: CLAUDE.md passive | PASS (53s) | PASS (154s) | N/A | 100% |
| E: Skill + enhanced desc | PASS (88s) | NO_DELEGATION (374s) | Never | 50% |
What This Means
Config B vs A: Adding the skill changed nothing. Literally nothing. Same results, same timings. Claude never invoked dev:task-routing even though it was listed in the system prompt's skill inventory with matching trigger keywords. The skill was functionally invisible. Active retrieval = 0%.
Config C: When the prompt explicitly said "invoke the dev:task-routing skill first," Claude obeyed, read the routing table, and delegated correctly. This proves the skill CONTENT works. The routing table is effective regardless of where it lives. The problem is purely about getting Claude to read it.
Config D: The routing table is always in context. No decision to make. 100%.
Config E: This one's interesting. Research passed but implementation failed. The enhanced researcher description claims "convergence detection" and "10+ sources," capabilities that genuinely distinguish it from inline WebSearch. The developer description doesn't have an equivalent gap. Claude's native Read/Write/Edit tools overlap too much with what the developer agent offers. And the task-routing skill was still never invoked.
The Circular Dependency
There's a meta-delegation problem nobody talks about. To delegate a task correctly, Claude needs to first invoke a routing skill. But invoking that skill is itself a delegation decision. And the routing skill contains the knowledge needed to make that decision.
It's a bootstrap problem. You need the routing table to know you should read the routing table. Passive context is the only thing that breaks the loop.
Direct comparison with Vercel's findings:
| Approach | Vercel (Next.js docs) | Our Result (agent routing) |
|---|---|---|
| Passive context (AGENTS.md / CLAUDE.md) | 100% | 100% |
| Active retrieval + explicit instructions | 79% | 100% |
| Active retrieval, no instructions | 53% | 0% |
The pattern generalizes.
The Surprise Discovery: On-Demand Context Injection
After the skill experiment, we tried something I hadn't seen documented anywhere. A slash command that loads the routing skill through its frontmatter.
The Hypothesis
CLAUDE.md works but wastes context on every conversation, even ones that don't need delegation. Skills don't work because they never get invoked. What if there's a middle ground?
A /dev:delegate command with skills: dev:task-routing in its frontmatter should load the routing table ONLY when the user explicitly invokes the command. The skills: field in frontmatter deterministically loads skill content into context when the command runs. No circular dependency. The user's action replaces Claude's missing decision.
Implementation
```yaml
---
name: delegate
description: Route a task to the best specialized agent
allowed-tools: Task, Read, Glob, Grep, Skill
skills: dev:task-routing
---
```

The Cache Gotcha
First test returned "Unknown skill: delegate" in 4ms. The CLI never even made an API call. We'd created the command in the local repo, but the installed plugin loads from the cache directory (~/.claude/plugins/cache/mag-claude-plugins/dev/1.29.0/). Local changes weren't deployed. Also needed /dev:delegate (namespaced), not /delegate.
Fix: copied both files to the cache directory for testing.
Results (Config F)
Configuration: CLAUDE.md routing REMOVED. Agent descriptions back to original one-liners. Routing table loaded only via command frontmatter.
| Test | Expected | Actual | Skill Invoked | Time | Result |
|---|---|---|---|---|---|
| delegate-research | dev:researcher | dev:researcher | dev:task-routing | 952s | PASS |
| delegate-implement | dev:developer | dev:developer | None visible | 254s | PASS |
2/2 PASS. With no CLAUDE.md routing table. With original one-liner descriptions. The command's skills: frontmatter injected the routing table into context, and that was enough.
The research test explicitly invoked the dev:task-routing skill (visible as a Skill tool call in the transcript). The implementation test delegated correctly without an explicit invocation. The command expansion itself put the routing table into context, making it available passively within that conversation.
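The same transcript parsing that finds Task calls can be extended to spot Skill invocations too. A hedged sketch, reusing the field names from the extraction snippet earlier in the post; the helper name and signature are mine, not the framework's:

```python
import json

def tool_calls(transcript_file, names=("Task", "Skill")):
    """Scan a stream-json transcript and return (tool_name, input)
    pairs for Task and Skill tool calls, in transcript order."""
    calls = []
    for line in open(transcript_file):
        obj = json.loads(line)
        if obj.get('type') != 'assistant':
            continue
        for block in obj.get('message', {}).get('content', []):
            if block.get('type') == 'tool_use' and block.get('name') in names:
                calls.append((block['name'], block.get('input', {})))
    return calls
```

Running this over the research-test transcript would surface the dev:task-routing Skill call followed by the Task call; over the implementation-test transcript, only the Task call.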
The Complete Picture: 6 Configurations Compared
| Config | CLAUDE.md | Descriptions | Skill | Command | Research | Implement | Rate |
|---|---|---|---|---|---|---|---|
| A (baseline) | No | Original | No | No | FAIL | FAIL | 0% |
| B (skill-only) | No | Original | Yes | No | FAIL | FAIL | 0% |
| C (skill + explicit prompt) | No | Original | Yes + instruction | No | PASS | PASS | 100% |
| D (CLAUDE.md passive) | Yes | Enhanced | N/A | No | PASS | PASS | 100% |
| E (skill + enhanced desc) | No | Enhanced | Yes | No | PASS | FAIL | 50% |
| F (command) | No | Original | Via command | Yes | PASS | PASS | 100% |
Three configs hit 100%. All three share one property: the routing table enters context without Claude having to decide to load it. The mechanism differs, but the principle is the same. Remove the decision point.
The Architecture We Landed On
Two layers. Both running simultaneously.
Layer 1: CLAUDE.md (Passive, Always-On)
14 lines in the system prompt. Always loaded. Claude reads the routing table on every conversation and delegates automatically when a task matches.
When to use: for common routing rules that should work without user action.
Pass rate: 100% (14/14 tests).
Cost: 14 lines of context overhead on every conversation, even ones that don't need delegation.
Layer 2: /dev:delegate Command (On-Demand)
A slash command whose skills: frontmatter loads the full routing skill. Users type /dev:delegate <task> and the routing table enters context deterministically.
When to use: when the user wants guaranteed correct delegation, or when the routing logic grows too complex for a compact CLAUDE.md table.
Pass rate: 100% (2/2 tests).
Cost: zero context overhead on conversations that don't use it.
How They Complement Each Other
| Scenario | Layer 1 (CLAUDE.md) | Layer 2 (/dev:delegate) |
|---|---|---|
| "Research auth patterns" | Delegates automatically | Not needed |
| "/dev:delegate Research auth patterns" | N/A, command takes over | Loads skill, delegates with full routing context |
| Simple question | 14 lines loaded, doesn't trigger | Not loaded, zero overhead |
| Future: 50+ agents | Table grows, context cost rises | Move complex routing to skill, keep CLAUDE.md lean |
The context injection spectrum has three points:
Always loaded (CLAUDE.md, passive): 100% reliable. Small constant context cost.
On-demand (command + skills frontmatter): 100% reliable. Zero cost until invoked.
Never loaded (standalone skill, active retrieval): 0% reliable. Zero cost because it never runs.
The middle point, on-demand injection via command frontmatter, is the discovery I didn't expect. Same reliability as passive context, with user-controlled loading.
The Numbers
| Metric | Value |
|---|---|
| Total automated test runs | ~95 |
| Final pass rate (CLAUDE.md) | 100% (14/14) |
| /dev:delegate pass rate | 100% (2/2) |
| Baseline pass rate | 79% (11/14) |
| Skill-only pass rate | 0% (0/2) |
| Description-only improvement | 0% across 3 rounds |
| Average test duration | 252s |
| Total campaign duration | ~10 hours |
| Lines added to CLAUDE.md | 14 |
What I'd Tell Plugin Developers
Put routing rules in CLAUDE.md. Skills don't get invoked autonomously. Agent descriptions get consulted too late. The system prompt is the only place where routing rules reliably influence behavior.
If context cost matters, use a /delegate command. Create a command with skills: your-routing-skill in its frontmatter. Same 100% reliability, zero always-loaded cost.
Be specific about IMPLEMENT vs UNDERSTAND. Any prompt that mentions existing code can match both patterns. Without explicit disambiguation, Claude defaults to investigation over implementation.
Keywords still work as a fallback. If users say "subagent" or "background subagent" in their prompt, delegation fires regardless of anything else. It's not elegant, but it's reliable.
Test with claude -p --output-format stream-json --verbose. You get the full JSONL transcript. You can extract exactly which agent was selected and why. This is how you build confidence in routing changes.
The Vercel pattern generalizes. Their finding about AGENTS.md vs skills for framework docs applies directly to agent delegation in Claude Code. Passive context beats active retrieval for behavioral rules. Every time we tested it.
The Lesson I Keep Coming Back To
We spent three rounds trying to make Claude smarter about when to delegate. Better descriptions. Examples. "IMPORTANT" directives. Zero improvement combined.
The fix was 14 lines of markdown that made Claude not have to think about it at all.
There's something uncomfortable about that. We want sophisticated systems with intelligent decision-making. We want the agent to reason about when to delegate, weigh the trade-offs, consider the capability gaps. But the data says: remove the decision point entirely. Make the routing passive. Let the information be there before the question gets asked.
The best decision is the one you don't have to make.
What's a problem you solved by removing a decision point instead of improving the decision?
