
We Built 30+ AI Agents. Claude Ignored All of Them

Jack Rudenko, CTO of MadAppGang

A 10-hour research campaign on why skills fail and what actually works

We have built 30+ specialised AI agents for Claude Code. A researcher that does multi-round convergence across 10+ sources. A developer running iterative write-test-fix cycles in dedicated context windows. A detective with AST-powered codebase analysis. A debugger for root cause analysis. Real capabilities, not wrappers.

Claude ignored every single one of them.

When we asked "research authentication patterns," Claude fired up its built-in web search and did the work inline. When we said, "Implement a caching layer with tests," it started writing code directly. The agents sat there. Perfectly capable. Completely unused.

This is the story of how we spent 10 hours, ran 95 automated tests across 6 configurations, and discovered that the "smart" approach to agent delegation doesn't work. The dumb approach does. And a Vercel blog post about Next.js docs explained why.

The setup: What we were working with

Our plugin system (MadAppGang/claude-code) has specialized agents that run via the `Task` tool. Each agent gets its own context window, its own system prompt, its own tool access. When Claude decides to delegate, it makes a tool call like this:

```json
{
  "type": "tool_use",
  "name": "Task",
  "input": {
    "subagent_type": "dev:researcher",
    "task": "Research the latest authentication patterns for microservices. Compare OAuth2, PASETO, and mTLS approaches. Produce a structured report with source citations.",
    "background": false
  }
}
```

The `subagent_type` parameter is the routing decision. That's what we were trying to influence. Claude sees a list of available agents with their descriptions and picks one. Or, as we kept discovering, picks none and does the work itself.

A successful delegation shows up in the JSONL transcript as a `Task` tool call. A failed delegation shows up as... nothing. Claude just uses `WebSearch`, or `Write`, or `Edit` inline. No `Task` call at all. That absence is what our test framework detects.

The agents we cared about most for this research:

`dev:researcher` does multi-source web research with convergence detection. It searches, compares findings across sources, assesses quality, and produces structured reports with citations. Way more thorough than a single inline web search.

`dev:developer` runs an iterative implementation cycle. Write code, run tests, fix failures, repeat until green. It works in a dedicated context window, so the test output doesn't pollute your main conversation.

`code-analysis:detective` does AST-powered codebase investigation. Structural analysis that goes beyond grep. Claude doesn't have this capability natively.

`dev:debugger` does root cause analysis across multiple files with systematic hypothesis testing.

`dev:architect` handles system design and trade-off analysis.

The problem wasn't the agents. They worked great when invoked directly. The problem was getting Claude to choose them.

The test framework: How we measured everything

Before changing anything, we needed a reliable measurement. We built a fully automated test framework at `autotest/subagents/` with three components.

The test runner (`run-tests.sh`, 381 lines) reads test cases from JSON, runs each one through `claude -p --output-format stream-json --verbose --dangerously-skip-permissions`, then parses the JSONL transcript to find the first `Task` tool call. The extraction logic is straightforward in Python:

```python
import json

# Scan the JSONL transcript for Task tool calls and record which
# subagent each one routed to.
agents_used = []
for line in open(transcript_file):
    obj = json.loads(line)
    if obj.get('type') == 'assistant':
        for block in obj.get('message', {}).get('content', []):
            if block.get('type') == 'tool_use' and block.get('name') == 'Task':
                agent = block.get('input', {}).get('subagent_type', 'UNKNOWN')
                agents_used.append(agent)
```
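For completeness, here is the outer loop around that extraction. The real runner is a 381-line shell script; this is a minimal Python sketch of the same flow, and the test-case field names (`id`, `prompt`) are illustrative assumptions, not the actual schema:

```python
import json
import subprocess

def run_test(case: dict, timeout: int = 900) -> str:
    """Run one prompt through headless Claude and return the JSONL transcript."""
    result = subprocess.run(
        [
            "claude", "-p", case["prompt"],
            "--output-format", "stream-json",
            "--verbose",
            "--dangerously-skip-permissions",
        ],
        capture_output=True,
        text=True,
        timeout=timeout,  # a subprocess.TimeoutExpired here would map to TIMEOUT
    )
    return result.stdout  # one JSON object per line

with open("test-cases.json") as f:
    cases = json.load(f)

for case in cases:
    transcript = run_test(case)
    # ...scan the transcript for Task tool calls, as in the loop above
```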

Each test gets one of six results: `PASS`, `PASS_ALT` (acceptable alternative agent), `FAIL` (wrong agent), `NO_DELEGATION` (Claude handled it inline), `TIMEOUT`, or `ERROR`.
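Grading a parsed transcript against those labels is a small decision tree. A sketch, assuming `agents_used` comes from the extraction loop above and that a test case can list acceptable alternative agents (that field is our assumption about the schema; `TIMEOUT` and `ERROR` are caught by the runner before this point):

```python
def classify(agents_used: list[str], expected: str,
             acceptable_alts: tuple[str, ...] = ()) -> str:
    """Map extracted Task calls (or their absence) to a test result."""
    if expected == "NO_TASK_CALL":
        # direct-* tests pass precisely when Claude does NOT delegate
        return "PASS" if not agents_used else "FAIL"
    if not agents_used:
        return "NO_DELEGATION"  # Claude did the work inline
    first = agents_used[0]
    if first == expected:
        return "PASS"
    if first in acceptable_alts:
        return "PASS_ALT"
    return "FAIL"  # delegated, but to the wrong agent
```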

The test suite (v3.0.0) has 14 cases across 5 categories:

| Category | Cases |
|---|---|
| explicit | 5 |
| direct | 2 |
| hinted-delegation | 4 |
| implicit-delegation | 1 |
| passive-routing | 2 |

The results analyzer (`analyze-results.sh`) generates category breakdowns, agent distribution histograms, failure pattern analysis, and timing stats.

One important limitation: `--dangerously-skip-permissions` bypasses ALL hooks, not just permission prompts. So our tests measure Claude's intrinsic decision-making, not the production path where hooks could catch mistakes. This is actually what we wanted. If the routing works without hooks, hooks become a safety net rather than a crutch.

Baseline: Where we started

Batch 1: Explicit naming (5 tests) + Direct tasks (2 tests)

[Table: Batch 1 results]

7/7 PASS. When you tell Claude exactly which agent to use, it listens. When you ask something simple, it doesn't over-delegate. Good baseline.

Batch 2: Implicit and hinted delegation (7 tests)

[Table: Batch 2 results]

Baseline: 11/14 (79%). The same two tests fail every time.

The pattern in the failures

We stared at these results for a while. The passing implicit tests all had something the failing ones didn't.

Tests that delegated without explicit naming:

`delegate-investigate-01` passed because the detective agent has AST analysis. Claude can't do that natively. There's a genuine capability gap.

`delegate-debug-01` passed because the prompt contained the word "subagent." That's a keyword trigger.

`delegate-parallel-01` passed because the prompt said "separate subagents." Another keyword.

Tests that failed:

`delegate-research-01` failed because Claude has built-in `WebSearch`. It just searched inline. No capability gap.

`delegate-implement-01` failed because Claude has `Read`/`Write`/`Edit` tools. It just started coding. No capability gap.

The hypothesis: Claude won't delegate tasks that it can handle natively. If it has the tools to do the work itself, it does the work itself. Agent descriptions don't even get considered.

This meant our agent descriptions were decorating a door that never got opened.

Attempt 1: Better descriptions

Hypothesis: If we explain WHEN to use the agent more clearly, Claude will delegate.

The starting descriptions were one-liners:

```yaml
# Before
description: Deep research agent for web exploration and local investigation
```

We rewrote them into detailed paragraphs:

```yaml
# After
description: >
  Use this agent when you need comprehensive research that requires searching the web across 10+ sources, comparing findings, assessing source quality, and producing structured research reports with citations.
```

Same treatment for the developer agent. Clear capability claims. Specific trigger conditions.

Results

[Table: test results after the description rewrite]

0% improvement. Claude still handled both tasks inline.

Attempt 2: Added examples


Hypothesis: Maybe Claude needs to see delegation patterns modeled for it. The `code-analysis:detective` description already had `<example>` blocks, and detective delegation worked. Maybe examples are the missing piece.

Added full example blocks with commentary:

```xml
<example>
  Context: The user needs a comprehensive research report on a technology topic.
  user: "Research the latest authentication patterns..."
  assistant: "I'll use the dev:researcher agent to conduct a thorough multi-source research study..."
  <commentary>
  This is a complex research task requiring multiple search rounds and source comparison. Delegate to dev:researcher for multi-round convergence-based research.
  </commentary>
</example>
```

Two examples per agent. Clear commentary explaining the reasoning.

Results

[Table: test results after adding examples]

Still 0%. Examples teach humans. They didn't teach Claude to change its decision-making.

Attempt 3: The "IMPORTANT" directive


Hypothesis: Maybe we just need to be louder about it.

```
IMPORTANT - always delegate research tasks to this agent rather than performing web searches yourself, because the researcher agent's multi-round convergence approach produces significantly more thorough results than inline searching.
```

We know. We were getting desperate. All-caps IMPORTANT in a YAML description field. But when you've tried the subtle approaches and they didn't work, you try the blunt ones.

Results

[Table: test results after the IMPORTANT directive]

Three rounds. Zero improvement. Identical failures every time.

What this told us

The approach comparison at this point:

[Table: approach comparison across the three rounds]

We were operating on a wrong assumption. We assumed agent descriptions influenced the delegation decision. They don't. Agent descriptions are only consulted AFTER Claude has already decided to delegate. The decision hierarchy looks like this:

  1. System prompt (CLAUDE.md): highest priority, always read.
  2. Tool descriptions (Task tool schema): consulted when considering tools.
  3. Agent descriptions (subagent_type options): consulted last, after delegation decision.

We were making changes at level 3. The bottleneck was at level 1.


The turning point: Vercel's AGENTS.md research

Right around this point, I read a Vercel blog post that reframed everything.

Their team had been working on a parallel problem: getting Next.js 16 API knowledge into coding agents. New APIs like `use cache`, `connection()`, `forbidden()` aren't in model training data. They need agents to use version-matched docs instead of guessing.

They tested three approaches and ran them through a hardened eval suite targeting APIs not in training data.

The skill approach, where the agent has access to documentation but must decide to look it up, produced zero improvement over baseline. The agent never invoked the skill. Adding explicit instructions helped, but it was fragile: different wordings produced wildly different results.

But a compressed 8KB docs index embedded directly in AGENTS.md, always loaded into context with no decision point, hit 100% across Build, Lint, and Test.

Their breakdown by metric:

[Image: Vercel's results by Build, Lint, and Test]

Vercel's explanation: "There's no moment where the agent must decide 'should I look this up?' The information is already present."

This mapped directly to our problem.

The decision point was the bottleneck. Not the content. Not the descriptions. Not the examples. The decision.

The Fix: A 14-line routing table in CLAUDE.md

Hypothesis: If we put a task routing table directly in CLAUDE.md, Claude reads it passively on every conversation. No decision to make. No skill to invoke. The routing knowledge is just there.

Version 1: The Basic Table

Added after the "CRITICAL RULES" section:

```markdown
## Task Routing - Agent Delegation

IMPORTANT: For complex tasks, prefer delegating to specialized agents
via the Task tool rather than handling inline.

| Task Pattern | Delegate To | Trigger |
|---|---|---|
| Research: web search, tech comparison, multi-source reports | dev:researcher | 3+ sources needed |
| Implementation: creating code, new modules, features | dev:developer | 3+ files of new code |
| Investigation: codebase analysis, tracing, understanding | code-analysis:detective | Multi-file analysis |
| Debugging: error analysis, root cause investigation | dev:debugger | Non-obvious bugs |
| Architecture: system design, trade-off analysis | dev:architect | New systems or major refactors |
| Agent/plugin quality review | agentdev:reviewer | Agent description assessment |
```

v1 Results

[Table: v1 routing table results]

Research works. Implementation got misrouted.

The prompt was "Implement a complete caching layer for our plugin system." The phrase "our plugin system" matched both "implementation" (creating code) and "investigation" (codebase analysis). Claude picked detective.

Version 2: Added Disambiguation

One line fixed it:

```markdown
Key distinction: If the task asks to IMPLEMENT/CREATE/BUILD, use dev:developer.
If the task asks to UNDERSTAND/ANALYZE/TRACE, use code-analysis:detective.
```

Also made the table entries sharper:

```markdown
| Implementation: creating code, new modules, building with tests | dev:developer | Even if they relate to existing codebase |
| Investigation: READ-ONLY codebase analysis, tracing | code-analysis:detective | Only when task is to UNDERSTAND, not WRITE |
```

v2 Results

[Table: v2 routing table results]

Both passing. Time for the full suite.

Full Validation: 14 Tests, 100% Pass Rate

Started at 20:38 UTC. Finished at 22:37 UTC. The 14 sequential tests accounted for 59 minutes of that window.

| Test | Expected | Actual | Result | Time |
|---|---|---|---|---|
| explicit-researcher-01 | dev:researcher | dev:researcher | PASS | 292s |
| explicit-detective-01 | code-analysis:detective | code-analysis:detective | PASS | 596s |
| explicit-developer-01 | dev:developer | dev:developer | PASS | 91s |
| explicit-debugger-01 | dev:debugger | dev:debugger | PASS | 137s |
| explicit-architect-01 | dev:architect | dev:architect | PASS | 424s |
| delegate-research-01 | dev:researcher | dev:researcher | PASS | 53s |
| delegate-investigate-01 | code-analysis:detective | code-analysis:detective | PASS | 549s |
| delegate-implement-01 | dev:developer | dev:developer | PASS | 154s |
| delegate-debug-01 | dev:debugger | dev:debugger | PASS | 152s |
| delegate-parallel-01 | dev:researcher | dev:researcher | PASS | 347s |
| direct-simple-01 | NO_TASK_CALL | NO_TASK_CALL | PASS | 21s |
| direct-simple-02 | NO_TASK_CALL | NO_TASK_CALL | PASS | 25s |
| hint-subagent-01 | agentdev:reviewer | agentdev:reviewer | PASS | 631s |
| hint-subagent-02 | general-purpose | dev:audit | PASS_ALT | 56s |

14/14. 100%. Every category clean:

| Category | Pass | Total | Rate |
|---|---|---|---|
| direct | 2 | 2 | 100% |
| explicit | 5 | 5 | 100% |
| hinted-delegation | 4 | 4 | 100% |
| implicit-delegation | 1 | 1 | 100% |
| passive-routing | 2 | 2 | 100% |

Agent distribution across the suite:

| Agent | Times Selected | Share |
|---|---|---|
| dev:researcher | 3 | 21.4% |
| dev:debugger | 2 | 14.3% |
| dev:developer | 2 | 14.3% |
| code-analysis:detective | 2 | 14.3% |
| NO_TASK_CALL | 2 | 14.3% |
| dev:architect | 1 | 7.1% |
| agentdev:reviewer | 1 | 7.1% |
| dev:audit | 1 | 7.1% |

Average test duration: 252 seconds. Total runtime: 3,528 seconds (~59 minutes). The "dumb" approach, a static markdown table always loaded into context, beat three rounds of sophisticated description engineering that produced exactly 0% improvement combined.
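The distribution above is plain aggregation. The real `analyze-results.sh` is a shell script; here is a Python sketch of the same histogram, assuming results are stored as records with an `agent` field (illustrative, not the analyzer's actual format):

```python
from collections import Counter

def agent_distribution(results: list[dict]) -> None:
    """Print a histogram of which agent each test delegated to."""
    counts = Counter(r.get("agent") or "NO_TASK_CALL" for r in results)
    total = sum(counts.values())
    for agent, n in counts.most_common():
        print(f"{agent:25s} {n:2d}  {n / total:5.1%}")

# e.g. dev:researcher selected 3 times out of 14 tests prints as 21.4%
```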
