Subagent-Driven Development

Overview

Execute implementation plans by dispatching fresh subagents for outcome-sized milestones. Keep mechanical RED/GREEN steps inside each implementer brief. Use systematic review at meaningful boundaries rather than automatically spawning two reviewers after every helper or tiny task.

Core principle: Fresh context for a coherent outcome + controller verification + risk-adjusted review = high quality without process becoming the bottleneck.

Orchestration modes

Choose before dispatching:

Fast-track outcome mode (default for pure/fake/internal work): one implementer owns an end-to-end milestone, the controller runs focused/full gates, then one combined spec/security review happens at the boundary.
High-risk boundary mode: use separate spec then quality reviews when the milestone independently changes authentication, durable authority, destructive rollback, real infrastructure, customer data, or migration semantics.
Tiny repair mode: for a bounded reviewer finding, patch + regression test + targeted re-review; do not restart the full discovery/implementation ceremony.
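The mode choice above can be sketched as a small decision function. This is only an illustration: `choose_mode`, the `Milestone` fields, and the risk-tag names are assumptions made for the example, not part of any Hermes API. Checking risk before the repair flag is also a choice of this sketch, so that a repair touching a high-risk boundary still gets the stricter review.

```python
from dataclasses import dataclass, field

# Hypothetical tag names paraphrasing the high-risk boundaries listed above.
HIGH_RISK_TAGS = {
    "authentication",
    "durable-authority",
    "destructive-rollback",
    "real-infrastructure",
    "customer-data",
    "migration-semantics",
}


@dataclass
class Milestone:
    name: str
    risk_tags: set = field(default_factory=set)
    is_reviewer_repair: bool = False  # bounded fix for a reviewer finding


def choose_mode(milestone: Milestone) -> str:
    """Return the orchestration mode for one milestone."""
    if milestone.risk_tags & HIGH_RISK_TAGS:
        return "high-risk-boundary"  # separate spec review, then quality review
    if milestone.is_reviewer_repair:
        return "tiny-repair"  # patch + regression test + targeted re-review
    return "fast-track"  # one implementer, one combined review
```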

Do not use the number of plan bullets to decide subagent count. Group adjacent bullets that touch the same files/state machine and share one acceptance test.
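A minimal sketch of that grouping rule, assuming each plan bullet has already been annotated with the files it touches and its acceptance test. The dictionary keys and the `group_bullets` name are hypothetical, and exact equality of the file set is a simplification of "touch the same files/state machine".

```python
from itertools import groupby


def group_bullets(bullets):
    """Group adjacent plan bullets that share files and an acceptance test.

    Each bullet is a dict with the assumed keys 'text', 'files', and
    'acceptance_test'. Only neighbouring bullets are merged, because
    groupby starts a new group whenever the key changes.
    """
    def key(bullet):
        return (frozenset(bullet["files"]), bullet["acceptance_test"])

    return [[b["text"] for b in group] for _, group in groupby(bullets, key=key)]
```

Each inner list is one candidate delegation, so three bullets that collapse into two groups mean two subagents, not three.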

When to Use

Use this skill when:

- You have an implementation plan (from writing-plans skill or user requirements)
- Tasks are mostly independent
- Quality and spec compliance are important
- User asks to “spin up a subagent” for a focused design/code pass while the controller continues platform inspection or deployment work

vs. manual execution:

- Fresh context per task (no confusion from accumulated state)
- Automated review process catches issues early
- Consistent quality checks across all tasks
- Subagents can ask questions before starting work

The Process

1. Read and Parse Plan

Read the plan file. Extract ALL tasks with their full text and context upfront. Create a todo list:

# Read the plan
read_file("docs/plans/feature-plan.md")

# Create todo list with all tasks
todo([
    {"id": "task-1", "content": "Create User model with email field", "status": "pending"},
    {"id": "task-2", "content": "Add password hashing utility", "status": "pending"},
    {"id": "task-3", "content": "Create login endpoint", "status": "pending"},
])

Key: Read the plan ONCE. Extract everything. Don't make subagents read the plan file — provide the full task text directly in context.

2. Per-Task Workflow

For EACH task in the plan:

Step 1: Dispatch Implementer Subagent

Use delegate_task with complete context:

delegate_task(
    goal="Implement Task 1: Create User model with email and password_hash fields",
    context="""
    TASK FROM PLAN:
    - Create: src/models/user.py
    - Add User class with email (str) and password_hash (str) fields
    - Use bcrypt for password hashing
    - Include __repr__ for debugging

    FOLLOW TDD:
    1. Write failing test in tests/models/test_user.py
    2. Run: pytest tests/models/test_user.py -v (verify FAIL)
    3. Write minimal implementation
    4. Run: pytest tests/models/test_user.py -v (verify PASS)
    5. Run: pytest tests/ -q (verify no regressions)
    6. Commit: git add -A && git commit -m "feat: add User model with password hashing"

    PROJECT CONTEXT:
    - Python 3.11, Flask app in src/app.py
    - Existing models in src/models/
    - Tests use pytest, run from project root
    - bcrypt already in requirements.txt
    """,
    toolsets=['terminal', 'file']
)

Step 2: Dispatch Spec Compliance Reviewer

After the implementer completes, verify against the original spec:

delegate_task(
    goal="Review if implementation matches the spec from the plan",
    context="""
    ORIGINAL TASK SPEC:
    - Create src/models/user.py with User class
    - Fields: email (str), password_hash (str)
    - Use bcrypt for password hashing
    - Include __repr__

    CHECK:
    - [ ] All requirements from spec implemented?
    - [ ] File paths match spec?
    - [ ] Function signatures match spec?
    - [ ] Behavior matches expected?
    - [ ] Nothing extra added (no scope creep)?

    OUTPUT: PASS or list of specific spec gaps to fix.
    """,
    toolsets=['file']
)

If spec issues found: Fix gaps, then re-run spec review. Continue only when spec-compliant.

Step 3: Dispatch Code Quality Reviewer

After spec compliance passes:

delegate_task(
    goal="Review code quality for Task 1 implementation",
    context="""
    FILES TO REVIEW:
    - src/models/user.py
    - tests/models/test_user.py

    CHECK:
    - [ ] Follows project conventions and style?
    - [ ] Proper error handling?
    - [ ] Clear variable/function names?
    - [ ] Adequate test coverage?
    - [ ] No obvious bugs or missed edge cases?
    - [ ] No security issues?

    OUTPUT FORMAT:
    - Critical Issues: [must fix before proceeding]
    - Important Issues: [should fix]
    - Minor Issues: [optional]
    - Verdict: APPROVED or REQUEST_CHANGES
    """,
    toolsets=['file']
)

If quality issues found: Fix issues, re-review. Continue only when approved.

Step 4: Mark Complete

todo([{"id": "task-1", "content": "Create User model with email field", "status": "completed"}], merge=True)

3. Final Review

After ALL tasks are complete, dispatch a final integration reviewer:

delegate_task(
    goal="Review the entire implementation for consistency and integration issues",
    context="""
    All tasks from the plan are complete. Review the full implementation:
    - Do all components work together?
    - Any inconsistencies between tasks?
    - All tests passing?
    - Ready for merge?
    """,
    toolsets=['terminal', 'file']
)

4. Verify and Commit

# Run full test suite
pytest tests/ -q

# Review all changes
git diff --stat

# Final commit if needed
git add -A && git commit -m "feat: complete [feature name] implementation"

Controller acceptance after a subagent commit

A subagent summary, commit, and GREEN suite are claims, not milestone approval. This matters most when the worker authored both a side-effect harness and its tests: implementation and tests can share the same blind spots.

Before accepting a committed result:

Verify git status, git log, the commit stat, and exact changed files.
Read the changed implementation and tests; do not rely on test names or counts.
Map every safety requirement from the brief to a concrete code path and assertion.
Adversarially inspect boundaries the worker may have normalized away: raw secrets written temporarily, existence mistaken for exact ownership, persisted PIDs treated as authority, same-tenant cross-operation conflicts, and logical cleanup that leaves a process alive.
Add missing regressions and harden the code before approval.
Run focused tests, full tests, live/default-state invariants, production-wiring checks, and independent residual-process checks.
Commit controller hardening separately when it materially changes the implementation; this preserves the audit trail of what independent verification caught.

For local/synthetic infrastructure harnesses, load references/disposable-infrastructure-harness-review.md for the exact sandbox, ownership, secret, PID, isolation, and zero-residual gate.
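The first acceptance step, checking the exact changed files against what the worker reported, could look roughly like this. The function names are hypothetical. `changed_files` shells out to `git show --name-only`, and `compare_claim` is a pure comparison that can be exercised without a repository.

```python
import subprocess


def changed_files(commit="HEAD", cwd="."):
    """List the files touched by a commit, as reported by git itself."""
    out = subprocess.run(
        ["git", "show", "--name-only", "--pretty=format:", commit],
        cwd=cwd, check=True, capture_output=True, text=True,
    ).stdout
    return sorted(line for line in out.splitlines() if line.strip())


def compare_claim(claimed, actual):
    """Compare the files a subagent says it changed with what git shows."""
    claimed, actual = set(claimed), set(actual)
    return {
        "unreported": sorted(actual - claimed),  # changed but never mentioned
        "missing": sorted(claimed - actual),     # mentioned but not in the commit
    }
```

A non-empty "unreported" list is a prompt to read those files, not a verdict. The sketch covers only the file inventory, not the reading and adversarial inspection steps.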

Task Granularity

Delegate a coherent outcome that fits the worker’s time budget and has one clear acceptance gate. The prompt may contain many 2–5 minute RED/GREEN steps, but they are not separate delegations unless they touch independent files and can safely run in parallel.

Too fragmented:

- one worker writes a validator;
- another adds one state;
- two reviewers inspect each micro-diff;
- integration is deferred.

Right size:

- "Complete durable compensation/manual-recovery/atomic terminal behavior in the store with focused and full tests."
- "Integrate the fake executor with durable recovery and prove the critical crash matrix."

Split further only when the batch cannot complete inside the worker budget, files conflict, or a sub-boundary independently carries high risk.

Red Flags — Never Do These

Start implementation without a plan
Skip the required review at a meaningful milestone or high-risk authority boundary
Proceed with unfixed critical/important findings
Dispatch multiple implementation subagents for tasks that touch the same files
Treat every mechanical step as a separate delegation/review cycle; group coherent work by outcome
Update trackers/docs more often than working behavior advances
For parallel backend/frontend work, omit a shared endpoint/schema contract or assume separately passing unit tests prove integration. The controller owns an end-to-end API contract test and exact production-launch smoke.
Treat a timeout with no summary as proof that no files landed; reconcile the worktree and run targeted tests first.
Make the subagent read the plan file (provide full text in context instead)
Skip scene-setting context (subagent needs to understand where the task fits)
Ignore subagent questions (answer before letting them proceed)
Accept "close enough" on spec compliance
Skip the repair/re-review loop when a milestone reviewer found critical/important issues
Let implementer self-review replace an independent review when the milestone requires one
Start code quality review before spec compliance is PASS when using separate two-stage review
Move to the next milestone while its required review still has open blockers
Collapse a comprehensive plan into one giant unverified edit. Outcome batching may combine adjacent slices, but the batch must stay time-bounded and end with integrated tests plus the appropriate milestone review.
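The same-files rule lends itself to a mechanical check before any parallel dispatch. A minimal sketch, assuming the controller already knows which files each task is expected to touch; the task names and the `conflicting_pairs` helper are illustrative only.

```python
from itertools import combinations


def conflicting_pairs(tasks):
    """Return (task_a, task_b, shared_files) for tasks that overlap.

    `tasks` maps a task name to the file paths it is expected to touch.
    """
    conflicts = []
    for (name_a, files_a), (name_b, files_b) in combinations(tasks.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            conflicts.append((name_a, name_b, sorted(shared)))
    return conflicts
```

Any pair returned here should be merged into one delegation or run sequentially rather than in parallel.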

Pre-flight discovery before delegating

Before writing the subagent goal/context, spend 30–60s in the controller doing a fast grep/read to learn what already exists for the task. Subagents that re-explore well-trodden code burn 20+ tool calls and frequently time out at 600s.

Concretely, for any "add feature X to repo Y" task:

search_files or grep for the feature's likely keywords (e.g. anthropic, Anthropic, OAuth, provider) across the target repo.
If matches show pre-existing helpers, endpoints, or matrix entries, your task changes shape: from "build X" to "enable / surface / smoke-test X."
State this discovery explicitly in the subagent's context field with a KEY FACT preamble:

"KEY FACT: Anthropic plumbing is ALREADY IMPLEMENTED in this repo. Do NOT redo it. Just enable and smoke-test it. - installAnthropicOauthCredential exists at server.mjs:250 - anthropic and anthropic-oauth entries already in provider-matrix.mjs but likely filtered out by an enabled: false flag."

Give precise line numbers, function names, and the suspected reason it isn't yet live. The subagent then dives straight to the right files instead of re-discovering them.

Failure mode to avoid: vague "add Anthropic as a provider" without pre-flight check → subagent reads the whole codebase trying to figure out what's there → 600s timeout. Same task with KEY FACT preamble: completes in ~3 minutes.

Pre-flight applies even to parallel batches: a 60s controller-side grep across both target repos is cheaper than 20 minutes of parallel subagent thrash.
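Once the grep hits are in hand, assembling the preamble is mechanical. The helper below is a sketch under assumptions: `key_fact_preamble` and the shape of `hits` are made up for illustration, and the wording follows the example preamble quoted above.

```python
def key_fact_preamble(feature, hits, suspected_reason):
    """Build a KEY FACT preamble for a subagent's context field.

    `hits` is a list of (symbol, path, line) tuples from the controller's
    pre-flight grep; `suspected_reason` says why the feature is not live yet.
    """
    lines = [
        f"KEY FACT: {feature} plumbing is ALREADY IMPLEMENTED in this repo. "
        "Do NOT redo it. Just enable and smoke-test it."
    ]
    for symbol, path, line in hits:
        lines.append(f"- {symbol} exists at {path}:{line}")
    lines.append(f"- Suspected reason it is not live: {suspected_reason}")
    return "\n".join(lines)
```

The returned string is meant to be placed at the top of the `context` argument so the subagent reads it before anything else.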

Common pre-flight discoveries that reshape the task

When grep hits show prior work, the task usually mutates from "build X" to one of these:

Feature-flagged off. Code exists, helpers are wired, but a single enabled: false (or env-var gate, or filtered list) hides it from the public surface. The job is to flip the flag and smoke-test, not to re-author. Astral's provider-matrix.mjs had full anthropic + anthropic-oauth entries with enabled: false — flipping one boolean per entry surfaced them in /api/providers. Always grep for enabled, disabled, deferred, comingSoon, beta, etc. near the feature's data.
Server has it, client doesn't (or vice versa). A providers const in React/Vue, a label map, an enum, a switch statement — these are config-in-two-places traps. Find the server-side source of truth, then grep the client for the same provider/feature keys; if they don't match, the public UI silently omits the option even though the API supports it. Example: Astral /api/providers returned anthropic + anthropic-oauth after the matrix flip, but web/src/main.jsx had a hardcoded providers object listing only openai-codex and openai. User saw no Claude option on the wizard.
Helpers exist but no route calls them. Functions like installAnthropicOauthCredential, exchangeAnthropicAuthorization, and buildAnthropicAuthorizeUrl were all present in server.mjs, but no app.post('/api/provider/anthropic/...') handler invoked them. Search for the helper name in route definitions; if zero matches, the feature is half-built and your job is the wiring, not the implementation.
Code shipped to one repo, missing from a sibling. Pre-flight both repos before parallel delegation. The Codex device-auth flow was complete in one and partial in the other.

Document the discovery as a KEY FACT preamble in the subagent context so it spends its budget on the right slice.
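Two of these discoveries, feature-flagged entries and server/client key drift, reduce to simple set operations once the data is loaded. The sketch assumes the provider matrix has been read into a plain dictionary with an `enabled` field per entry. That shape and both function names are assumptions for illustration, not the real layout of provider-matrix.mjs.

```python
def disabled_entries(matrix):
    """Return keys whose entry is explicitly switched off."""
    return sorted(key for key, entry in matrix.items()
                  if not entry.get("enabled", True))


def missing_on_client(server_keys, client_keys):
    """Return keys the server exposes but the client never lists."""
    return sorted(set(server_keys) - set(client_keys))
```

With the Astral example above, a server list of openai-codex, openai, anthropic, and anthropic-oauth compared against a client list of openai-codex and openai would report the two Anthropic keys as missing on the client.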

Iteration-budget watchdog

Subagents can hit max_iterations (default ~50 calls) before they finish, even when the code is fully written. When that happens:

The summary will say "exit_reason": "max_iterations" and explicitly call out work it did NOT complete (e.g. "Did not commit", "Did not deploy", "Did not run live smoke").
The controller MUST finish that work directly — do not re-delegate the same task and pay the discovery cost again. The subagent already did the hard part; the controller just does the final git commit && git push && pm2 restart && curl smoke lap.
Treat the subagent's recommended next steps as a checklist for the controller, not as instructions to spawn another agent.
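Turning the summary into a controller checklist can be sketched as follows. The exact summary format is not specified here, so this example assumes the unfinished work appears as free text in "Did not ..." sentences, as in the examples above. `controller_todo` is a hypothetical helper.

```python
import re


def controller_todo(exit_reason, summary_text):
    """Extract the steps a max_iterations subagent reported as unfinished."""
    if exit_reason != "max_iterations":
        return []
    # Capture the text after each "Did not", up to a period, quote, or newline.
    steps = re.findall(r'Did not ([^.\n"]+)', summary_text)
    return [step.strip() for step in steps]
```

The returned list is for the controller to work through directly; nothing in it should trigger a new delegation.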

Timeout ≠ no work done — always verify state before re-dispatching

If a subagent exits with status: "timeout" (e.g. after 600s with N completed API calls), DO NOT assume nothing landed. The timeout commonly fires on the final return-summary step after all the substantive work already completed on disk.

Before re-dispatching or telling the user the task failed, verify external state directly:

File/code work: git status -s, git log --oneline -5, and target-file inventory greps (e.g. grep -l '^## Headline' concepts/*.md | wc -l).
External API work (HTTP POST, S3 upload, DB write): query the target system directly for the artifact (URL, ID, row).
Long-running batch passes (N items): count completed items vs. expected — partial completion is a useful state, not a failure.

Documented case (decan-synthesis pass, 2026-05-22): a subagent timed out at 600s with 26 API calls — but had already edited all 36 files AND committed locally. Re-dispatching would have wasted ~10 minutes and risked a double-commit. Recovery was a 30-second git log check followed by a targeted second pass on just the contaminated subset.

This is a corollary of the general "subagent self-reports are claims, not facts" rule — verify externally. The novel twist with timeouts is that the absence of a self-report doesn't mean absence of side effects.
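For batch passes, the count of completed versus expected items can be expressed as a small classifier. This is a sketch with hypothetical names; how the completed items are detected (git log, grep inventory, a query against the target system) is left to the controller.

```python
def batch_state(expected, completed):
    """Classify a timed-out batch pass from externally verified state.

    Returns (status, remaining), where status is 'complete', 'partial',
    or 'nothing-landed', and remaining lists the items still to do.
    """
    expected, completed = set(expected), set(completed)
    remaining = sorted(expected - completed)
    if not expected & completed:
        status = "nothing-landed"
    elif remaining:
        status = "partial"
    else:
        status = "complete"
    return status, remaining
```

A "partial" result gives the exact subset for a targeted second pass instead of a full re-dispatch.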

Hard-budget decomposition and partial-TDD recovery

A worker with a 10-minute ceiling should receive a task sized for 2–5 minutes of implementation, leaving room for discovery, tests, and commit. Do not bundle a storage kernel, multiple adapters, migration CLI, server cutover, docs, and every integration gate into one delegation.

After any timeout:

Reconcile the worktree before re-dispatching.
Run syntax/type checks before the focused suite. A timed-out worker may leave a nearly complete module that cannot parse (for example, a public method calling an undeclared private method); test output is otherwise noisy or misleading.
If only tests landed, treat them as a valid TDD RED scaffold: confirm the expected failure, then finish the bounded GREEN implementation directly.
If code landed without tests, add the missing regression test before acceptance.
If nothing landed, make the next task materially smaller rather than repeating the same prompt.
Do not keep re-delegating the final small remainder after the worker already paid the discovery cost; the controller may finish and verify that bounded remainder directly.
Audit required coverage at the assertion/path level, not by test names or green counts. A test titled “terminal transitions reject” may never reach a terminal state. Confirm each required state/edge/failure path is actually constructed and asserted; add the missing regression before commit.
When a bounded task still grows into a large schema migration plus state machine, split the next similar task into separate checkpoints: schema/read compatibility first, mutation/state transitions second, public integration third.
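The first few branches of that recovery list can be sketched as a decision function. It assumes a Python target so that `ast.parse` can stand in for the syntax check; other stacks would swap in their own parser or type checker. The function names and returned strings are illustrative.

```python
import ast


def python_parses(source):
    """Return True if the source text is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def next_action(tests_landed, code_landed, parses):
    """Pick the controller's next step after a timed-out worker."""
    if not tests_landed and not code_landed:
        return "make the next task materially smaller"
    if not parses:
        return "fix syntax/type errors before running the focused suite"
    if tests_landed and not code_landed:
        return "confirm the expected RED failure, then finish GREEN directly"
    if code_landed and not tests_landed:
        return "add the missing regression test before acceptance"
    return "run focused tests, then the full suite"
```

The syntax check comes before the test-based branches because test output from a module that cannot parse is noisy or misleading.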

Fresh worktrees also need dependency-path verification: relative file: package dependencies can resolve against .worktrees/... instead of the real sibling checkout. Repair only ignored local dependency links for verification; do not rewrite manifests or commit node_modules for a topology-specific issue.

See references/timeboxed-worktree-recovery.md for the reconciliation decision tree, incremental task shapes, and safe worktree dependency repair.

Truthful liveness and visible progress

For chat-driven execution—especially Telegram—external trackers and background workers do not replace visible chat checkpoints.

Delegation liveness watchdog and stalled-worker recovery

A successful delegate_task response proves dispatch, not continuing progress. For user-visible build work, establish a real liveness checkpoint instead of leaving “in progress” labels untouched indefinitely.

When the user asks “how is it going?” or a bounded worker has been silent for roughly 10–15 minutes, inspect the shared worktree directly: current UTC time, git status, recent commit, diff stat, recent target-file mtimes, and attributable test/build processes. Report built/not-built, committed/not-committed, and tested/not-tested first.
A batch delegation returns only after every child completes, so one stalled child can hide useful completed siblings. When per-lane visibility matters, prefer separate single-task delegations over one tasks=[...] batch, or make every batch child small and bounded.
There is no mid-flight conversational “nudge” channel for an already-dispatched child. Never claim to have nudged one merely because status was checked. If there are no fresh coherent writes, no attributable process, no commit, and no completion delivery, classify it as stalled and recover from filesystem evidence.
Recovery order: reconcile the worktree; run syntax/type checks; run the smallest focused tests; treat useful failing tests as a TDD RED scaffold; finish the bounded GREEN remainder directly or dispatch a fresh, materially smaller recovery worker with exact failures and files. Do not restart discovery or discard good partial work.
If a replacement worker is dispatched, say explicitly that the original worker stalled and that the replacement received a smaller recovery brief. Do not describe the original worker as still active.
Do not let a status check become the only action. If the lane is stalled, start the recovery test or replacement delegation in the same response.

Never say “running” unless work is actually active. Ground liveness in a dispatched subagent, tracked background process, or tool call executing in the current turn. If the prior response ended and no worker/process remains active, say the work is paused or not yet started.
A status report is not execution. If the user says “continue,” “keep going,” or asks whether work is running, immediately make the next concrete tool call in the same response rather than merely restating the plan.
Send concise checkpoints at meaningful transitions: - implementation/repair started; - RED reproduced; - focused tests GREEN; - full suite GREEN and commit created; - spec/security review dispatched; - reviewer blocker found and exact repair started; - final approval or explicit blocker.
Do not imply internal tool output streams automatically. Telegram users see explicit assistant checkpoint messages and asynchronous completion deliveries, not every terminal/file operation. Say this plainly when relevant.
Do not fabricate heartbeat updates. While a background delegation is active, report its real dispatched goal once and let its completion notification return. Do not claim intermediate progress that cannot be inspected, and do not poll background delegations.
Update durable trackers after verified state changes, not instead of chat updates. Checkboxes remain unchecked while work is active/in review/blocked; status text records the real blocker and next direction. A tracker update should accompany a chat checkpoint when the user asked for streaming visibility. Choose exactly one canonical tracker page per implementation program: never mirror the same mutable checklist/status block across multiple research, idea, or planning pages. If duplicate trackers already exist, keep the user-selected canonical page, remove the checklist from every other page, and verify the public URLs directly (canonical page has one tracker; secondary pages have zero). Secondary pages may link to the canonical tracker but must not carry another independently updated copy.
Correct liveness mistakes directly. If silence reflected a genuine pause, admit it without euphemism, start the next action, and then provide evidence-backed checkpoints.
Audit stale “in progress” labels on demand. When a tracker has shown “in progress” for a long time, do not repeat the label. Check current time, candidate file mtimes/sizes, git status, recent commits, focused-test artifacts, and any tracked process/delegation evidence. Classify the work as actively progressing, stalled, or finished-but-not-reported. A missing normal OS process is not decisive for API-backed delegations; fresh coherent file writes are stronger progress evidence. Conversely, an old tracker label with no fresh repository/process evidence is not proof of activity.
Expose the checkpoint, not just the phase. Update the durable tracker with a UTC timestamp and concrete evidence such as “RED test scaffold added,” exact files, last verified test count, and what does not exist yet. Keep the phase marked active only while a worker/process is truly active or fresh evidence shows current progress. This prevents a long-lived milestone label from masquerading as present-tense execution.

See references/stale-in-progress-liveness-audit.md for the compact evidence hierarchy, classification rules, and tracker wording pattern.
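The three-way classification described above can be sketched as a function over the evidence the controller collects. The `Evidence` fields, the `classify` name, and the 15-minute freshness window are assumptions for the example; the referenced audit document remains the authority on the actual evidence hierarchy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    seconds_since_last_write: Optional[float]  # None if no target-file writes found
    has_attributable_process: bool
    has_new_commit: bool


def classify(evidence, fresh_window_s=900.0):
    """Classify a lane as actively progressing, finished-but-not-reported, or stalled."""
    fresh = (
        evidence.seconds_since_last_write is not None
        and evidence.seconds_since_last_write <= fresh_window_s
    )
    # Fresh writes count even without an OS process (API-backed delegations).
    if fresh or evidence.has_attributable_process:
        return "actively progressing"
    if evidence.has_new_commit:
        return "finished-but-not-reported"
    return "stalled"
```

The sketch does not judge whether recent writes are coherent; that still requires reading the changed files.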

Handling Issues

If Subagent Asks Questions

Answer clearly and completely
Provide additional context if needed
Don't rush them into implementation

If Reviewer Finds Issues

Implementer subagent (or a new one) fixes them
Reviewer reviews again
Repeat until approved
Don't skip the re-review

If Subagent Fails a Task

If no useful implementation or diagnosis landed, dispatch a new, smaller fix subagent with specific instructions about what went wrong.
If the worktree already contains a coherent partial implementation or TDD RED scaffold, do not pay the discovery cost again: the controller may finish the bounded remainder, then run focused and full verification.
Never accept partial work merely because the worker timed out; reconcile, complete, review, and verify it.

Efficiency Notes

Why fresh context per outcome:

- prevents context pollution without repeatedly paying discovery cost for adjacent work;
- gives one worker enough ownership to integrate the behavior;
- keeps controller verification focused on an externally meaningful result.

Why risk-adjusted review:

- milestone review catches integration/spec/security issues together;
- separate spec then quality stages remain valuable for independently high-risk authority boundaries;
- tiny repairs need targeted re-review, not a complete ceremony restart.

Cost trade-off:

- More review is not automatically more safety. Repeatedly reviewing micro-diffs can delay integration and create status/document churn.
- Spend review budget where a coherent behavior can actually be evaluated. Preserve strict gates before real effects, customer data, migration, or destructive actions.

Focused UI/UX Subagent Pattern

When using a specialist Hermes profile (for example HERMES_PROFILE=build hermes chat ...) for coding from Telegram, do not run it as an unlogged fire-and-forget terminal(background=true) job. Process registry entries can disappear after completion/restart, leaving no final transcript in the parent turn. Prefer one of these durable patterns instead:

Foreground terminal() with a generous timeout when the task is expected to finish inside the tool limit.
tmux session with a named pane plus periodic tmux capture-pane progress checks.
Background shell wrapper that tees stdout/stderr to a known log file under /tmp/hermes-<task>.log, then always inspect the log and repo state before continuing.

If a spawned profile process vanishes or times out, immediately reconcile with git status -sb, target file reads/searches, and any log/session ID before reporting status. Never assume no work happened, and never rely solely on the process table for the final result.
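The logged-background pattern can be approximated in Python as below. This is a sketch, not the Hermes mechanism: it redirects output to the log rather than tee-ing it, the helper names are hypothetical, and the demo command is a stand-in for a real profile invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def start_logged(cmd, log_path):
    """Start cmd in the background with stdout and stderr sent to log_path."""
    with open(log_path, "w") as log:
        # The child inherits the file descriptor, so closing our handle is safe.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)


def read_log(log_path):
    """Return the log contents, or an empty string if no log exists yet."""
    path = Path(log_path)
    return path.read_text() if path.exists() else ""


if __name__ == "__main__":
    # Stand-in command; a real run would launch the profile process instead.
    log_file = Path(tempfile.gettempdir()) / "hermes-demo-task.log"
    proc = start_logged([sys.executable, "-c", "print('build ok')"], log_file)
    proc.wait()
    print(read_log(log_file))
```

Because the log survives the process, the controller can inspect it together with the repository state even after the process registry entry is gone.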

When Alex explicitly asks to “spin up a subagent” for a web UI/design cleanup, use a focused implementer subagent rather than doing the visual pass in the controller session. Give it:

exact project path and stack
live URL and deployment context
visual direction (e.g. “Typeform-inspired, minimal light-gray, mobile-first, step-by-step”)
product copy requirements and user-instruction requirements
constraints about preserving existing API behavior and avoiding secrets exposure
expected local verification (npm run build, touched files, commit message if appropriate)

After the subagent returns, the controller must still:

Inspect the resulting git diff/log.
Run the build/tests directly.
Restart/deploy PM2/nginx as needed.
Verify public health/root URL.
Commit/push if the subagent did not already do so, or push its commit if it did.
Summarize both the subagent result and controller verification.

Do not treat the subagent’s success as deployment verification; it is an implementation pass, not the final release gate.

Integration with Other Skills

With writing-plans

This skill EXECUTES plans created by the writing-plans skill:

1. User requirements → writing-plans → implementation plan
2. Implementation plan → subagent-driven-development → working code

With test-driven-development

Implementer subagents should follow TDD:

1. Write failing test first
2. Implement minimal code
3. Verify test passes
4. Commit

Include TDD instructions in every implementer context.

With requesting-code-review

The two-stage review process IS the code review. For final integration review, use the requesting-code-review skill's review dimensions.

With systematic-debugging

If a subagent encounters bugs during implementation:

1. Follow systematic-debugging process
2. Find root cause before fixing
3. Write regression test
4. Resume implementation

Example Workflow

[Read plan: docs/plans/auth-feature.md]
[Create todo list with 5 tasks]

--- Task 1: Create User model ---
[Dispatch implementer subagent]
  Implementer: "Should email be unique?"
  You: "Yes, email must be unique"
  Implementer: Implemented, 3/3 tests passing, committed.

[Dispatch spec reviewer]
  Spec reviewer: ✅ PASS — all requirements met

[Dispatch quality reviewer]
  Quality reviewer: ✅ APPROVED — clean code, good tests

[Mark Task 1 complete]

--- Task 2: Password hashing ---
[Dispatch implementer subagent]
  Implementer: No questions, implemented, 5/5 tests passing.

[Dispatch spec reviewer]
  Spec reviewer: ❌ Missing: password strength validation (spec says "min 8 chars")

[Implementer fixes]
  Implementer: Added validation, 7/7 tests passing.

[Dispatch spec reviewer again]
  Spec reviewer: ✅ PASS

[Dispatch quality reviewer]
  Quality reviewer: Important: Magic number 8, extract to constant
  Implementer: Extracted MIN_PASSWORD_LENGTH constant
  Quality reviewer: ✅ APPROVED

[Mark Task 2 complete]

... (continue for all tasks)

[After all tasks: dispatch final integration reviewer]
[Run full test suite: all passing]
[Done!]

Remember

Fresh context per coherent outcome
TDD microsteps inside the implementer brief
Controller verifies real execution output
Review at meaningful/risky boundaries
One bounded repair loop for blockers
Never trade hard safety gates for speed
Never let ceremony hide the short critical path

Quality comes from tested outcomes and well-placed gates—not the number of review cycles.