--- name: cron-job-cost-design description: Design and triage scheduled cron jobs for autonomous Hermes agent loops (research workers, wiki ingest swarms, monitors) so they don't quietly burn LLM credits. Covers cadence/toolset cost analysis, script-only vs LLM-agent tradeoffs, the pause-everything kill switch, and post-mortem inspection. when_to_use: Use when creating a new recurring background job, when the user reports unexpected quota/credit burn (HTTP 429, "usage limit reached"), when auditing why an autonomous loop went silent, or when picking which scheduled work to pause vs keep. --- # Cron Job Cost Design Alex runs autonomous Hermes loops 24/7 (Hermes Trader research worker, HD Wiki Swarm orchestrator, etc.). These are great until they silently eat his LLM credits and he ends up forced onto a fallback model. This skill is the cost-aware design and triage playbook. ## Core rule **The cost of a cron job is roughly: `runs_per_day × tokens_per_run × delegation_fanout`.** A job scheduled every 10m with the delegation toolset is **not the same** as a script that runs every 10m. Treat them as different categories. | Type | Token cost | Use for | |---|---|---| | `no_agent=True` script-only (watchdog, recorder, ingest scanner) | **zero LLM tokens** | Deterministic checks, file scanners, status snapshots, "anything new?" probes, audit-trail recorders | | LLM agent, minimal toolsets, sparse cadence (daily / hourly) | low | Daily digests, end-of-day summaries | | LLM agent, frequent cadence (every 10–30m), broad toolsets | **medium–high** | Reserve for genuinely reasoning-heavy work that cannot be scripted | | LLM agent **with delegation toolset** at high cadence | **🔥 highest — multiplies with every spawned subagent** | Almost never. Use only with hard repeat caps and a credit budget | When in doubt, **default to script-only** and let the script enqueue a single LLM call only when its deterministic check finds something worth reasoning about. ## When creating a new recurring job Before you call `cronjob action=create`, ask yourself: 1. **Can this be a script?** If the work is "scan a directory / check a DB row / poll an endpoint / record a metric," the answer is yes. Use `no_agent=True` with `script=...`. Empty stdout = silent. Non-empty stdout = delivered verbatim. 2. **What's the per-run token budget?** A reasoning agent at every 10m is **144 runs/day**. At 30m it's 48. At daily cron it's 1. Pick the slowest cadence the use case tolerates. 3. **Does it need delegation?** If yes, set `repeat` to a finite number (`repeat: 96` for a 48h capped loop, etc.), or pair with a parallel `no_agent` "did anything actually change?" gate so the LLM doesn't fire on no-op ticks. 4. **Set `enabled_toolsets` to the minimum that works.** `terminal`, `file`, `web` is usually enough. Don't enable `delegation` unless the job genuinely needs to fan out. 5. **Use `deliver=local`** for high-cadence research workers so chat doesn't spam. Use `deliver=origin` only for digests/summaries the user actually reads. 6. **Document the audit trail.** Each run should leave a footprint somewhere durable (DB row, manifest, swarm-state.json) so a future session can reconstruct progress without re-reading every cron output file. ## The kill-switch playbook (token burn emergency) When the user notices they're forced onto a fallback model or sees 429s in cron output: 1. **List jobs first:** `cronjob action=list`. Never guess job_ids. 2. **Categorize by cost:** look at `model`, `enabled_toolsets`, `schedule`, and `script` field. Flag any LLM-agent job at ≤30m cadence, especially ones with `delegation` in enabled_toolsets. 3. **Pause everything by default** when the user says "stop them all." Pausing keeps the configuration (resume is one call), removing deletes it. Default to pause unless explicitly told to remove. 4. **Loop the pause calls in parallel** — they're independent, fire them all in one batch. 5. **After pausing, inspect each project's persistent state** (DB, swarm-state.json, manifests) — that's where progress lives, not in the cron output transcripts. 6. **Confirm via a final `cronjob action=list`** that every job shows `state: paused, enabled: false`. See `scripts/triage-cron-cost.sh` for a one-shot inventory + cost categorization. ## Post-pause inspection — where progress actually lives Cron output files (`~/.hermes/cron/output//*.md`) are just transcripts. The real state of long-running autonomous projects lives in their app DB or state file: - **Hermes Trader** (`/home/avalon/apps/hermes-trader`): SQLite `data/hermes-trader.sqlite`, tables `research_cycles` and `research_tasks`. Recent cycles + status counts tell you where the worker got to. Schema: cycles have `title, hypothesis, status, findings_json`; tasks have `strategy_key, status, result_metrics_json, last_output`. - **HD Wiki Swarm** (`/home/avalon/hd-wiki-swarm`): `state/swarm-state.json` (top-level keys: `status, phase_cursors, active_jobs, selected_package, next_recommended_step, coverage, coverage_summary, blockers, notes`) and `inventory/package-ledger.json` (status counts across all packages). The ledger's `status_counts` is the canonical "how far did we get" number. - **Wiki / git-backed projects**: `git log -n 5` + the last few commit messages tell you what was published. When the user asks "where did we get to," lead with the **persistent state** (X / Y packages complete, current selected_package, next_recommended_step), then the **most recent cron transcript** for context. Don't just paginate cron logs. See `references/hd-wiki-swarm-state-shape.md` and `references/hermes-trader-research-db.md` for the exact field shapes observed in production. ## Pitfalls - **Delegation toolset multiplies cost.** Every spawned subagent is another LLM call. If a 30m orchestrator spawns 3 subagents per tick, that's **144 reasoning calls/day**, not 48. Watch for `enabled_toolsets: [..., "delegation", ...]` on frequent crons — this is the single highest-risk pattern. - **Failure ≠ no cost.** A cron job that hits 429 and errors out still consumed tokens up to the failure point. Pausing earlier saves more than pausing after the limit is hit. - **`deliver=origin` on high-cadence jobs spams chat.** Use `deliver=local` for workers, `deliver=origin` only for digests. - **Don't conflate the script-only recorder with the LLM worker.** Hermes Trader has both — pausing only the LLM worker keeps the audit trail flowing without burning credits. If the user says "pause everything" they mean both; if they say "stop the agents" they mean only the LLM ones. - **Resuming all at once re-creates the burn.** When the user is ready to resume, ask which jobs they want back. Default to script-only first, then daily digests, then high-cadence agents last (and ideally with a tightened cadence or repeat cap). - **Don't add memory entries like "cron jobs burn tokens."** That's a self-imposed constraint that ages badly. The lesson belongs *here*, where it's contextual to the task class. ## Verification After any pause/resume operation, always run: ``` cronjob action=list ``` and confirm the `state` and `enabled` fields on every job match intent before reporting back to the user.