/home/avalon/.hermes/skills/devops/cron-job-cost-design/SKILL.md

Cron Job Cost Design

Alex runs autonomous Hermes loops 24/7 (Hermes Trader research worker, HD Wiki Swarm orchestrator, etc.). These are great until they silently eat his LLM credits and he ends up forced onto a fallback model. This skill is the cost-aware design and triage playbook.

Core rule

The cost of a cron job is roughly: runs_per_day × tokens_per_run × delegation_fanout.

A job scheduled every 10m with the delegation toolset is not the same as a script that runs every 10m. Treat them as different categories.
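The rule of thumb can be turned into a quick back-of-envelope calculation. The sketch below is illustrative only: the function name and the 8,000-token figure are hypothetical, and `delegation_fanout` is treated as a plain multiplier (1 when the job does not delegate), which is an assumption rather than anything Hermes reports.

```python
def estimate_daily_tokens(interval_minutes: int, tokens_per_run: int,
                          delegation_fanout: int = 1) -> int:
    """Rough daily cost: runs_per_day * tokens_per_run * delegation_fanout."""
    runs_per_day = (24 * 60) // interval_minutes
    return runs_per_day * tokens_per_run * delegation_fanout


# Hypothetical figure: 8,000 tokens per agent run.
solo_agent = estimate_daily_tokens(10, 8_000)         # every 10m, no delegation
delegating = estimate_daily_tokens(10, 8_000, 3)      # every 10m, fan-out multiplier of 3
daily_digest = estimate_daily_tokens(24 * 60, 8_000)  # once a day

print(solo_agent, delegating, daily_digest)  # 1152000 3456000 8000
```

Even with made-up numbers, the gap between the daily digest and the delegating 10m job is what makes the two categories worth treating differently.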

Type	Token cost	Use for
`no_agent=True` script-only (watchdog, recorder, ingest scanner)	zero LLM tokens	Deterministic checks, file scanners, status snapshots, "anything new?" probes, audit-trail recorders
LLM agent, minimal toolsets, sparse cadence (daily / hourly)	low	Daily digests, end-of-day summaries
LLM agent, frequent cadence (every 10–30m), broad toolsets	medium–high	Reserve for genuinely reasoning-heavy work that cannot be scripted
LLM agent with delegation toolset at high cadence	🔥 highest — multiplies with every spawned subagent	Almost never. Use only with hard repeat caps and a credit budget

When in doubt, default to script-only and let the script enqueue a single LLM call only when its deterministic check finds something worth reasoning about.
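A minimal sketch of such a deterministic gate, assuming the check is "are there new files in a directory?". The function names, the state-file approach, and the message wording are all hypothetical; how the result is handed to an LLM call depends on the job setup and is not shown. The gate returns an empty string on a no-op tick, which fits the script-only convention that empty stdout stays silent.

```python
import json
from pathlib import Path


def find_new_files(watch_dir: Path, state_file: Path) -> list[str]:
    """Return names of files in watch_dir that were not seen on a previous run."""
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    current = {p.name for p in watch_dir.iterdir() if p.is_file()}
    new_files = sorted(current - seen)
    # Persist the full listing so the next tick only reports fresh arrivals.
    state_file.write_text(json.dumps(sorted(current)))
    return new_files


def run_gate(watch_dir: Path, state_file: Path) -> str:
    """Build the script's stdout: empty on a no-op tick, a summary otherwise."""
    new_files = find_new_files(watch_dir, state_file)
    if not new_files:
        return ""  # nothing changed: stay silent, spend no LLM tokens
    return f"{len(new_files)} new file(s) need review: " + ", ".join(new_files)
```

A script entry point would then be a single line such as `print(run_gate(Path("inbox"), Path("inbox.seen.json")), end="")`, with both paths being placeholders. Keep the state file outside the watched directory so it is not reported as a new file.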

When creating a new recurring job

Before you call cronjob action=create, ask yourself:

Can this be a script? If the work is "scan a directory / check a DB row / poll an endpoint / record a metric," the answer is yes. Use no_agent=True with script=.... Empty stdout = silent. Non-empty stdout = delivered verbatim.
What's the per-run token budget, and how often does it run? A reasoning agent on a 10m schedule fires 144 times a day; on a 30m schedule, 48; on a daily schedule, once. Pick the slowest cadence the use case tolerates.
Does it need delegation? If yes, set repeat to a finite number (repeat: 96 caps a 30m loop at 48h, etc.), or pair with a parallel no_agent "did anything actually change?" gate so the LLM doesn't fire on no-op ticks.
Set enabled_toolsets to the minimum that works. terminal, file, web is usually enough. Don't enable delegation unless the job genuinely needs to fan out.
Use deliver=local for high-cadence research workers so chat doesn't spam. Use deliver=origin only for digests/summaries the user actually reads.
Document the audit trail. Each run should leave a footprint somewhere durable (DB row, manifest, swarm-state.json) so a future session can reconstruct progress without re-reading every cron output file.
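The cadence and repeat figures in the checklist are simple arithmetic, sketched here so a cap can be derived for any window. The helper names are hypothetical, and schedules are assumed to be plain minute intervals rather than cron expressions.

```python
def runs_per_day(interval_minutes: int) -> int:
    """How many times a job fires in 24 hours at a fixed interval."""
    return (24 * 60) // interval_minutes


def repeat_cap(window_hours: int, interval_minutes: int) -> int:
    """Value for `repeat` that stops a loop after window_hours at this cadence."""
    return (window_hours * 60) // interval_minutes


for minutes in (10, 30, 24 * 60):
    print(f"every {minutes}m -> {runs_per_day(minutes)} runs/day")

print(repeat_cap(48, 30))  # 96: a 30m loop capped at 48h
```

Tightening the same 48h window to a 10m cadence would need `repeat_cap(48, 10)`, i.e. 288 runs, which is a useful sanity check before choosing a faster schedule.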

The kill-switch playbook (token burn emergency)

When the user notices they're forced onto a fallback model or sees 429s in cron output:

List jobs first: cronjob action=list. Never guess job_ids.
Categorize by cost: look at model, enabled_toolsets, schedule, and script field. Flag any LLM-agent job that runs every 30m or more often, especially ones with delegation in enabled_toolsets.
When the user says "stop them all," pause everything rather than removing it. Pausing keeps the configuration (resume is one call); removing deletes it. Only remove when explicitly told to.
Issue the pause calls in parallel: they're independent, so fire them all in one batch.
After pausing, inspect each project's persistent state (DB, swarm-state.json, manifests) — that's where progress lives, not in the cron output transcripts.
Confirm via a final cronjob action=list that every job shows state: paused, enabled: false.

See scripts/triage-cron-cost.sh for a one-shot inventory + cost categorization.
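Independently of that script, the categorization step can be sketched in a few lines. This is a hedged illustration, not the contents of triage-cron-cost.sh: it assumes each job is available as a dict with `job_id`, `no_agent`, `enabled_toolsets`, and a pre-parsed `interval_minutes` field. The last field in particular is hypothetical, since the real list output exposes a schedule that would need parsing first.

```python
def categorize(job: dict) -> str:
    """Map a job record to a cost tier, mirroring the table in this skill."""
    if job.get("no_agent"):
        return "zero"  # script-only: no LLM tokens
    # Missing interval is treated as daily (an assumption of this sketch).
    frequent = job.get("interval_minutes", 24 * 60) <= 30
    if frequent and "delegation" in job.get("enabled_toolsets", []):
        return "highest"
    if frequent:
        return "medium-high"
    return "low"


# Hypothetical job records for illustration.
jobs = [
    {"job_id": "recorder", "no_agent": True, "interval_minutes": 10},
    {"job_id": "digest", "enabled_toolsets": ["file"], "interval_minutes": 1440},
    {"job_id": "worker", "enabled_toolsets": ["terminal", "web"], "interval_minutes": 10},
    {"job_id": "orchestrator", "enabled_toolsets": ["file", "delegation"], "interval_minutes": 30},
]

flagged = [j["job_id"] for j in jobs if categorize(j) in ("medium-high", "highest")]
print(flagged)  # ['worker', 'orchestrator']
```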

Post-pause inspection — where progress actually lives

Cron output files (~/.hermes/cron/output/<job_id>/*.md) are just transcripts. The real state of long-running autonomous projects lives in their app DB or state file:

Hermes Trader (/home/avalon/apps/hermes-trader): SQLite data/hermes-trader.sqlite, tables research_cycles and research_tasks. Recent cycles + status counts tell you where the worker got to. Schema: cycles have title, hypothesis, status, findings_json; tasks have strategy_key, status, result_metrics_json, last_output.
HD Wiki Swarm (/home/avalon/hd-wiki-swarm): state/swarm-state.json (top-level keys: status, phase_cursors, active_jobs, selected_package, next_recommended_step, coverage, coverage_summary, blockers, notes) and inventory/package-ledger.json (status counts across all packages). The ledger's status_counts is the canonical "how far did we get" number.
Wiki / git-backed projects: git log -n 5 + the last few commit messages tell you what was published.
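For the Hermes Trader case, the "status counts" lookup can be sketched with the standard `sqlite3` module. The table names and the `status` column come from the schema notes above; the function name is hypothetical, and the status values themselves are not assumed.

```python
import sqlite3

KNOWN_TABLES = ("research_cycles", "research_tasks")


def status_counts(conn: sqlite3.Connection, table: str) -> dict[str, int]:
    """Count rows per status in one of the research tables."""
    if table not in KNOWN_TABLES:
        # Table names cannot be bound as SQL parameters, so whitelist them.
        raise ValueError(f"unexpected table: {table}")
    rows = conn.execute(
        f"SELECT status, COUNT(*) FROM {table} GROUP BY status"
    ).fetchall()
    return dict(rows)
```

Usage would look like `status_counts(sqlite3.connect(db_path), "research_tasks")`, where `db_path` points at data/hermes-trader.sqlite inside the app directory. The query only reads, so it is safe to run while the jobs are paused.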

When the user asks "where did we get to," lead with the persistent state (X / Y packages complete, current selected_package, next_recommended_step), then the most recent cron transcript for context. Don't just paginate cron logs.
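A small sketch of building that lead line for HD Wiki Swarm from the two JSON files. It assumes `status_counts` is a flat mapping of status name to count at the top level of the ledger, and that finished packages use a status called `complete`; both are assumptions, so check them against the real files before relying on the output.

```python
def summarize_progress(state: dict, ledger: dict, done_status: str = "complete") -> str:
    """One-line progress summary from swarm-state.json and package-ledger.json contents."""
    counts = ledger.get("status_counts", {})
    total = sum(counts.values())
    done = counts.get(done_status, 0)  # done_status name is an assumption
    return (
        f"{done} / {total} packages {done_status}; "
        f"selected_package={state.get('selected_package')}; "
        f"next_recommended_step={state.get('next_recommended_step')}"
    )
```

The dicts would come from `json.loads(Path(...).read_text())` on state/swarm-state.json and inventory/package-ledger.json respectively.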

See references/hd-wiki-swarm-state-shape.md and references/hermes-trader-research-db.md for the exact field shapes observed in production.

Pitfalls

Delegation toolset multiplies cost. Every spawned subagent is another LLM call. If a 30m orchestrator spawns 3 subagents per tick, that's 144 subagent calls/day on top of the 48 orchestrator runs, so 192 reasoning calls/day rather than 48. Watch for enabled_toolsets: [..., "delegation", ...] on frequent crons — this is the single highest-risk pattern.
Failure ≠ no cost. A cron job that hits 429 and errors out still consumed tokens up to the failure point. Pausing earlier saves more than pausing after the limit is hit.
deliver=origin on high-cadence jobs spams chat. Use deliver=local for workers, deliver=origin only for digests.
Don't conflate the script-only recorder with the LLM worker. Hermes Trader has both — pausing only the LLM worker keeps the audit trail flowing without burning credits. If the user says "pause everything" they mean both; if they say "stop the agents" they mean only the LLM ones.
Resuming all at once re-creates the burn. When the user is ready to resume, ask which jobs they want back. Default to script-only first, then daily digests, then high-cadence agents last (and ideally with a tightened cadence or repeat cap).
Don't add memory entries like "cron jobs burn tokens." That's a self-imposed constraint that ages badly. The lesson belongs here, where it's contextual to the task class.
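The staged resume order from the pitfalls above (script-only, then sparse digests, then high-cadence agents) can be expressed as a sort key. As before, the job dict shape and the `interval_minutes` field are hypothetical, and the 30m threshold is borrowed from the cost table rather than from any Hermes setting.

```python
def resume_priority(job: dict) -> int:
    """Lower number = safer to resume first."""
    if job.get("no_agent"):
        return 0  # script-only: no LLM tokens
    if job.get("interval_minutes", 24 * 60) > 30:
        return 1  # sparse LLM agents such as daily digests
    return 2      # high-cadence agents come back last


# Hypothetical job records for illustration.
paused_jobs = [
    {"job_id": "orchestrator", "interval_minutes": 30},
    {"job_id": "digest", "interval_minutes": 1440},
    {"job_id": "recorder", "no_agent": True, "interval_minutes": 10},
]

resume_order = [j["job_id"] for j in sorted(paused_jobs, key=resume_priority)]
print(resume_order)  # ['recorder', 'digest', 'orchestrator']
```

The ordering is only a proposal to put in front of the user; which jobs actually come back is still their call.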

Verification

After any pause/resume operation, always run:

cronjob action=list

and confirm the state and enabled fields on every job match intent before reporting back to the user.
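If the list output has been parsed into dicts, the comparison against intent can be mechanical. This is a sketch under that assumption: the `job_id`, `state`, and `enabled` field names follow the ones mentioned in this skill, while the function name and the dict shape are hypothetical.

```python
def find_mismatches(jobs: list[dict], intent: dict[str, dict]) -> list[str]:
    """Return job_ids whose listed fields differ from the intended values.

    A job that is missing from the listing also counts as a mismatch.
    """
    by_id = {job["job_id"]: job for job in jobs}
    mismatched = []
    for job_id, expected in intent.items():
        actual = by_id.get(job_id, {})
        if any(actual.get(field) != value for field, value in expected.items()):
            mismatched.append(job_id)
    return mismatched


# Hypothetical listing after a "pause everything" request.
listing = [
    {"job_id": "worker", "state": "paused", "enabled": False},
    {"job_id": "digest", "state": "paused", "enabled": True},
]
want_paused = {"state": "paused", "enabled": False}
intent = {"worker": want_paused, "digest": want_paused, "recorder": want_paused}

print(find_mismatches(listing, intent))  # ['digest', 'recorder']
```

An empty result is the signal that it is safe to report back; anything else names the jobs to re-check.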