Cron Job Cost Design

Alex runs autonomous Hermes loops 24/7 (Hermes Trader research worker, HD Wiki Swarm orchestrator, etc.). These are great until they silently eat his LLM credits and he ends up forced onto a fallback model. This skill is the cost-aware design and triage playbook.

Core rule

The cost of a cron job is roughly: runs_per_day × tokens_per_run × delegation_fanout.

A job scheduled every 10m with the delegation toolset is not the same as a script that runs every 10m. Treat them as different categories.
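As a rough illustration of the core rule, the sketch below multiplies the three factors for a 10m schedule. The 20,000 tokens per run and the fan-out of 4 are made-up numbers chosen for the example, not measurements from any Hermes job.

```python
MINUTES_PER_DAY = 24 * 60


def runs_per_day(interval_minutes: int) -> int:
    """How many times a job fires per day at a fixed interval."""
    return MINUTES_PER_DAY // interval_minutes


def daily_tokens(interval_minutes: int, tokens_per_run: int,
                 delegation_fanout: int = 1) -> int:
    """runs_per_day x tokens_per_run x delegation_fanout."""
    return runs_per_day(interval_minutes) * tokens_per_run * delegation_fanout


# Same 10m cadence, three very different bills (token figures are hypothetical).
script_only = daily_tokens(10, tokens_per_run=0)             # 0
plain_agent = daily_tokens(10, tokens_per_run=20_000)        # 2,880,000
with_fanout = daily_tokens(10, tokens_per_run=20_000,
                           delegation_fanout=4)              # 11,520,000

print(script_only, plain_agent, with_fanout)
```

The cadence is identical in all three cases; only the second and third factors change the bill.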

| Type | Token cost | Use for |
| --- | --- | --- |
| no_agent=True script-only (watchdog, recorder, ingest scanner) | zero LLM tokens | Deterministic checks, file scanners, status snapshots, "anything new?" probes, audit-trail recorders |
| LLM agent, minimal toolsets, sparse cadence (daily / hourly) | low | Daily digests, end-of-day summaries |
| LLM agent, frequent cadence (every 10–30m), broad toolsets | medium–high | Reserve for genuinely reasoning-heavy work that cannot be scripted |
| LLM agent with delegation toolset at high cadence | 🔥 highest; multiplies with every spawned subagent | Almost never. Use only with hard repeat caps and a credit budget |

When in doubt, default to script-only and let the script enqueue a single LLM call only when its deterministic check finds something worth reasoning about.
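A minimal sketch of such a script-only gate might look like the following. It assumes the job watches a directory for recently modified files; the `inbox` path, the 10-minute look-back window, and the function names are all hypothetical. It stays silent on no-op ticks and prints one line only when something new appears. How a follow-up LLM call would then be enqueued is not shown, since that mechanism is not specified here.

```python
import time
from pathlib import Path


def find_new_files(inbox: Path, since_epoch: float) -> list[str]:
    """Return names of regular files in `inbox` modified after `since_epoch`."""
    if not inbox.is_dir():
        return []
    return sorted(
        p.name for p in inbox.iterdir()
        if p.is_file() and p.stat().st_mtime > since_epoch
    )


def report(new_files: list[str]) -> str:
    """Empty string on a no-op tick, otherwise a one-line summary."""
    if not new_files:
        return ""
    return f"{len(new_files)} new file(s): " + ", ".join(new_files)


def main(inbox: Path = Path("inbox"), window_seconds: int = 10 * 60) -> None:
    # Hypothetical defaults: look back one tick of a 10m schedule.
    message = report(find_new_files(inbox, time.time() - window_seconds))
    if message:
        print(message)  # anything printed is what gets delivered


if __name__ == "__main__":
    main()
```

Keeping the check and the message in separate functions makes the silent path easy to test without touching the scheduler.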

When creating a new recurring job

Before you call cronjob action=create, ask yourself:

  1. Can this be a script? If the work is "scan a directory / check a DB row / poll an endpoint / record a metric," the answer is yes. Use no_agent=True with script=.... Empty stdout = silent. Non-empty stdout = delivered verbatim.
  2. What's the per-run token budget, and how often is it spent? A reasoning agent scheduled every 10m makes 144 runs/day. At 30m it's 48. On a daily schedule it's 1. Pick the slowest cadence the use case tolerates.
  3. Does it need delegation? If yes, set repeat to a finite number (for example, repeat: 96 caps a 30m loop at 48h), or pair it with a parallel no_agent "did anything actually change?" gate so the LLM doesn't fire on no-op ticks.
  4. Set enabled_toolsets to the minimum that works. terminal, file, web is usually enough. Don't enable delegation unless the job genuinely needs to fan out.
  5. Use deliver=local for high-cadence research workers so they don't spam the chat. Use deliver=origin only for digests/summaries the user actually reads.
  6. Document the audit trail. Each run should leave a footprint somewhere durable (DB row, manifest, swarm-state.json) so a future session can reconstruct progress without re-reading every cron output file.
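The repeat cap in step 3 is simple arithmetic on the cadence from step 2. The helper below is an illustrative sketch (the name `repeat_cap` is hypothetical, not part of the cronjob tool) that converts a wall-clock budget into a run count.

```python
import math


def repeat_cap(window_hours: float, interval_minutes: float) -> int:
    """Number of runs needed to cover `window_hours` at the given cadence."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    return math.ceil(window_hours * 60 / interval_minutes)


print(repeat_cap(48, 30))  # 96: a 48h loop at a 30m cadence
print(repeat_cap(48, 10))  # 288: the same window at 10m costs three times the runs
```

Working backwards from the window keeps the cap tied to a time budget rather than an arbitrary number.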

The kill-switch playbook (token burn emergency)

When the user notices they're forced onto a fallback model or sees 429s in cron output:

  1. List jobs first: cronjob action=list. Never guess job_ids.
  2. Categorize by cost: look at the model, enabled_toolsets, schedule, and script fields. Flag any LLM-agent job at ≤30m cadence, especially ones with delegation in enabled_toolsets.
  3. Pause everything by default when the user says "stop them all." Pausing keeps the configuration (resume is one call); removing deletes it. Default to pause unless explicitly told to remove.
  4. Issue the pause calls in parallel: they're independent, so fire them all in one batch.
  5. After pausing, inspect each project's persistent state (DB, swarm-state.json, manifests) — that's where progress lives, not in the cron output transcripts.
  6. Confirm via a final cronjob action=list that every job shows state: paused, enabled: false.
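Step 2 can be sketched as a small classifier over the job list. The dictionary shape here is an assumption: the real output of cronjob action=list may differ, and `interval_minutes` is a hypothetical field standing in for whatever the schedule field actually contains. The tiers mirror the cost categories described in this skill.

```python
# Most expensive first, so sorting by this order surfaces pause candidates.
TIER_ORDER = ["highest", "medium-high", "low", "script-only"]


def cost_tier(job: dict) -> str:
    """Classify one job by script field, cadence, and delegation toolset."""
    if job.get("script"):
        return "script-only"
    # Assumed default: a job with no known interval is treated as daily.
    frequent = job.get("interval_minutes", 1440) <= 30
    if frequent and "delegation" in job.get("enabled_toolsets", []):
        return "highest"
    if frequent:
        return "medium-high"
    return "low"


def triage(jobs: list[dict]) -> list[dict]:
    """Return jobs ordered from most to least expensive tier."""
    return sorted(jobs, key=lambda job: TIER_ORDER.index(cost_tier(job)))


# Hypothetical inventory for illustration only.
jobs = [
    {"job_id": "digest", "interval_minutes": 1440, "enabled_toolsets": ["file"]},
    {"job_id": "watchdog", "interval_minutes": 10, "script": "watchdog.py"},
    {"job_id": "research", "interval_minutes": 10,
     "enabled_toolsets": ["terminal", "delegation"]},
    {"job_id": "scanner", "interval_minutes": 30,
     "enabled_toolsets": ["terminal", "web"]},
]

for job in triage(jobs):
    print(f"{cost_tier(job):12} {job['job_id']}")
```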

See scripts/triage-cron-cost.sh for a one-shot inventory + cost categorization.

Post-pause inspection — where progress actually lives

Cron output files (~/.hermes/cron/output/<job_id>/*.md) are just transcripts. The real state of long-running autonomous projects lives in their app DB or state file:

When the user asks "where did we get to," lead with the persistent state (X / Y packages complete, current selected_package, next_recommended_step), then the most recent cron transcript for context. Don't just paginate cron logs.

See references/hd-wiki-swarm-state-shape.md and references/hermes-trader-research-db.md for the exact field shapes observed in production.

Pitfalls

Verification

After any pause/resume operation, always run:

cronjob action=list

and confirm the state and enabled fields on every job match intent before reporting back to the user.
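One way to make that confirmation mechanical is sketched below. It assumes the list output has been parsed into dictionaries with job_id, state, and enabled keys; the parsing step and the "scheduled" state value are assumptions, since only state: paused is documented here.

```python
def find_mismatches(jobs: list[dict], want_state: str = "paused",
                    want_enabled: bool = False) -> list[str]:
    """Return the job_ids whose state or enabled flag differs from intent."""
    return [
        job["job_id"]
        for job in jobs
        if job.get("state") != want_state or job.get("enabled") != want_enabled
    ]


# Hypothetical listing after a "pause everything" batch.
listing = [
    {"job_id": "research", "state": "paused", "enabled": False},
    {"job_id": "digest", "state": "scheduled", "enabled": True},
]

leftover = find_mismatches(listing)
if leftover:
    print("still not paused:", ", ".join(leftover))
```

An empty result is the only outcome that should be reported back as "all paused".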