/home/avalon/.hermes/skills/software-development/hermes-tenant-control-plane/SKILL.md

Hermes Tenant Control Plane

Alex runs two multi-tenant Hermes control planes that share ~80% of their architecture. Treat them as siblings: the same patterns, pitfalls, and conventions apply to both.

The Two Apps

| | Astral Hermes | Hermes Spawn |
|---|---|---|
| Scope | Vertical (astrology) | Horizontal (general-purpose Hermes) |
| App path | /home/avalon/apps/astral-hermes-platform | /home/avalon/apps/hermes-spawn |
| PM2 process | astral-hermes-web | hermes-spawn |
| Public URL | astral.apps.poofc.com | spawn.apps.poofc.com |
| Port (control plane) | varies (Express) | 4031 |
| Tenant root (worker) | /srv/astral/tenants/<id> | /home/avalon/hermes-spawn-tenants/<id> |
| Tenant container name | astral-tenant-<id> | spawn-tenant-<id> |
| Worker host | avalon@5.78.199.26 | avalon@5.78.199.26 |
| GitHub | firemountain/astral-hermes-platform | firemountain/hermes-spawn |
| Domain skill bundles | astral-core + astral-hd (versioned) | user picks skills à la carte |

Astral preceded Spawn architecturally; many Spawn patterns originated in Astral. Pre-existing implementations of "new" features are common — always grep first before delegating or writing fresh code.

Architectural conventions (apply to both apps)

Tenant container layout

Each tenant has a hermes-home dir on the worker (<TENANT_ROOT>/<id>/hermes-home/) containing:

The container runs astral-hermes-runner (Astral) or spawn-hermes-runner (Spawn) image and is started/stopped via the control plane's SSH-exec layer.

Provider system

Both apps use a providerConfig({ provider, model }) function returning { envName, config, note? }:

When adding a new provider, update providerConfig, the betaProviderMatrix / publicProviderOptions (Astral: src/provider-matrix.mjs), the wizard UI, the allowed-providers gate in /api/provision (Spawn: server.mjs ~line 818, "allowed = new Set([...])"), and security regression tests.
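The touchpoints listed above are easy to miss one at a time. The following is a minimal sketch of a checklist helper, written in Python to match the other snippets in this file. The function name, the touchpoint labels, and the idea of collecting provider names per touchpoint are assumptions for illustration; nothing like this exists in either repo as far as this document states.

```python
# Hypothetical checklist helper; the labels mirror the touchpoints named above.
TOUCHPOINTS = (
    "providerConfig",
    "provider matrix",
    "wizard UI",
    "allowed-providers gate",
    "security regression tests",
)


def missing_touchpoints(provider, registered):
    """Return the touchpoints that do not yet mention `provider`.

    `registered` maps a touchpoint label to the set of provider names
    found there (gathered by hand or by grep; gathering is not shown).
    """
    return [t for t in TOUCHPOINTS if provider not in registered.get(t, set())]
```

Calling `missing_touchpoints("openai-codex", registered)` with a dict that only covers `providerConfig` and `wizard UI` would return the other three labels, in the order listed.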

Auth shapes — the dual-write pattern

Hermes core resolves OAuth credentials from two locations in auth.json. Both must be written or some code paths (gateway, cron, CLI probes) silently fail.

For openai-codex (Codex device-auth):

```python
# credential_pool shape — used by `hermes auth list` and status probes
store['credential_pool']['openai-codex'] = [entry, ...]
# providers shape — used by gateway/cron/model calls at runtime
store['providers']['openai-codex'] = {
    'tokens': {'access_token': ..., 'refresh_token': ...},
    'last_refresh': ...,
    'auth_mode': 'chatgpt',
    'base_url': ...,
}
store['active_provider'] = 'openai-codex'
```

Spawn writes both shapes in server.mjs ~line 274 (search for the Python heredoc that writes auth.json). Astral writes both shapes via installCodexCredential. If a tenant reports "No Codex credentials stored" while hermes auth list shows the credential exists, the runtime shape is missing — re-run device-auth completion or hand-patch. See references/codex-auth-shape-fix.md for the codyguy/cody1 incident.
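To tell quickly which half of the dual-write is missing, a small diagnostic along these lines may help. It is a sketch that assumes auth.json has already been copied off the worker or is readable locally, and it checks only the keys named above. The function names are hypothetical.

```python
import json


def missing_codex_shapes(store):
    """Return a list of problems with the openai-codex entries in an auth.json dict."""
    problems = []

    # Shape 1: credential_pool, read by `hermes auth list` and status probes.
    pool = (store.get("credential_pool") or {}).get("openai-codex")
    if not pool:
        problems.append("credential_pool['openai-codex'] is missing or empty")

    # Shape 2: providers, read by gateway/cron/model calls at runtime.
    provider = (store.get("providers") or {}).get("openai-codex") or {}
    tokens = provider.get("tokens") or {}
    for key in ("access_token", "refresh_token"):
        if not tokens.get(key):
            problems.append(f"providers['openai-codex']['tokens']['{key}'] is missing")

    if store.get("active_provider") != "openai-codex":
        problems.append("active_provider is not 'openai-codex'")
    return problems


def check_auth_file(path):
    """Load auth.json from `path` and report which shapes are incomplete."""
    with open(path, encoding="utf-8") as fh:
        return missing_codex_shapes(json.load(fh))
```

An empty list means both shapes are present. A list that mentions only `providers[...]` and `active_provider` matches the "No Codex credentials stored" symptom described above.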

For Anthropic OAuth (Astral only, helpers built):

```python
# credential_pool shape
store['credential_pool']['anthropic'] = [entry, ...]
# Plus a separate .anthropic_oauth.json file at hermes-home root with
# {accessToken, refreshToken, expiresAt} — Hermes core reads this directly.
```

Provisioning flow (both apps)

  1. User submits wizard (account creation, tenant id, provider choice, optional Telegram bot, allowed user IDs).
  2. Control plane validates invite (Astral: INVITE_CODE; Spawn: gated user registration).
  3. SSH to worker, create <TENANT_ROOT>/<id>/hermes-home/, write config.yaml + .env.
  4. Install skill bundles or selected skills via the worker-side astral / spawn CLI helper.
  5. (Spawn) Create dedicated Hetzner S3 bucket with quota.
  6. Start the Docker container.
  7. If provider needs post-create auth (Codex / Anthropic OAuth), surface the device-auth step.
  8. Register entitlement in the ledger (src/entitlement-ledger.mjs in Astral; equivalent DB in Spawn).
  9. If billing enabled (Astral Stripe), create checkout session.
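The app-specific branches in the flow above (the S3 bucket for Spawn, Stripe for Astral, the optional device-auth step) can be made explicit as an ordered plan. This is an illustrative sketch only: the function name and step strings are invented here, and the real control planes run these steps in their own Node code.

```python
def provisioning_steps(app, needs_post_create_auth=False, billing_enabled=False):
    """Return the ordered server-side steps for `app` ('astral' or 'spawn').

    Step 1 (the user submitting the wizard) is omitted because it happens
    in the browser before the control plane does anything.
    """
    if app not in ("astral", "spawn"):
        raise ValueError(f"unknown app: {app!r}")

    steps = [
        "validate invite or registration",
        "ssh to worker: create hermes-home, write config.yaml and .env",
        "install skill bundles or selected skills",
    ]
    if app == "spawn":
        steps.append("create dedicated S3 bucket with quota")
    steps.append("start Docker container")
    if needs_post_create_auth:
        steps.append("surface device-auth step")
    steps.append("register entitlement in ledger")
    if app == "astral" and billing_enabled:
        steps.append("create Stripe checkout session")
    return steps
```

For example, `provisioning_steps("spawn", needs_post_create_auth=True)` includes the bucket step before the container start and the device-auth step after it.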

Skill bundles vs à la carte

Astral ships versioned bundles (astral-core@0.2.0, astral-hd@0.2.0) installed via node bin/astral.mjs bundle install --tenant <id> --bundle <name> --version <v> [--replace] [--restart]. Use this for any persona-shaped tenant set: write the bundle once, install across many tenants, version-pin.

Spawn lets users pick individual skills. There is a known cross-pollination idea: bring versioned bundles to Spawn so users can install "the trading bundle" or "the writing bundle" as a unit. Not built yet.

When changing what bundles are installed by default, update installProvisionBundles() in Astral (web/server.mjs ~line 804) AND backfill existing tenants by running astral bundle install per tenant. See the bundle-v0.2.0 backfill session for the loop pattern.
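A per-tenant backfill can be planned as a list of CLI invocations before anything is executed. The sketch below builds argv lists from the `bundle install` syntax quoted above. It is not the loop from the referenced backfill session, and the function names are made up. Restarting only after the last bundle for each tenant is a design choice made here, not something the source specifies.

```python
import shlex


def bundle_install_argv(tenant, bundle, version, replace=False, restart=False):
    """Build the argv for one `node bin/astral.mjs bundle install` call."""
    argv = [
        "node", "bin/astral.mjs", "bundle", "install",
        "--tenant", tenant,
        "--bundle", bundle,
        "--version", version,
    ]
    if replace:
        argv.append("--replace")
    if restart:
        argv.append("--restart")
    return argv


def backfill_plan(tenants, bundles, replace=False):
    """Return one argv per (tenant, bundle); `bundles` is [(name, version), ...].

    Only the last bundle for each tenant gets --restart, so each container
    is restarted once rather than once per bundle.
    """
    plan = []
    for tenant in tenants:
        for index, (name, version) in enumerate(bundles):
            is_last = index == len(bundles) - 1
            plan.append(
                bundle_install_argv(tenant, name, version, replace=replace, restart=is_last)
            )
    return plan


def as_shell_lines(plan):
    """Render the plan as shell-quoted command lines for a dry run."""
    return [shlex.join(argv) for argv in plan]
```

Printing `as_shell_lines(backfill_plan(tenant_ids, [("astral-core", "0.2.0"), ("astral-hd", "0.2.0")]))` gives a reviewable dry run. The tenant id list has to come from wherever the real tenant inventory lives.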

Role-based admin (Astral — landed 2026-05-22)

Astral previously used a shared ASTRAL_ADMIN_TOKEN bearer. It has been replaced with role-based auth driven by the ASTRAL_ADMIN_EMAILS env var.

Key pieces:

To bootstrap on the live host: set ASTRAL_ADMIN_EMAILS=firemountain@gmail.com in .env, then run pm2 restart astral-hermes-web --update-env.

If extending the same pattern to Hermes Spawn, mirror the env-var-seeded role + applyAdminRoleFromEnv() approach.
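The env-var-seeded role idea boils down to parsing a list of emails and comparing case-insensitively. The sketch below shows that logic in Python for illustration only; the real applyAdminRoleFromEnv() lives in the Node app and may differ. The comma-separated format and the role names "admin" and "user" are assumptions.

```python
def parse_admin_emails(raw):
    """Split an ASTRAL_ADMIN_EMAILS-style value into a normalized set.

    Assumption: comma-separated, compared case-insensitively, with
    surrounding whitespace ignored. An unset or empty value yields no admins.
    """
    return {part.strip().lower() for part in (raw or "").split(",") if part.strip()}


def role_for(email, raw_admin_emails):
    """Return 'admin' if `email` is listed, else 'user' (role names are illustrative)."""
    admins = parse_admin_emails(raw_admin_emails)
    return "admin" if email.strip().lower() in admins else "user"
```

Treating an empty or missing variable as "no admins" keeps the check fail-closed, which suits a change that replaced a shared bearer token.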

Cross-app cross-pollination map

Things one app does well that the other can adopt:

The admin-role pattern just landed in Astral; if Spawn is asked for the same, port it directly.

Deploy & smoke (both apps, same shape)

```bash
# Astral
cd /home/avalon/apps/astral-hermes-platform/web && npm run build && cd .. && npm run test:security
git add -A && git commit -m "..." && git push origin main
pm2 restart astral-hermes-web --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://astral.apps.poofc.com/api/health  # expect 200

# Spawn
cd /home/avalon/apps/hermes-spawn && npm test && npm run build
git add -A && git commit -m "..." && git push origin main
pm2 restart hermes-spawn --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://spawn.apps.poofc.com/api/health  # expect 200
```

Always pm2 restart --update-env after any .env change, not just pm2 restart.
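The two curl smoke checks can be folded into one script that checks both health endpoints. This is a minimal sketch using only the standard library. The function names are hypothetical, and the injectable `fetch` parameter exists purely so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request

HEALTH_URLS = {
    "astral": "https://astral.apps.poofc.com/api/health",
    "spawn": "https://spawn.apps.poofc.com/api/health",
}


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses raise HTTPError; the code is still meaningful.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def smoke(fetch=http_status):
    """Return {app: bool}, True when the health endpoint answered 200."""
    return {app: fetch(url) == 200 for app, url in HEALTH_URLS.items()}
```

Running `smoke()` after each `pm2 restart ... --update-env` gives a single pass/fail dict covering both apps.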

Pitfalls

  1. Provisioning seems fine but chat fails with "quota exceeded" on a Codex tenant: Voice transcription fell back to the control-plane OpenAI key, which is quota-exhausted. Voice transcription is not automatically routed through the Codex subscription; it uses ASTRAL_TRANSCRIPTION_OPENAI_KEY / VOICE_TOOLS_OPENAI_KEY. If that key's quota is exhausted, transcription silently fails for ALL subscription-auth tenants. Astral now shows a friendly "Voice transcription is temporarily unavailable" message instead of the raw OpenAI billing URL.

  2. Tenant says "No Codex credentials stored" but hermes auth list shows them: The dual-write was incomplete. Re-run device-auth completion or hand-patch providers['openai-codex'].tokens + active_provider in auth.json. See references/codex-auth-shape-fix.md.

  3. Skill bundles "missing" after backfill: The skills were installed, but either terminal was disabled in platform_toolsets or web.backend: firecrawl had no credits. The skills can load but cannot execute the API calls they wrap. Verify that terminal is enabled AND the relevant API keys are populated. The 2026-05-17 backfill (commit faa92e8 in Astral) re-enabled terminal for tenants and injected ASTRAL_TENANT_FIRECRAWL_API_KEY.

  4. Subagent timeout when adding a feature to one of these apps: see subagent-driven-development skill's pre-flight discovery section. Both repos accumulate "almost finished" features that need enabling rather than rebuilding. Always grep first.

  5. Telegram bot prefills credentials: Public onboarding/login UIs MUST NOT prefill remembered IDs, emails, or default passwords. Alex is security-conscious about this — explicit user-profile note.

  6. mayaastral tenant has a config shape difference: discovered during 2026-05-17 backfill. If a backfill loop succeeds for most tenants but mayaastral shows missing platform_toolsets.terminal, that tenant needs a manual pass.

  7. Agent claims to write to KB but file never appears on host: The astral-tenant-kb skill reads ASTRAL_KB_ROOT from the process environment, not the .env file. Hermes does NOT auto-load /data/hermes/.env into the child Python processes that skills spawn. Result: if -e ASTRAL_KB_ROOT=/data/hermes/knowledge is missing from docker run, the skill silently falls back to ~/.hermes/astral-kb-prototype/knowledge/ inside the writable container layer, and the agent reports success while nothing persists to the mounted volume.
     - Fix: bake the env var into the docker run line in BOTH web/server.mjs (provisioning script ~line 1608) AND bin/astral.mjs (bundle install --restart path, line 178). Adding it to .env is necessary but not sufficient; existing containers need docker rm -f && docker run (not just docker restart) because env can only be set at container creation.
     - Verify: docker exec <container> env | grep ASTRAL_KB_ROOT → must show the path. Then write a probe file over SSH into the tenant's knowledge/raw/.probe and confirm the container sees it via docker exec <container> ls /data/hermes/knowledge/raw/.
     - Detection signal: agent says "saved Kathleen Brown's chart to your knowledge base" but find /srv/astral/tenants/<id>/hermes-home/knowledge/entities/people/ is empty. Same root cause every time. Commit 457b9aa in Astral.

  8. Tenant has no knowledge/ directory at all: Older tenants (notably mayaastral, the CHAT_DEFAULT_TENANT) were provisioned before the KB scaffold became a default. The scaffold is created by createKnowledgeScaffold({ tenantId, tenantName }) in web/server.mjs. Backfill via the admin endpoint:

     ```bash
     curl -X POST -b "$ADMIN_COOKIE" \
       https://astral.apps.poofc.com/api/admin/tenants/<id>/ensure-knowledge
     ```

     Or via the one-off helper at scripts/scaffold-kb.mjs. The endpoint uses if not p.exists() guards so it's idempotent and safe to re-run.

  9. Cross-tenant isolation is real but verify after changes: Each tenant has its own <TENANT_ROOT>/<id>/hermes-home/ mounted at /data/hermes inside its dedicated container (astral-tenant-<id>). No shared volumes. Containers run with --cap-drop=ALL, --security-opt no-new-privileges:true, and --memory=1g. If you ever rewrite container creation, verify isolation by: (a) writing a probe file in tenant A's KB on the host, (b) listing the same path from tenant B's container — must be empty. requireChatTenantAccess and requireTenantOwner middleware also enforce app-level gates.

  10. Pre-existing helpers exist for "new" features, so search before delegating: Recurring pattern in this codebase: Anthropic OAuth PKCE helpers (installAnthropicOauthCredential, exchangeAnthropicAuthorization, buildAnthropicAuthorizeUrl, generateAnthropicPkce) were already built but disabled via enabled: false in src/provider-matrix.mjs. Similarly, KB ingestion sessions existed in Spawn before they were surfaced. Always grep -rn '<feature>' web/server.mjs src/ BEFORE delegating implementation work; the subagent timeouts in this session were directly caused by re-exploring existing code.
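Pitfalls 7 and 9 both come down to how the container was created: which env vars it received and which hardening options were passed. The sketch below checks both from plain text and argv input. It assumes the `=` spelling for --cap-drop and --memory shown above, and either the two-argument or `=` form for --security-opt. All function names are hypothetical.

```python
REQUIRED_FLAGS = ("--cap-drop=ALL", "--memory=1g")
REQUIRED_SECURITY_OPT = "no-new-privileges:true"


def parse_env_output(text):
    """Turn the output of `docker exec <container> env` into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            env[key] = value
    return env


def kb_root_ok(env_text, expected="/data/hermes/knowledge"):
    """True if ASTRAL_KB_ROOT is set to `expected` inside the container."""
    return parse_env_output(env_text).get("ASTRAL_KB_ROOT") == expected


def missing_hardening(argv):
    """Return the hardening options absent from a `docker run` argv list."""
    missing = [flag for flag in REQUIRED_FLAGS if flag not in argv]
    has_security_opt = any(
        arg == f"--security-opt={REQUIRED_SECURITY_OPT}"
        or (
            arg == "--security-opt"
            and i + 1 < len(argv)
            and argv[i + 1] == REQUIRED_SECURITY_OPT
        )
        for i, arg in enumerate(argv)
    )
    if not has_security_opt:
        missing.append(f"--security-opt {REQUIRED_SECURITY_OPT}")
    return missing
```

Feed `kb_root_ok` the captured output of the `docker exec <container> env` command from pitfall 7. Feed `missing_hardening` the argv that the provisioning code is about to run. Neither replaces the host-side probe-file test described in pitfall 9.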

Wizard UX principles (Astral onboarding — Typeform-style, landed 2026-05-22, commit 36023bd)

Alex's explicit preference for any multi-step wizard in these apps:

Tenant settings panel (Astral — landed 2026-05-22, commit 82887f3)

Route: /account/tenant/<tenantId>. Tabs: provider, telegram, danger zone.

Endpoints (all gated by requireTenantOwner which allows either the owning account OR any admin):

Key helpers:

OAuth re-auth from settings panel just calls the existing /api/provider/device-auth + /api/provider/complete (Codex) and /api/provider/anthropic/start + /api/provider/anthropic/complete (Anthropic) endpoints — they already work for existing tenants, not just new ones. installCodexCredential and installAnthropicOauthCredential overwrite the auth.json credential, which is the correct behavior for re-auth.
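The endpoint pairs named above can be captured in a small lookup so that a settings-panel client picks the right start and complete calls per provider. This is an illustrative sketch: the provider keys reuse the identifiers from the auth.json shapes, while the table and function name are assumptions.

```python
# (start endpoint, complete endpoint) per provider, as listed in the text above.
REAUTH_ENDPOINTS = {
    "openai-codex": ("/api/provider/device-auth", "/api/provider/complete"),
    "anthropic": ("/api/provider/anthropic/start", "/api/provider/anthropic/complete"),
}


def reauth_endpoints(provider):
    """Return the (start, complete) endpoint pair for an OAuth-style provider."""
    try:
        return REAUTH_ENDPOINTS[provider]
    except KeyError:
        raise ValueError(f"no re-auth flow known for provider {provider!r}") from None
```

Providers without an OAuth flow raise `ValueError`, so a caller cannot silently fall through to the wrong endpoint.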

Chat UI patterns (Astral — landed 2026-05-22, commits dfe72d1 + edcb711)

Alex's explicit preferences for the chat surface:

References