hermes-tenant-control-plane

/home/avalon/.hermes/skills/software-development/hermes-tenant-control-plane/SKILL.md · raw

Hermes Tenant Control Plane

Alex runs two multi-tenant Hermes control planes that share ~80% of their architecture. Treat them as siblings: the same patterns, pitfalls, and conventions apply to both.

The Two Apps

	Astral Hermes	Hermes Spawn
Scope	Vertical (astrology)	Horizontal (general-purpose Hermes)
App path	`/home/avalon/apps/astral-hermes-platform`	`/home/avalon/apps/hermes-spawn`
PM2 process	`astral-hermes-web`	`hermes-spawn`
Public URL	`astral.apps.poofc.com`	`spawn.apps.poofc.com`
Port (control plane)	varies (Express)	`4031`
Tenant root (worker)	`/srv/astral/tenants/<id>`	`/home/avalon/hermes-spawn-tenants/<id>`
Tenant container name	`astral-tenant-<id>`	`spawn-tenant-<id>`
Worker host	`avalon@5.78.199.26`	`avalon@5.78.199.26`
GitHub	`firemountain/astral-hermes-platform`	`firemountain/hermes-spawn`
Domain skill bundles	`astral-core` + `astral-hd` (versioned)	user picks skills à la carte

Astral preceded Spawn architecturally; many Spawn patterns originated in Astral. Pre-existing implementations of "new" features are common — always grep first before delegating or writing fresh code.

Architectural conventions (apply to both apps)

Tenant container layout

Each tenant has a hermes-home dir on the worker (<TENANT_ROOT>/<id>/hermes-home/) containing:

config.yaml — Hermes config (model provider, skills, toolsets, storage, etc.)
.env — tenant secrets (API keys, Telegram tokens)
auth.json — provider OAuth/credential pool (see "Auth shapes" below)
data/hermes/knowledge/ — tenant KB (Spawn; Astral has its own variant)
data/hermes/scripts/ — guard/util scripts (S3 quota guard etc.)

The container runs astral-hermes-runner (Astral) or spawn-hermes-runner (Spawn) image and is started/stopped via the control plane's SSH-exec layer.

Provider system

Both apps use a providerConfig({ provider, model }) function returning { envName, config, note? }:

openai (or openai-api in Spawn): envName: OPENAI_API_KEY, provider: custom w/ base_url: https://api.openai.com/v1, api_mode: chat_completions
openai-codex: ChatGPT subscription OAuth. envName: '' (no env var). provider: openai-codex. Requires post-create device-auth.
anthropic: envName: ANTHROPIC_API_KEY. provider: anthropic, api_mode: anthropic_messages, default claude-sonnet-4-5. Working in both apps as of 2026-05-22.
anthropic-oauth: Claude Pro/Max subscription via PKCE OAuth. Astral has helpers built (installAnthropicOauthCredential, exchangeAnthropicAuthorization, buildAnthropicAuthorizeUrl, generateAnthropicPkce) but no completion Express route or UI flow yet. Spawn marks it oauth-unavailable with a clear "use API-key path for now" message pointing users to hermes auth add anthropic --type oauth inside the container.
openrouter: envName: OPENROUTER_API_KEY. provider: openrouter.

When adding a new provider, update providerConfig, the betaProviderMatrix / publicProviderOptions (Astral: src/provider-matrix.mjs), the wizard UI, the allowed-providers gate in /api/provision (Spawn: server.mjs ~line 818, "allowed = new Set([...])"), and security regression tests.

Auth shapes — the dual-write pattern

Hermes core resolves OAuth credentials from two locations in auth.json. Both must be written or some code paths (gateway, cron, CLI probes) silently fail.

For openai-codex (Codex device-auth):

# credential_pool shape — used by `hermes auth list` and status probes
store['credential_pool']['openai-codex'] = [entry, ...]
# providers shape — used by gateway/cron/model calls at runtime
store['providers']['openai-codex'] = {
    'tokens': {'access_token': ..., 'refresh_token': ...},
    'last_refresh': ...,
    'auth_mode': 'chatgpt',
    'base_url': ...,
}
store['active_provider'] = 'openai-codex'

Spawn writes both shapes in server.mjs ~line 274 (search for the Python heredoc that writes auth.json). Astral writes both shapes via installCodexCredential. If a tenant reports "No Codex credentials stored" while hermes auth list shows the credential exists, the runtime shape is missing — re-run device-auth completion or hand-patch. See references/codex-auth-shape-fix.md for the codyguy/cody1 incident.

For Anthropic OAuth (Astral only, helpers built):

# credential_pool shape
store['credential_pool']['anthropic'] = [entry, ...]
# Plus a separate .anthropic_oauth.json file at hermes-home root with
# {accessToken, refreshToken, expiresAt} — Hermes core reads this directly.

Provisioning flow (both apps)

User submits wizard (account creation, tenant id, provider choice, optional Telegram bot, allowed user IDs).
Control plane validates invite (Astral: INVITE_CODE; Spawn: gated user registration).
SSH to worker, create <TENANT_ROOT>/<id>/hermes-home/, write config.yaml + .env.
Install skill bundles or selected skills via the worker-side astral / spawn CLI helper.
(Spawn) Create dedicated Hetzner S3 bucket with quota.
Start the Docker container.
If provider needs post-create auth (Codex / Anthropic OAuth), surface the device-auth step.
Register entitlement in the ledger (src/entitlement-ledger.mjs in Astral; equivalent DB in Spawn).
If billing enabled (Astral Stripe), create checkout session.

Skill bundles vs à la carte

Astral ships versioned bundles (astral-core@0.2.0, astral-hd@0.2.0) installed via node bin/astral.mjs bundle install --tenant <id> --bundle <name> --version <v> [--replace] [--restart]. Use this for any persona-shaped tenant set: write the bundle once, install across many tenants, version-pin.

Spawn lets users pick individual skills. There is a known cross-pollination idea: bring versioned bundles to Spawn so users can install "the trading bundle" or "the writing bundle" as a unit. Not built yet.

When changing what bundles are installed by default, update installProvisionBundles() in Astral (web/server.mjs ~line 804) AND backfill existing tenants by running astral bundle install per tenant. See the bundle-v0.2.0 backfill session for the loop pattern.

Role-based admin (Astral — landed 2026-05-22)

Astral previously used a shared ASTRAL_ADMIN_TOKEN bearer. Replaced with role-based auth driven by ASTRAL_ADMIN_EMAILS env var.

Key pieces:

Ledger (src/entitlement-ledger.mjs): accounts have a role field ('admin' | 'user'), default 'user'. Methods: setAccountRole(id, role), safeAccount() includes role. upsertAccount preserves existing role; new accounts default to 'user'.
Server (web/server.mjs): ADMIN_EMAILS = new Set(...) parsed at module load from ASTRAL_ADMIN_EMAILS (comma-separated, lowercased). requireAdmin middleware checks session cookie + account.role === 'admin' → 401 if logged out, 403 if logged in but not admin. The middleware NAME is preserved so all /api/admin/* routes are unchanged.
Promotion: applyAdminRoleFromEnv(account) runs in /api/auth/register AND /api/auth/login so accounts that pre-date being added to the env list get promoted at next login.
UI: AdminApp fetches /api/auth/me on mount. Logged out → redirect /account?next=/admin. Logged in but role !== 'admin' → "Not authorized" screen. Admin → normal UI with a "View as user" button. Clicking it sets localStorage.astralAdminViewAsUser=true and routes to /account. The ReturnToAdminPill component renders at the app root and shows a fixed top-right "← Return to admin" pill on /account and /chat whenever the flag is set AND /api/auth/me confirms admin role. Clicking clears the flag.
Deprecation: ASTRAL_ADMIN_TOKEN env is still read so old deploys don't crash, but it grants nothing. The "Unlock admin" form, astralAdminToken localStorage, and Bearer header injection were fully removed.

To bootstrap on the live host: ASTRAL_ADMIN_EMAILS=firemountain@gmail.com in .env, then pm2 restart astral-hermes-web --update-env.

If extending the same pattern to Hermes Spawn, mirror the env-var-seeded role + applyAdminRoleFromEnv() approach.

Cross-app cross-pollination map

Things one app does well that the other can adopt:

Astral → Spawn: versioned skill bundles; polished image rendering (resvg + custom fonts via svghanddraw); post-provision capability smoke loop ("can this tenant call its KB? S3? Telegram bot?"); shared backend services pattern (Astral's transit-list-demo as a tenant-callable shared service).
Spawn → Astral: per-tenant KB ingestion sessions (durable async jobs, transcript-style UI, interrupt button); per-tenant S3 buckets + quotas; gated user registration; the auth dual-write fix.

The admin-role pattern just landed in Astral; if Spawn is asked for the same, port it directly.

Deploy & smoke (both apps, same shape)

# Astral
cd /home/avalon/apps/astral-hermes-platform/web && npm run build && cd .. && npm run test:security
git add -A && git commit -m "..." && git push origin main
pm2 restart astral-hermes-web --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://astral.apps.poofc.com/api/health  # expect 200

# Spawn
cd /home/avalon/apps/hermes-spawn && npm test && npm run build
git add -A && git commit -m "..." && git push origin main
pm2 restart hermes-spawn --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://spawn.apps.poofc.com/api/health  # expect 200

Always pm2 restart --update-env after any .env change, not just pm2 restart.

Pitfalls

Provisioning seems fine but chat fails with "quota exceeded" on Codex tenant: Voice transcription fell back to control-plane OpenAI key that's quota-exhausted. Voice transcription is not automatically routed through Codex subscription — it uses ASTRAL_TRANSCRIPTION_OPENAI_KEY / VOICE_TOOLS_OPENAI_KEY. If that quota is dry, transcription silently fails for ALL subscription-auth tenants. Astral now shows a friendly "Voice transcription is temporarily unavailable" instead of the raw OpenAI billing URL.
Tenant says "No Codex credentials stored" but hermes auth list shows them: The dual-write was incomplete. Re-run device-auth completion or hand-patch providers['openai-codex'].tokens + active_provider in auth.json. See references/codex-auth-shape-fix.md.
Skills bundles "missing" after backfill: The skills were installed but terminal was disabled in platform_toolsets, OR web.backend: firecrawl had no credits. The skills can load but cannot execute the API calls they wrap. Verify terminal is enabled AND the relevant API keys are populated. The 2026-05-17 backfill (commit faa92e8 in Astral) re-enabled terminal for tenants and injected ASTRAL_TENANT_FIRECRAWL_API_KEY.
Subagent timeout when adding a feature to one of these apps: see subagent-driven-development skill's pre-flight discovery section. Both repos accumulate "almost finished" features that need enabling rather than rebuilding. Always grep first.
Telegram bot prefills credentials: Public onboarding/login UIs MUST NOT prefill remembered IDs, emails, or default passwords. Alex is security-conscious about this — explicit user-profile note.
mayaastral tenant has a config shape difference: discovered during 2026-05-17 backfill. If a backfill loop succeeds for most tenants but mayaastral shows missing platform_toolsets.terminal, that tenant needs a manual pass.
Agent claims to write to KB but file never appears on host: The astral-tenant-kb skill reads ASTRAL_KB_ROOT from the process environment, not the .env file. Hermes does NOT auto-load /data/hermes/.env into the child Python processes that skills spawn. Result: if -e ASTRAL_KB_ROOT=/data/hermes/knowledge is missing from docker run, the skill silently falls back to ~/.hermes/astral-kb-prototype/knowledge/ inside the writable container layer and the agent reports success while nothing persists to the mounted volume. - Fix: bake the env var into the docker run line in BOTH web/server.mjs (provisioning script ~line 1608) AND bin/astral.mjs (bundle install --restart path, line 178). Adding to .env is necessary but not sufficient; existing containers need docker rm -f && docker run (not just docker restart) because env can only be set at container creation. - Verify: docker exec <container> env | grep ASTRAL_KB_ROOT → must show the path. Then ssh write a probe file into the tenant's knowledge/raw/.probe and confirm the container sees it via docker exec <container> ls /data/hermes/knowledge/raw/. - Detection signal: agent says \"saved Kathleen Brown's chart to your knowledge base\" but find /srv/astral/tenants/<id>/hermes-home/knowledge/entities/people/ is empty. Same root cause every time. Commit 457b9aa in Astral.
Tenant has no knowledge/ directory at all: Older tenants (notably mayaastral, the CHAT_DEFAULT_TENANT) were provisioned before the KB scaffold became a default. The scaffold is created by createKnowledgeScaffold({ tenantId, tenantName }) in web/server.mjs. Backfill via the admin endpoint: bash curl -X POST -b \"$ADMIN_COOKIE\" \\\n https://astral.apps.poofc.com/api/admin/tenants/<id>/ensure-knowledge\n\n Or via the one-off helper at scripts/scaffold-kb.mjs. The endpoint uses if not p.exists() guards so it's idempotent and safe to re-run.
Cross-tenant isolation is real but verify after changes: Each tenant has its own <TENANT_ROOT>/<id>/hermes-home/ mounted at /data/hermes inside its dedicated container (astral-tenant-<id>). No shared volumes. Containers run with --cap-drop=ALL, --security-opt no-new-privileges:true, and --memory=1g. If you ever rewrite container creation, verify isolation by: (a) writing a probe file in tenant A's KB on the host, (b) listing the same path from tenant B's container — must be empty. requireChatTenantAccess and requireTenantOwner middleware also enforce app-level gates.
Pre-existing helpers exist for \"new\" features — search before delegating: Recurring pattern in this codebase: Anthropic OAuth PKCE helpers (installAnthropicOauthCredential, exchangeAnthropicAuthorization, buildAnthropicAuthorizeUrl, generateAnthropicPkce) were already built but disabled via enabled: false in src/provider-matrix.mjs. Similarly, KB ingestion sessions existed in Spawn before they were surfaced. Always grep -rn '<feature>' web/server.mjs src/ BEFORE delegating implementation work; the subagent timeouts in this session were directly caused by re-exploring existing code.

Wizard UX principles (Astral onboarding — Typeform-style, landed 2026-05-22, commit 36023bd)

Alex's explicit preference for any multi-step wizard in these apps:

One screen, one decision, one button. A wizard step that ends with 7 parallel buttons (\"Open chat\", \"Open Stripe\", \"Open account\", \"Provision another\", \"Open device auth\"...) is a UX failure. Every screen has exactly ONE primary CTA. Secondary actions are de-emphasized text links or hidden.
Sequential, not dashboard. Post-provisioning is NOT a success page with multiple actions — it's the next steps as additional wizard steps: provisioning (spinner) → providerAuth (conditional) → billing (conditional Stripe) → ready (brief confirmation + auto-redirect to /chat). The wizard never ends with a menu; it always proceeds to the next screen or to the actual product.
Skip steps silently when not needed. API-key providers skip credential-OAuth step. Non-Stripe-configured deployments skip billing. The progress bar recomputes from rendered steps, not from a static count.
No raw stderr in user-facing UI. Provisioning errors collapse to a friendly retry message; admins can expand a <details> for the actual stderr. Anyone seeing astral-tenant-hhfggg Up 3 seconds astral-hermes-runner:dev in the UI is the bug.
Auto-derive what you can. Tenant ID auto-slugifies from agent name — never make the user type both. Owner email/name fall back to the logged-in session account.
Admins are NOT exempt from sandbox Stripe. Until the Stripe keys flip from sandbox to live, admins go through the same checkout flow as users. This is intentional: it validates the flow end-to-end before live keys turn on. Don't add an admin-bypass branch.
Stripe success/cancel URLs route back into the wizard. success_url: /onboarding?tenant=<id>&checkout=ok, cancel_url: /onboarding?tenant=<id>&checkout=cancel. The wizard reads these query params on mount and either advances to ready step or stays on billing.
No Telegram in onboarding. Telegram setup belongs in the tenant settings panel post-provision. A dismissable nudge banner on /chat and /account (\"Want Telegram access? Set it up in settings →\") deep-links to /account/tenant/<id>#telegram for tenants without it configured. Telegram is fiddly (BotFather + numeric user IDs) and shouldn't gate first-chat.

Tenant settings panel (Astral — landed 2026-05-22, commit 82887f3)

Route: /account/tenant/<tenantId>. Tabs: provider, telegram, danger zone.

Endpoints (all gated by requireTenantOwner which allows either the owning account OR any admin):

GET /api/tenant/:tenantId/settings — masked snapshot (provider id + model + has-api-key + masked-fingerprint, telegram-configured boolean, entitlement status). DO NOT return raw token or user-ID list.
POST /api/tenant/:tenantId/provider — change provider (writes new config.yaml + .env, restarts container). Returns providerAuthRequired: true if OAuth-based so UI prompts re-auth next.
POST /api/tenant/:tenantId/provider/key — update API key only. Validates current provider is api-key-based.
POST /api/tenant/:tenantId/telegram — write or replace bot config. Calls lookupTelegramBotUsername(token) to verify the token works before saving.
DELETE /api/tenant/:tenantId/telegram — wipe the three TELEGRAM_* keys from .env, restart container.

Key helpers:

mergeEnvFile(existing, updates, removeKeys) preserves comments and key ordering, deduplicates keys.
readTenantEnv(tenantId) / writeTenantEnvAndRestart(tenantId, newContent) use base64 + python heredoc + atomic temp+rename for safe writes with 0600 perms.
Container restart: docker restart astral-tenant-<id> (faster than rm+run) UNLESS you're changing env vars — then full recreate.

OAuth re-auth from settings panel just calls the existing /api/provider/device-auth + /api/provider/complete (Codex) and /api/provider/anthropic/start + /api/provider/anthropic/complete (Anthropic) endpoints — they already work for existing tenants, not just new ones. installCodexCredential and installAnthropicOauthCredential overwrite the auth.json credential, which is the correct behavior for re-auth.

Chat UI patterns (Astral — landed 2026-05-22, commits dfe72d1 + edcb711)

Alex's explicit preferences for the chat surface:

Assistant messages: full-width, no bubble. ChatGPT-style. Plain text/markdown rendering with white-space: pre-wrap. CSS class targets: .chat-bubble.assistant:not(.audio):not(.error):not(.typing) gets width:100%, transparent background, no border/shadow, minimal padding.
User messages: keep the bubble. Right-aligned, max-width ~75%. Audio messages also bubble (they have the player UI).
Error/system messages: subtle styled box. Not full-width, not full bubble.
Voice lifecycle feedback (4 phases, not one): uploading… (pre-fetch, client) → transcribing… (server transcribing event) → sent ✓ (server transcribed event) → received ✓ (server done event). Drive this via SSE events, not artificial timers.
Tool-call streaming via SSE. Both /api/chat/message and /api/chat/voice support SSE when Accept: text/event-stream. Server drops the -Q (quiet) flag on hermes chat, spawns via sshStream() (line-buffered stdout), parses verbose output via parseHermesProgressLine(), and emits events transcribing | transcribed | start | tool_call | progress | text | done | error. JSON contract preserved when Accept header is absent — backward compatible. Client shows progress events as small italic gray lines below the assistant response (Telegram-style activity log).
Sidebar layout on /chat: Return-to-admin button lives INSIDE the chat sidebar as a header item for admins (not as a floating fixed pill). Floating ReturnToAdminPill self-hides on /chat to avoid double-rendering. Account + Settings buttons live next to each other in .sidebar-action-links for easy access.

References

references/codex-auth-shape-fix.md — the codyguy/cody1 "No Codex credentials stored" root cause and dual-write fix.
references/provider-config-cheatsheet.md — quick reference for the providerConfig() shape per provider (envName, config block, post-create requirements).
references/admin-role-overhaul-astral.md — file-by-file checklist of the role-based admin migration in Astral, useful when porting to Spawn.
references/kb-env-var-in-docker-run.md — root cause and fix for "agent claims to save to KB but file never appears on host" (the ASTRAL_KB_ROOT .env-vs-docker run lesson).
references/typeform-wizard-principles.md — the principles Alex explicitly approved for the Astral onboarding rewrite. Apply to any future wizard in either app.