---
name: hermes-tenant-control-plane
description: Class-level patterns for Alex's two multi-tenant Hermes control-plane apps (Astral Hermes and Hermes Spawn). Use when modifying tenant provisioning, provider/auth flows (OpenAI API key, Codex OAuth, Anthropic API key, Anthropic OAuth, OpenRouter), admin/role-based auth, KB ingestion sessions, S3 storage quotas, or anything else that touches the shared 80% of these apps' architecture. Each repo has its own deploy + repo, but the same patterns apply.
version: 1.0.0
author: Hermes Agent + Alex
license: MIT
metadata:
  hermes:
    tags: [hermes, multi-tenant, control-plane, astral-hermes, hermes-spawn, provisioning, oauth, codex, anthropic, role-based-auth, tenant-management]
    related_skills: [vps-app-deployment, subagent-driven-development, requesting-code-review]
---

# Hermes Tenant Control Plane

Alex runs **two** multi-tenant Hermes control planes that share ~80% of their architecture. Treat them as siblings: the same patterns, pitfalls, and conventions apply to both.

## The Two Apps

| | Astral Hermes | Hermes Spawn |
|---|---|---|
| Scope | Vertical (astrology) | Horizontal (general-purpose Hermes) |
| App path | `/home/avalon/apps/astral-hermes-platform` | `/home/avalon/apps/hermes-spawn` |
| PM2 process | `astral-hermes-web` | `hermes-spawn` |
| Public URL | `astral.apps.poofc.com` | `spawn.apps.poofc.com` |
| Port (control plane) | varies (Express) | `4031` |
| Tenant root (worker) | `/srv/astral/tenants/` | `/home/avalon/hermes-spawn-tenants/` |
| Tenant container name | `astral-tenant-` | `spawn-tenant-` |
| Worker host | `avalon@5.78.199.26` | `avalon@5.78.199.26` |
| GitHub | `firemountain/astral-hermes-platform` | `firemountain/hermes-spawn` |
| Domain skill bundles | `astral-core` + `astral-hd` (versioned) | user picks skills à la carte |

Astral preceded Spawn architecturally; many Spawn patterns originated in Astral.
Pre-existing implementations of "new" features are common — **always grep first** before delegating or writing fresh code.

## Architectural conventions (apply to both apps)

### Tenant container layout

Each tenant has a hermes-home dir on the worker (`//hermes-home/`) containing:

- `config.yaml` — Hermes config (model provider, skills, toolsets, storage, etc.)
- `.env` — tenant secrets (API keys, Telegram tokens)
- `auth.json` — provider OAuth/credential pool (see "Auth shapes" below)
- `data/hermes/knowledge/` — tenant KB (Spawn; Astral has its own variant)
- `data/hermes/scripts/` — guard/util scripts (S3 quota guard etc.)

The container runs `astral-hermes-runner` (Astral) or `spawn-hermes-runner` (Spawn) image and is started/stopped via the control plane's SSH-exec layer.

### Provider system

Both apps use a `providerConfig({ provider, model })` function returning `{ envName, config, note? }`:

- **`openai`** (or `openai-api` in Spawn): `envName: OPENAI_API_KEY`, `provider: custom` w/ `base_url: https://api.openai.com/v1`, `api_mode: chat_completions`
- **`openai-codex`**: ChatGPT subscription OAuth. `envName: ''` (no env var). `provider: openai-codex`. Requires post-create device-auth.
- **`anthropic`**: `envName: ANTHROPIC_API_KEY`. `provider: anthropic`, `api_mode: anthropic_messages`, default `claude-sonnet-4-5`. Working in both apps as of 2026-05-22.
- **`anthropic-oauth`**: Claude Pro/Max subscription via PKCE OAuth. Astral has helpers built (`installAnthropicOauthCredential`, `exchangeAnthropicAuthorization`, `buildAnthropicAuthorizeUrl`, `generateAnthropicPkce`) but no completion Express route or UI flow yet. Spawn marks it `oauth-unavailable` with a clear "use API-key path for now" message pointing users to `hermes auth add anthropic --type oauth` inside the container.
- **`openrouter`**: `envName: OPENROUTER_API_KEY`. `provider: openrouter`.
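
The provider list above can be read as a single lookup function. The following is a minimal sketch of that contract, not the code from either repo: the layout of the `config` object, the `model` key inside it, the `note` text, the error on unknown providers, and the `needsPostCreateAuth` helper are all assumptions made for illustration. `anthropic-oauth` is left out because its behavior differs between the two apps.

```javascript
// Hypothetical sketch of providerConfig({ provider, model }) -> { envName, config, note? }.
// Values follow the provider list above; the shape of `config` is an assumption.
function providerConfig({ provider, model }) {
  switch (provider) {
    case "openai":
    case "openai-api": // Spawn's name for the same API-key path
      return {
        envName: "OPENAI_API_KEY",
        config: {
          provider: "custom",
          base_url: "https://api.openai.com/v1",
          api_mode: "chat_completions",
          model,
        },
      };
    case "openai-codex":
      return {
        envName: "", // no env var: credentials arrive via device-auth
        config: { provider: "openai-codex", model },
        note: "Requires post-create device-auth.",
      };
    case "anthropic":
      return {
        envName: "ANTHROPIC_API_KEY",
        config: {
          provider: "anthropic",
          api_mode: "anthropic_messages",
          model: model || "claude-sonnet-4-5",
        },
      };
    case "openrouter":
      return {
        envName: "OPENROUTER_API_KEY",
        config: { provider: "openrouter", model },
      };
    default:
      throw new Error(`Unsupported provider: ${provider}`);
  }
}

// Illustrative helper: an empty envName means the tenant needs an auth step
// after creation instead of an API key in .env.
function needsPostCreateAuth(provider) {
  return providerConfig({ provider }).envName === "";
}
```

Keying the post-create decision off `envName` keeps the wizard from hard-coding provider names, though the real apps may track this differently.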
When adding a new provider, update `providerConfig`, the `betaProviderMatrix` / `publicProviderOptions` (Astral: `src/provider-matrix.mjs`), the wizard UI, the allowed-providers gate in `/api/provision` (Spawn: `server.mjs` ~line 818, "allowed = new Set([...])"), and security regression tests.

### Auth shapes — the dual-write pattern

Hermes core resolves OAuth credentials from **two locations** in `auth.json`. Both must be written or some code paths (gateway, cron, CLI probes) silently fail.

For `openai-codex` (Codex device-auth):

```python
# credential_pool shape — used by `hermes auth list` and status probes
store['credential_pool']['openai-codex'] = [entry, ...]

# providers shape — used by gateway/cron/model calls at runtime
store['providers']['openai-codex'] = {
    'tokens': {'access_token': ..., 'refresh_token': ...},
    'last_refresh': ...,
    'auth_mode': 'chatgpt',
    'base_url': ...,
}
store['active_provider'] = 'openai-codex'
```

Spawn writes both shapes in `server.mjs` ~line 274 (search for the Python heredoc that writes `auth.json`). Astral writes both shapes via `installCodexCredential`. **If a tenant reports "No Codex credentials stored" while `hermes auth list` shows the credential exists, the runtime shape is missing — re-run device-auth completion or hand-patch.** See `references/codex-auth-shape-fix.md` for the codyguy/cody1 incident.

For Anthropic OAuth (Astral only, helpers built):

```python
# credential_pool shape
store['credential_pool']['anthropic'] = [entry, ...]
# Plus a separate .anthropic_oauth.json file at hermes-home root with
# {accessToken, refreshToken, expiresAt} — Hermes core reads this directly.
```

### Provisioning flow (both apps)

1. User submits wizard (account creation, tenant id, provider choice, optional Telegram bot, allowed user IDs).
2. Control plane validates invite (Astral: `INVITE_CODE`; Spawn: gated user registration).
3. SSH to worker, create `//hermes-home/`, write `config.yaml` + `.env`.
4. Install skill bundles or selected skills via the worker-side `astral` / `spawn` CLI helper.
5. (Spawn) Create dedicated Hetzner S3 bucket with quota.
6. Start the Docker container.
7. If provider needs post-create auth (Codex / Anthropic OAuth), surface the device-auth step.
8. Register entitlement in the ledger (`src/entitlement-ledger.mjs` in Astral; equivalent DB in Spawn).
9. If billing enabled (Astral Stripe), create checkout session.

### Skill bundles vs à la carte

Astral ships **versioned bundles** (`astral-core@0.2.0`, `astral-hd@0.2.0`) installed via `node bin/astral.mjs bundle install --tenant --bundle --version [--replace] [--restart]`. Use this for any persona-shaped tenant set: write the bundle once, install across many tenants, version-pin.

Spawn lets users pick individual skills. There is a known cross-pollination idea: bring versioned bundles to Spawn so users can install "the trading bundle" or "the writing bundle" as a unit. Not built yet.

When changing what bundles are installed by default, update `installProvisionBundles()` in Astral (`web/server.mjs` ~line 804) AND backfill existing tenants by running `astral bundle install` per tenant. See the bundle-v0.2.0 backfill session for the loop pattern.

## Role-based admin (Astral — landed 2026-05-22)

Astral previously used a shared `ASTRAL_ADMIN_TOKEN` bearer. Replaced with role-based auth driven by `ASTRAL_ADMIN_EMAILS` env var. Key pieces:

- **Ledger** (`src/entitlement-ledger.mjs`): accounts have a `role` field (`'admin' | 'user'`), default `'user'`. Methods: `setAccountRole(id, role)`, `safeAccount()` includes role. `upsertAccount` preserves existing role; new accounts default to `'user'`.
- **Server** (`web/server.mjs`): `ADMIN_EMAILS = new Set(...)` parsed at module load from `ASTRAL_ADMIN_EMAILS` (comma-separated, lowercased). `requireAdmin` middleware checks session cookie + `account.role === 'admin'` → 401 if logged out, 403 if logged in but not admin.
  The middleware NAME is preserved so all `/api/admin/*` routes are unchanged.
- **Promotion**: `applyAdminRoleFromEnv(account)` runs in `/api/auth/register` AND `/api/auth/login` so accounts that pre-date being added to the env list get promoted at next login.
- **UI**: `AdminApp` fetches `/api/auth/me` on mount. Logged out → redirect `/account?next=/admin`. Logged in but `role !== 'admin'` → "Not authorized" screen. Admin → normal UI with a **"View as user"** button. Clicking it sets `localStorage.astralAdminViewAsUser=true` and routes to `/account`. The `ReturnToAdminPill` component renders at the app root and shows a fixed top-right "← Return to admin" pill on `/account` and `/chat` whenever the flag is set AND `/api/auth/me` confirms admin role. Clicking clears the flag.
- **Deprecation**: `ASTRAL_ADMIN_TOKEN` env is still read so old deploys don't crash, but it grants nothing. The "Unlock admin" form, `astralAdminToken` localStorage, and Bearer header injection were fully removed.

To bootstrap on the live host: `ASTRAL_ADMIN_EMAILS=firemountain@gmail.com` in `.env`, then `pm2 restart astral-hermes-web --update-env`.

If extending the same pattern to Hermes Spawn, mirror the env-var-seeded role + `applyAdminRoleFromEnv()` approach.

## Cross-app cross-pollination map

Things one app does well that the other can adopt:

- **Astral → Spawn**: versioned skill bundles; polished image rendering (resvg + custom fonts via `svghanddraw`); post-provision capability smoke loop ("can this tenant call its KB? S3? Telegram bot?"); shared backend services pattern (Astral's `transit-list-demo` as a tenant-callable shared service).
- **Spawn → Astral**: per-tenant KB ingestion sessions (durable async jobs, transcript-style UI, interrupt button); per-tenant S3 buckets + quotas; gated user registration; the auth dual-write fix.

The admin-role pattern just landed in Astral; if Spawn is asked for the same, port it directly.
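
For a Spawn port, the core of the admin-role pattern fits in three pure functions. This is a hedged sketch rather than Astral's implementation: `parseAdminEmails` and `adminGateStatus` are hypothetical names, `applyAdminRoleFromEnv` takes the email set as a second argument here only to keep the example self-contained, and the promote-only behavior (never demoting) is an assumption. In the real server the promotion presumably persists through the ledger's `setAccountRole`.

```javascript
// Hypothetical sketch of env-seeded admin promotion plus the 401/403 decision.

// Parse a comma-separated env value into a lowercased Set, ignoring blanks.
function parseAdminEmails(raw) {
  return new Set(
    String(raw || "")
      .split(",")
      .map((email) => email.trim().toLowerCase())
      .filter(Boolean)
  );
}

// Promote an account whose email is on the list. Assumption: this never
// demotes, so removing an email from the env var does not revoke the role.
function applyAdminRoleFromEnv(account, adminEmails) {
  const email = String(account.email || "").toLowerCase();
  if (adminEmails.has(email) && account.role !== "admin") {
    return { ...account, role: "admin" };
  }
  return account;
}

// Status a requireAdmin-style middleware would send:
// 401 when logged out, 403 when logged in without the role, 200 otherwise.
function adminGateStatus(account) {
  if (!account) return 401;
  return account.role === "admin" ? 200 : 403;
}
```

Calling the promotion step on both register and login, as described above, is what lets an account created before its email was listed pick up the role at its next login.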
## Deploy & smoke (both apps, same shape)

```bash
# Astral
cd /home/avalon/apps/astral-hermes-platform/web && npm run build && cd .. && npm run test:security
git add -A && git commit -m "..." && git push origin main
pm2 restart astral-hermes-web --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://astral.apps.poofc.com/api/health  # expect 200

# Spawn
cd /home/avalon/apps/hermes-spawn && npm test && npm run build
git add -A && git commit -m "..." && git push origin main
pm2 restart hermes-spawn --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://spawn.apps.poofc.com/api/health  # expect 200
```

Always `pm2 restart --update-env` after any `.env` change, not just `pm2 restart`.

## Pitfalls

1. **Provisioning seems fine but chat fails with "quota exceeded" on Codex tenant**: Voice transcription fell back to the control-plane OpenAI key, which is quota-exhausted. Voice transcription is *not* automatically routed through the Codex subscription — it uses `ASTRAL_TRANSCRIPTION_OPENAI_KEY` / `VOICE_TOOLS_OPENAI_KEY`. If that quota is dry, transcription silently fails for ALL subscription-auth tenants. Astral now shows a friendly "Voice transcription is temporarily unavailable" instead of the raw OpenAI billing URL.
2. **Tenant says "No Codex credentials stored" but `hermes auth list` shows them**: The dual-write was incomplete. Re-run device-auth completion or hand-patch `providers['openai-codex'].tokens` + `active_provider` in `auth.json`. See `references/codex-auth-shape-fix.md`.
3. **Skill bundles "missing" after backfill**: The skills were installed but `terminal` was disabled in `platform_toolsets`, OR `web.backend: firecrawl` had no credits. The skills can load but cannot *execute* the API calls they wrap. Verify terminal is enabled AND the relevant API keys are populated. The 2026-05-17 backfill (commit `faa92e8` in Astral) re-enabled terminal for tenants and injected `ASTRAL_TENANT_FIRECRAWL_API_KEY`.
4. **Subagent timeout when adding a feature to one of these apps**: see `subagent-driven-development` skill's pre-flight discovery section. Both repos accumulate "almost finished" features that need enabling rather than rebuilding. Always grep first.
5. **Telegram bot prefills credentials**: Public onboarding/login UIs MUST NOT prefill remembered IDs, emails, or default passwords. Alex is security-conscious about this — explicit user-profile note.
6. **`mayaastral` tenant has a config shape difference**: discovered during 2026-05-17 backfill. If a backfill loop succeeds for most tenants but `mayaastral` shows missing `platform_toolsets.terminal`, that tenant needs a manual pass.
7. **Agent claims to write to KB but file never appears on host**: The `astral-tenant-kb` skill reads `ASTRAL_KB_ROOT` from the **process environment**, not the `.env` file. Hermes does NOT auto-load `/data/hermes/.env` into the child Python processes that skills spawn. Result: if `-e ASTRAL_KB_ROOT=/data/hermes/knowledge` is missing from `docker run`, the skill silently falls back to `~/.hermes/astral-kb-prototype/knowledge/` inside the writable container layer and the agent reports success while nothing persists to the mounted volume.
   - **Fix**: bake the env var into the `docker run` line in BOTH `web/server.mjs` (provisioning script ~line 1608) AND `bin/astral.mjs` (`bundle install --restart` path, line 178). Adding to `.env` is necessary but not sufficient; existing containers need `docker rm -f && docker run` (not just `docker restart`) because env can only be set at container creation.
   - **Verify**: `docker exec env | grep ASTRAL_KB_ROOT` → must show the path. Then write a probe file over SSH into the tenant's `knowledge/raw/.probe` and confirm the container sees it via `docker exec ls /data/hermes/knowledge/raw/`.
   - **Detection signal**: agent says "saved Kathleen Brown's chart to your knowledge base" but `find /srv/astral/tenants//hermes-home/knowledge/entities/people/` is empty.
     Same root cause every time. Commit `457b9aa` in Astral.
8. **Tenant has no `knowledge/` directory at all**: Older tenants (notably `mayaastral`, the `CHAT_DEFAULT_TENANT`) were provisioned before the KB scaffold became a default. The scaffold is created by `createKnowledgeScaffold({ tenantId, tenantName })` in `web/server.mjs`. Backfill via the admin endpoint:

   ```bash
   curl -X POST -b "$ADMIN_COOKIE" \
     https://astral.apps.poofc.com/api/admin/tenants//ensure-knowledge
   ```

   Or via the one-off helper at `scripts/scaffold-kb.mjs`. The endpoint uses `if not p.exists()` guards so it's idempotent and safe to re-run.
9. **Cross-tenant isolation is real but verify after changes**: Each tenant has its own `//hermes-home/` mounted at `/data/hermes` inside its dedicated container (`astral-tenant-`). No shared volumes. Containers run with `--cap-drop=ALL`, `--security-opt no-new-privileges:true`, and `--memory=1g`. If you ever rewrite container creation, verify isolation by: (a) writing a probe file in tenant A's KB on the host, (b) listing the same path from tenant B's container — must be empty. `requireChatTenantAccess` and `requireTenantOwner` middleware also enforce app-level gates.
10. **Pre-existing helpers exist for "new" features — search before delegating**: Recurring pattern in this codebase: Anthropic OAuth PKCE helpers (`installAnthropicOauthCredential`, `exchangeAnthropicAuthorization`, `buildAnthropicAuthorizeUrl`, `generateAnthropicPkce`) were already built but disabled via `enabled: false` in `src/provider-matrix.mjs`. Similarly, KB ingestion sessions existed in Spawn before they were surfaced. Always `grep -rn '' web/server.mjs src/` BEFORE delegating implementation work; the subagent timeouts in this session were directly caused by re-exploring existing code.
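
Pitfall 2 can be diagnosed mechanically before anyone hand-patches. The sketch below takes the already-parsed contents of a tenant's `auth.json` and reports which of the Codex shapes is absent. `checkCodexDualWrite` and its labels are hypothetical names, and treating both `access_token` and `refresh_token` as required is an assumption based on the shape shown in the dual-write section.

```javascript
// Hypothetical diagnostic for an incomplete Codex dual-write.
// `store` is the object obtained by JSON-parsing a tenant's auth.json.
function checkCodexDualWrite(store) {
  const provider = "openai-codex";
  const missing = [];

  // Shape read by `hermes auth list` and status probes.
  const pool = store?.credential_pool?.[provider];
  if (!Array.isArray(pool) || pool.length === 0) {
    missing.push("credential_pool");
  }

  // Shape read by gateway/cron/model calls at runtime.
  const tokens = store?.providers?.[provider]?.tokens;
  if (!tokens || !tokens.access_token || !tokens.refresh_token) {
    missing.push("providers.tokens");
  }

  // The runtime path also expects Codex to be the active provider.
  if (store?.active_provider !== provider) {
    missing.push("active_provider");
  }

  return { ok: missing.length === 0, missing };
}
```

A result listing `providers.tokens` and `active_provider` while `credential_pool` is present matches the "No Codex credentials stored" symptom described above. The check only makes sense for tenants configured for Codex, since other tenants legitimately have a different `active_provider`.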
## Wizard UX principles (Astral onboarding — Typeform-style, landed 2026-05-22, commit 36023bd)

Alex's explicit preference for any multi-step wizard in these apps:

- **One screen, one decision, one button.** A wizard step that ends with 7 parallel buttons ("Open chat", "Open Stripe", "Open account", "Provision another", "Open device auth"...) is a UX failure. Every screen has exactly ONE primary CTA. Secondary actions are de-emphasized text links or hidden.
- **Sequential, not dashboard.** Post-provisioning is NOT a success page with multiple actions — it's the next steps as additional wizard steps: `provisioning` (spinner) → `providerAuth` (conditional) → `billing` (conditional Stripe) → `ready` (brief confirmation + auto-redirect to /chat). The wizard never ends with a menu; it always proceeds to the next screen or to the actual product.
- **Skip steps silently when not needed.** API-key providers skip credential-OAuth step. Non-Stripe-configured deployments skip billing. The progress bar recomputes from rendered steps, not from a static count.
- **No raw stderr in user-facing UI.** Provisioning errors collapse to a friendly retry message; admins can expand a `<details>` for the actual stderr. If anyone sees `astral-tenant-hhfggg Up 3 seconds astral-hermes-runner:dev` in the UI, that is the bug.
- **Auto-derive what you can.** Tenant ID auto-slugifies from agent name — never make the user type both. Owner email/name fall back to the logged-in session account.
- **Admins are NOT exempt from sandbox Stripe.** Until the Stripe keys flip from sandbox to live, admins go through the same checkout flow as users. This is intentional: it validates the flow end-to-end before live keys turn on. Don't add an admin-bypass branch.
- **Stripe success/cancel URLs route back into the wizard.** `success_url: /onboarding?tenant=&checkout=ok`, `cancel_url: /onboarding?tenant=&checkout=cancel`. The wizard reads these query params on mount and either advances to `ready` step or stays on `billing`.
- **No Telegram in onboarding.** Telegram setup belongs in the tenant settings panel post-provision. A dismissable nudge banner on `/chat` and `/account` ("Want Telegram access? Set it up in settings →") deep-links to `/account/tenant/#telegram` for tenants without it configured. Telegram is fiddly (BotFather + numeric user IDs) and shouldn't gate first-chat.

## Tenant settings panel (Astral — landed 2026-05-22, commit 82887f3)

Route: `/account/tenant/`. Tabs: provider, telegram, danger zone.

Endpoints (all gated by `requireTenantOwner` which allows either the owning account OR any admin):

- `GET /api/tenant/:tenantId/settings` — masked snapshot (provider id + model + has-api-key + masked-fingerprint, telegram-configured boolean, entitlement status). DO NOT return raw token or user-ID list.
- `POST /api/tenant/:tenantId/provider` — change provider (writes new `config.yaml` + `.env`, restarts container). Returns `providerAuthRequired: true` if OAuth-based so UI prompts re-auth next.
- `POST /api/tenant/:tenantId/provider/key` — update API key only. Validates current provider is api-key-based.
- `POST /api/tenant/:tenantId/telegram` — write or replace bot config.
  Calls `lookupTelegramBotUsername(token)` to verify the token works before saving.
- `DELETE /api/tenant/:tenantId/telegram` — wipe the three `TELEGRAM_*` keys from .env, restart container.

Key helpers:

- `mergeEnvFile(existing, updates, removeKeys)` preserves comments and key ordering, deduplicates keys.
- `readTenantEnv(tenantId)` / `writeTenantEnvAndRestart(tenantId, newContent)` use base64 + python heredoc + atomic temp+rename for safe writes with 0600 perms.
- Container restart: `docker restart astral-tenant-` (faster than rm+run) UNLESS you're changing env vars — then full recreate.

OAuth re-auth from settings panel just calls the existing `/api/provider/device-auth` + `/api/provider/complete` (Codex) and `/api/provider/anthropic/start` + `/api/provider/anthropic/complete` (Anthropic) endpoints — they already work for existing tenants, not just new ones. `installCodexCredential` and `installAnthropicOauthCredential` overwrite the auth.json credential, which is the correct behavior for re-auth.

## Chat UI patterns (Astral — landed 2026-05-22, commits dfe72d1 + edcb711)

Alex's explicit preferences for the chat surface:

- **Assistant messages: full-width, no bubble.** ChatGPT-style. Plain text/markdown rendering with `white-space: pre-wrap`. CSS class targets: `.chat-bubble.assistant:not(.audio):not(.error):not(.typing)` gets `width:100%`, transparent background, no border/shadow, minimal padding.
- **User messages: keep the bubble.** Right-aligned, max-width ~75%. Audio messages also bubble (they have the player UI).
- **Error/system messages: subtle styled box.** Not full-width, not full bubble.
- **Voice lifecycle feedback (4 phases, not one)**: `uploading…` (pre-fetch, client) → `transcribing…` (server `transcribing` event) → `sent ✓` (server `transcribed` event) → `received ✓` (server `done` event). Drive this via SSE events, not artificial timers.
- **Tool-call streaming via SSE.** Both `/api/chat/message` and `/api/chat/voice` support SSE when `Accept: text/event-stream`. Server drops the `-Q` (quiet) flag on `hermes chat`, spawns via `sshStream()` (line-buffered stdout), parses verbose output via `parseHermesProgressLine()`, and emits events `transcribing | transcribed | start | tool_call | progress | text | done | error`. JSON contract preserved when Accept header is absent — backward compatible. Client shows progress events as small italic gray lines below the assistant response (Telegram-style activity log).
- **Sidebar layout on /chat**: Return-to-admin button lives INSIDE the chat sidebar as a header item for admins (not as a floating fixed pill). Floating `ReturnToAdminPill` self-hides on `/chat` to avoid double-rendering. Account + Settings buttons live next to each other in `.sidebar-action-links` for easy access.

## References

- `references/codex-auth-shape-fix.md` — the codyguy/cody1 "No Codex credentials stored" root cause and dual-write fix.
- `references/provider-config-cheatsheet.md` — quick reference for the `providerConfig()` shape per provider (envName, config block, post-create requirements).
- `references/admin-role-overhaul-astral.md` — file-by-file checklist of the role-based admin migration in Astral, useful when porting to Spawn.
- `references/kb-env-var-in-docker-run.md` — root cause and fix for "agent claims to save to KB but file never appears on host" (the `ASTRAL_KB_ROOT` `.env`-vs-`docker run` lesson).
- `references/typeform-wizard-principles.md` — the principles Alex explicitly approved for the Astral onboarding rewrite. Apply to any future wizard in either app.