---
name: hermes-tenant-control-plane
description: Class-level patterns for Alex's two multi-tenant Hermes control-plane apps (Astral Hermes and Hermes Spawn). Use when modifying tenant provisioning, provider/auth flows (OpenAI API key, Codex OAuth, Anthropic API key, Anthropic OAuth, OpenRouter), admin/role-based auth, KB ingestion sessions, S3 storage quotas, or anything else that touches the shared 80% of these apps' architecture. Each repo has its own deploy + repo, but the same patterns apply.
version: 1.0.0
author: Hermes Agent + Alex
license: MIT
metadata:
  hermes:
    tags: [hermes, multi-tenant, control-plane, astral-hermes, hermes-spawn, provisioning, oauth, codex, anthropic, role-based-auth, tenant-management]
    related_skills: [vps-app-deployment, subagent-driven-development, requesting-code-review]
---

# Hermes Tenant Control Plane

Alex runs **two** multi-tenant Hermes control planes that share ~80% of their architecture. Treat them as siblings: the same patterns, pitfalls, and conventions apply to both.

## The Two Apps

| | Astral Hermes | Hermes Spawn |
|---|---|---|
| Scope | Vertical (astrology) | Horizontal (general-purpose Hermes) |
| App path | `/home/avalon/apps/astral-hermes-platform` | `/home/avalon/apps/hermes-spawn` |
| PM2 process | `astral-hermes-web` | `hermes-spawn` |
| Public URL | `astral.apps.poofc.com` | `spawn.apps.poofc.com` |
| Port (control plane) | varies (Express) | `4031` |
| Tenant root (worker) | `/srv/astral/tenants/<id>` | `/home/avalon/hermes-spawn-tenants/<id>` |
| Tenant container name | `astral-tenant-<id>` | `spawn-tenant-<id>` |
| Worker host | `avalon@5.78.199.26` | `avalon@5.78.199.26` |
| GitHub | `firemountain/astral-hermes-platform` | `firemountain/hermes-spawn` |
| Domain skill bundles | `astral-core` + `astral-hd` (versioned) | user picks skills à la carte |

Astral preceded Spawn architecturally; many Spawn patterns originated in Astral. Pre-existing implementations of "new" features are common — **always grep first** before delegating or writing fresh code.

## Architectural conventions (apply to both apps)

### Tenant container layout

Each tenant has a hermes-home dir on the worker (`<TENANT_ROOT>/<id>/hermes-home/`) containing:

- `config.yaml` — Hermes config (model provider, skills, toolsets, storage, etc.)
- `.env` — tenant secrets (API keys, Telegram tokens)
- `auth.json` — provider OAuth/credential pool (see "Auth shapes" below)
- `data/hermes/knowledge/` — tenant KB (Spawn; Astral has its own variant)
- `data/hermes/scripts/` — guard/util scripts (S3 quota guard etc.)

The container runs `astral-hermes-runner` (Astral) or `spawn-hermes-runner` (Spawn) image and is started/stopped via the control plane's SSH-exec layer.

### Provider system

Both apps use a `providerConfig({ provider, model })` function returning `{ envName, config, note? }`:

- **`openai`** (or `openai-api` in Spawn): `envName: OPENAI_API_KEY`, `provider: custom` w/ `base_url: https://api.openai.com/v1`, `api_mode: chat_completions`
- **`openai-codex`**: ChatGPT subscription OAuth. `envName: ''` (no env var). `provider: openai-codex`. Requires post-create device-auth.
- **`anthropic`**: `envName: ANTHROPIC_API_KEY`. `provider: anthropic`, `api_mode: anthropic_messages`, default `claude-sonnet-4-5`. Working in both apps as of 2026-05-22.
- **`anthropic-oauth`**: Claude Pro/Max subscription via PKCE OAuth. Astral has helpers built (`installAnthropicOauthCredential`, `exchangeAnthropicAuthorization`, `buildAnthropicAuthorizeUrl`, `generateAnthropicPkce`) but no completion Express route or UI flow yet. Spawn marks it `oauth-unavailable` with a clear "use API-key path for now" message pointing users to `hermes auth add anthropic --type oauth` inside the container.
- **`openrouter`**: `envName: OPENROUTER_API_KEY`. `provider: openrouter`.
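
The mapping above can be sketched in Python to make the return shape concrete. This is an illustrative rendering, not the apps' actual JavaScript: the function name `provider_config`, the `model` key inside `config`, and the error raised for unknown ids are assumptions. `anthropic-oauth` is left out because the flow is not wired end to end in either app.

```python
def provider_config(provider, model=None):
    """Return {envName, config, note?} for a provider id (hypothetical sketch)."""
    if provider in ("openai", "openai-api"):  # Spawn uses the "openai-api" id
        return {
            "envName": "OPENAI_API_KEY",
            "config": {
                "provider": "custom",
                "base_url": "https://api.openai.com/v1",
                "api_mode": "chat_completions",
                "model": model,  # assumption: model lives in the config block
            },
        }
    if provider == "openai-codex":
        return {
            "envName": "",  # subscription OAuth: no env var is written
            "config": {"provider": "openai-codex", "model": model},
            "note": "requires post-create device-auth",
        }
    if provider == "anthropic":
        return {
            "envName": "ANTHROPIC_API_KEY",
            "config": {
                "provider": "anthropic",
                "api_mode": "anthropic_messages",
                "model": model or "claude-sonnet-4-5",
            },
        }
    if provider == "openrouter":
        return {
            "envName": "OPENROUTER_API_KEY",
            "config": {"provider": "openrouter", "model": model},
        }
    raise ValueError(f"unsupported provider: {provider}")
```

An empty `envName` is the signal that the tenant needs a credential step after the container exists, rather than a key in `.env`.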

When adding a new provider, update `providerConfig`, the `betaProviderMatrix` / `publicProviderOptions` (Astral: `src/provider-matrix.mjs`), the wizard UI, the allowed-providers gate in `/api/provision` (Spawn: `server.mjs` ~line 818, "allowed = new Set([...])"), and security regression tests.

### Auth shapes — the dual-write pattern

Hermes core resolves OAuth credentials from **two locations** in `auth.json`. Both must be written or some code paths (gateway, cron, CLI probes) silently fail.

For `openai-codex` (Codex device-auth):

```python
# credential_pool shape — used by `hermes auth list` and status probes
store['credential_pool']['openai-codex'] = [entry, ...]
# providers shape — used by gateway/cron/model calls at runtime
store['providers']['openai-codex'] = {
    'tokens': {'access_token': ..., 'refresh_token': ...},
    'last_refresh': ...,
    'auth_mode': 'chatgpt',
    'base_url': ...,
}
store['active_provider'] = 'openai-codex'
```

Spawn writes both shapes in `server.mjs` ~line 274 (search for the Python heredoc that writes `auth.json`). Astral writes both shapes via `installCodexCredential`. **If a tenant reports "No Codex credentials stored" while `hermes auth list` shows the credential exists, the runtime shape is missing — re-run device-auth completion or hand-patch.** See `references/codex-auth-shape-fix.md` for the codyguy/cody1 incident.
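
When triaging that symptom, a small check over the parsed `auth.json` can show which shape is missing. The sketch below assumes the layout shown in the block above; the helper name `missing_codex_shapes` and the returned labels are hypothetical.

```python
def missing_codex_shapes(store):
    """List which openai-codex shapes are absent from a parsed auth.json dict."""
    missing = []
    pool = (store.get("credential_pool") or {}).get("openai-codex")
    if not pool:
        missing.append("credential_pool")
    runtime = (store.get("providers") or {}).get("openai-codex") or {}
    tokens = runtime.get("tokens") or {}
    if not (tokens.get("access_token") and tokens.get("refresh_token")):
        missing.append("providers.tokens")
    if store.get("active_provider") != "openai-codex":
        missing.append("active_provider")
    return missing
```

A result of `["providers.tokens", "active_provider"]` would match the case described above: the credential is listed, but the runtime shape was never written.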

For Anthropic OAuth (Astral only, helpers built):

```python
# credential_pool shape
store['credential_pool']['anthropic'] = [entry, ...]
# Plus a separate .anthropic_oauth.json file at hermes-home root with
# {accessToken, refreshToken, expiresAt} — Hermes core reads this directly.
```

### Provisioning flow (both apps)

1. User submits wizard (account creation, tenant id, provider choice, optional Telegram bot, allowed user IDs).
2. Control plane validates invite (Astral: `INVITE_CODE`; Spawn: gated user registration).
3. SSH to worker, create `<TENANT_ROOT>/<id>/hermes-home/`, write `config.yaml` + `.env`.
4. Install skill bundles or selected skills via the worker-side `astral` / `spawn` CLI helper.
5. (Spawn) Create dedicated Hetzner S3 bucket with quota.
6. Start the Docker container.
7. If provider needs post-create auth (Codex / Anthropic OAuth), surface the device-auth step.
8. Register entitlement in the ledger (`src/entitlement-ledger.mjs` in Astral; equivalent DB in Spawn).
9. If billing enabled (Astral Stripe), create checkout session.
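
The conditional steps in this flow can be summarized as a small planning function. This is only a sketch of the ordering: the step labels and the `provisioning_plan` name are hypothetical, and the real control planes run these steps over SSH rather than returning a list.

```python
OAUTH_PROVIDERS = {"openai-codex", "anthropic-oauth"}

def provisioning_plan(app, provider, billing_enabled=False):
    """Return the ordered server-side steps for one tenant (labels are illustrative)."""
    steps = ["validate_invite", "write_hermes_home", "install_skills"]
    if app == "spawn":
        steps.append("create_s3_bucket")  # step 5 applies to Spawn only
    steps.append("start_container")
    if provider in OAUTH_PROVIDERS:
        steps.append("provider_device_auth")  # surfaced after the container starts
    steps.append("register_entitlement")
    if app == "astral" and billing_enabled:
        steps.append("create_checkout_session")
    return steps
```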

### Skill bundles vs à la carte

Astral ships **versioned bundles** (`astral-core@0.2.0`, `astral-hd@0.2.0`) installed via `node bin/astral.mjs bundle install --tenant <id> --bundle <name> --version <v> [--replace] [--restart]`. Use this for any persona-shaped tenant set: write the bundle once, install across many tenants, version-pin.

Spawn lets users pick individual skills. There is a known cross-pollination idea: bring versioned bundles to Spawn so users can install "the trading bundle" or "the writing bundle" as a unit. Not built yet.

When changing what bundles are installed by default, update `installProvisionBundles()` in Astral (`web/server.mjs` ~line 804) AND backfill existing tenants by running `astral bundle install` per tenant. See the bundle-v0.2.0 backfill session for the loop pattern.
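
A per-tenant backfill loop could be sketched as follows. It only builds the command lines, using the flags documented above. The helper names are hypothetical, and the tenant list would come from wherever the real tenant inventory lives.

```python
def bundle_install_argv(tenant_id, bundle, version, replace=False, restart=False):
    """Build one `astral bundle install` command line as an argv list."""
    argv = ["node", "bin/astral.mjs", "bundle", "install",
            "--tenant", tenant_id, "--bundle", bundle, "--version", version]
    if replace:
        argv.append("--replace")
    if restart:
        argv.append("--restart")
    return argv

def backfill_commands(tenant_ids, bundles, **flags):
    """One command per (tenant, bundle) pair; bundles is a list of (name, version)."""
    return [bundle_install_argv(t, name, ver, **flags)
            for t in tenant_ids
            for name, ver in bundles]
```

Each argv list can then be passed to `subprocess.run(argv, check=True)` from the Astral app directory, stopping at the first tenant that fails so it can get a manual pass.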

## Role-based admin (Astral — landed 2026-05-22)

Astral previously used a shared `ASTRAL_ADMIN_TOKEN` bearer. That token has been replaced with role-based auth driven by the `ASTRAL_ADMIN_EMAILS` env var.

Key pieces:

- **Ledger** (`src/entitlement-ledger.mjs`): accounts have a `role` field (`'admin' | 'user'`), default `'user'`. Methods: `setAccountRole(id, role)`, `safeAccount()` includes role. `upsertAccount` preserves existing role; new accounts default to `'user'`.
- **Server** (`web/server.mjs`): `ADMIN_EMAILS = new Set(...)` parsed at module load from `ASTRAL_ADMIN_EMAILS` (comma-separated, lowercased). `requireAdmin` middleware checks session cookie + `account.role === 'admin'` → 401 if logged out, 403 if logged in but not admin. The middleware NAME is preserved so all `/api/admin/*` routes are unchanged.
- **Promotion**: `applyAdminRoleFromEnv(account)` runs in `/api/auth/register` AND `/api/auth/login` so accounts that pre-date being added to the env list get promoted at next login.
- **UI**: `AdminApp` fetches `/api/auth/me` on mount. Logged out → redirect `/account?next=/admin`. Logged in but `role !== 'admin'` → "Not authorized" screen. Admin → normal UI with a **"View as user"** button. Clicking it sets `localStorage.astralAdminViewAsUser=true` and routes to `/account`. The `ReturnToAdminPill` component renders at the app root and shows a fixed top-right "← Return to admin" pill on `/account` and `/chat` whenever the flag is set AND `/api/auth/me` confirms admin role. Clicking clears the flag.
- **Deprecation**: `ASTRAL_ADMIN_TOKEN` env is still read so old deploys don't crash, but it grants nothing. The "Unlock admin" form, `astralAdminToken` localStorage, and Bearer header injection were fully removed.
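
The env-seeded role logic can be sketched in a few lines. This is a Python illustration of the behavior described above, not the code in `web/server.mjs`; the function names are snake_case stand-ins, and the promote-only behavior (no demotion when an email leaves the list) is an assumption.

```python
def parse_admin_emails(raw):
    """Parse a comma-separated env value into a lowercased set."""
    return {e.strip().lower() for e in (raw or "").split(",") if e.strip()}

def apply_admin_role_from_env(account, admin_emails):
    """Promote the account if its email is listed (assumed to never demote)."""
    if (account.get("email") or "").lower() in admin_emails:
        account["role"] = "admin"
    return account

def require_admin_status(account):
    """HTTP status the gate would return: 401 logged out, 403 non-admin, 200 admin."""
    if account is None:
        return 401
    if account.get("role") != "admin":
        return 403
    return 200
```

Running the promotion at both register and login is what lets an account created before the env list was updated pick up the role on its next login.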

To bootstrap on the live host: set `ASTRAL_ADMIN_EMAILS=firemountain@gmail.com` in `.env`, then run `pm2 restart astral-hermes-web --update-env`.

If extending the same pattern to Hermes Spawn, mirror the env-var-seeded role + `applyAdminRoleFromEnv()` approach.

## Cross-app cross-pollination map

Things one app does well that the other can adopt:

- **Astral → Spawn**: versioned skill bundles; polished image rendering (resvg + custom fonts via `svghanddraw`); post-provision capability smoke loop ("can this tenant call its KB? S3? Telegram bot?"); shared backend services pattern (Astral's `transit-list-demo` as a tenant-callable shared service).
- **Spawn → Astral**: per-tenant KB ingestion sessions (durable async jobs, transcript-style UI, interrupt button); per-tenant S3 buckets + quotas; gated user registration; the auth dual-write fix.

The admin-role pattern just landed in Astral; if Spawn is asked for the same, port it directly.

## Deploy & smoke (both apps, same shape)

```bash
# Astral
cd /home/avalon/apps/astral-hermes-platform/web && npm run build && cd .. && npm run test:security
git add -A && git commit -m "..." && git push origin main
pm2 restart astral-hermes-web --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://astral.apps.poofc.com/api/health  # expect 200

# Spawn
cd /home/avalon/apps/hermes-spawn && npm test && npm run build
git add -A && git commit -m "..." && git push origin main
pm2 restart hermes-spawn --update-env
curl -s -o /dev/null -w "%{http_code}\n" https://spawn.apps.poofc.com/api/health  # expect 200
```

Always `pm2 restart --update-env` after any `.env` change, not just `pm2 restart`.

## Pitfalls

1. **Provisioning seems fine but chat fails with "quota exceeded" on a Codex tenant**: Voice transcription fell back to the control-plane OpenAI key, and that key's quota is exhausted. Voice transcription is *not* automatically routed through the Codex subscription; it uses `ASTRAL_TRANSCRIPTION_OPENAI_KEY` / `VOICE_TOOLS_OPENAI_KEY`. If that quota is dry, transcription silently fails for ALL subscription-auth tenants. Astral now shows a friendly "Voice transcription is temporarily unavailable" instead of the raw OpenAI billing URL.

2. **Tenant says "No Codex credentials stored" but `hermes auth list` shows them**: The dual-write was incomplete. Re-run device-auth completion or hand-patch `providers['openai-codex'].tokens` + `active_provider` in `auth.json`. See `references/codex-auth-shape-fix.md`.

3. **Skill bundles "missing" after backfill**: The skills were installed but `terminal` was disabled in `platform_toolsets`, OR `web.backend: firecrawl` had no credits. The skills can load but cannot *execute* the API calls they wrap. Verify terminal is enabled AND the relevant API keys are populated. The 2026-05-17 backfill (commit `faa92e8` in Astral) re-enabled terminal for tenants and injected `ASTRAL_TENANT_FIRECRAWL_API_KEY`.

4. **Subagent timeout when adding a feature to one of these apps**: see `subagent-driven-development` skill's pre-flight discovery section. Both repos accumulate "almost finished" features that need enabling rather than rebuilding. Always grep first.

5. **Telegram bot prefills credentials**: Public onboarding/login UIs MUST NOT prefill remembered IDs, emails, or default passwords. Alex is security-conscious about this — explicit user-profile note.

6. **`mayaastral` tenant has a config shape difference**: discovered during 2026-05-17 backfill. If a backfill loop succeeds for most tenants but `mayaastral` shows missing `platform_toolsets.terminal`, that tenant needs a manual pass.

7. **Agent claims to write to KB but file never appears on host**: The `astral-tenant-kb` skill reads `ASTRAL_KB_ROOT` from the **process environment**, not the `.env` file. Hermes does NOT auto-load `/data/hermes/.env` into the child Python processes that skills spawn. Result: if `-e ASTRAL_KB_ROOT=/data/hermes/knowledge` is missing from `docker run`, the skill silently falls back to `~/.hermes/astral-kb-prototype/knowledge/` inside the writable container layer and the agent reports success while nothing persists to the mounted volume.
   - **Fix**: bake the env var into the `docker run` line in BOTH `web/server.mjs` (provisioning script ~line 1608) AND `bin/astral.mjs` (`bundle install --restart` path, line 178). Adding to `.env` is necessary but not sufficient; existing containers need `docker rm -f && docker run` (not just `docker restart`) because env can only be set at container creation.
   - **Verify**: `docker exec <container> env | grep ASTRAL_KB_ROOT` → must show the path. Then ssh write a probe file into the tenant's `knowledge/raw/.probe` and confirm the container sees it via `docker exec <container> ls /data/hermes/knowledge/raw/`.
   - **Detection signal**: agent says "saved Kathleen Brown's chart to your knowledge base" but `find /srv/astral/tenants/<id>/hermes-home/knowledge/entities/people/` is empty. Same root cause every time. Commit `457b9aa` in Astral.

8. **Tenant has no `knowledge/` directory at all**: Older tenants (notably `mayaastral`, the `CHAT_DEFAULT_TENANT`) were provisioned before the KB scaffold became a default. The scaffold is created by `createKnowledgeScaffold({ tenantId, tenantName })` in `web/server.mjs`. Backfill via the admin endpoint:
   ```bash
   curl -X POST -b "$ADMIN_COOKIE" \
     https://astral.apps.poofc.com/api/admin/tenants/<id>/ensure-knowledge
   ```
   Or via the one-off helper at `scripts/scaffold-kb.mjs`. The endpoint uses `if not p.exists()` guards so it's idempotent and safe to re-run.

9. **Cross-tenant isolation is real but verify after changes**: Each tenant has its own `<TENANT_ROOT>/<id>/hermes-home/` mounted at `/data/hermes` inside its dedicated container (`astral-tenant-<id>`). No shared volumes. Containers run with `--cap-drop=ALL`, `--security-opt no-new-privileges:true`, and `--memory=1g`. If you ever rewrite container creation, verify isolation by: (a) writing a probe file in tenant A's KB on the host, (b) listing the same path from tenant B's container — must be empty. `requireChatTenantAccess` and `requireTenantOwner` middleware also enforce app-level gates.

10. **Pre-existing helpers exist for "new" features; search before delegating**: Recurring pattern in this codebase: Anthropic OAuth PKCE helpers (`installAnthropicOauthCredential`, `exchangeAnthropicAuthorization`, `buildAnthropicAuthorizeUrl`, `generateAnthropicPkce`) were already built but disabled via `enabled: false` in `src/provider-matrix.mjs`. Similarly, KB ingestion sessions existed in Spawn before they were surfaced. Always `grep -rn '<feature>' web/server.mjs src/` BEFORE delegating implementation work; the subagent timeouts in this session were directly caused by re-exploring existing code.
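
To make pitfall 7 concrete, here is a hypothetical sketch of how a skill that reads only the process environment ends up on the fallback path. It is not the source of `astral-tenant-kb`; the function name and the returned flag are assumptions, and only the variable name and the two paths come from the notes above.

```python
import os

FALLBACK_KB_ROOT = "~/.hermes/astral-kb-prototype/knowledge/"

def resolve_kb_root(environ=None):
    """Return (kb_root, used_fallback) from the process environment only."""
    env = os.environ if environ is None else environ
    root = env.get("ASTRAL_KB_ROOT")
    if root:
        return root, False
    # Nothing here reads a .env file, so a value that exists only in
    # /data/hermes/.env is invisible to this process.
    return FALLBACK_KB_ROOT, True
```

Because the fallback is a valid writable path inside the container layer, the write succeeds and nothing signals the problem; surfacing the `used_fallback` flag is one way to make the failure visible.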

## Wizard UX principles (Astral onboarding — Typeform-style, landed 2026-05-22, commit 36023bd)

Alex's explicit preference for any multi-step wizard in these apps:

- **One screen, one decision, one button.** A wizard step that ends with 7 parallel buttons ("Open chat", "Open Stripe", "Open account", "Provision another", "Open device auth"...) is a UX failure. Every screen has exactly ONE primary CTA. Secondary actions are de-emphasized text links or hidden.
- **Sequential, not dashboard.** Post-provisioning is NOT a success page with multiple actions — it's the next steps as additional wizard steps: `provisioning` (spinner) → `providerAuth` (conditional) → `billing` (conditional Stripe) → `ready` (brief confirmation + auto-redirect to /chat). The wizard never ends with a menu; it always proceeds to the next screen or to the actual product.
- **Skip steps silently when not needed.** API-key providers skip credential-OAuth step. Non-Stripe-configured deployments skip billing. The progress bar recomputes from rendered steps, not from a static count.
- **No raw stderr in user-facing UI.** Provisioning errors collapse to a friendly retry message; admins can expand a `<details>` for the actual stderr. If anyone sees `astral-tenant-hhfggg Up 3 seconds astral-hermes-runner:dev` in the UI, that is a bug.
- **Auto-derive what you can.** Tenant ID auto-slugifies from agent name — never make the user type both. Owner email/name fall back to the logged-in session account.
- **Admins are NOT exempt from sandbox Stripe.** Until the Stripe keys flip from sandbox to live, admins go through the same checkout flow as users. This is intentional: it validates the flow end-to-end before live keys turn on. Don't add an admin-bypass branch.
- **Stripe success/cancel URLs route back into the wizard.** `success_url: /onboarding?tenant=<id>&checkout=ok`, `cancel_url: /onboarding?tenant=<id>&checkout=cancel`. The wizard reads these query params on mount and either advances to `ready` step or stays on `billing`.
- **No Telegram in onboarding.** Telegram setup belongs in the tenant settings panel post-provision. A dismissable nudge banner on `/chat` and `/account` ("Want Telegram access? Set it up in settings →") deep-links to `/account/tenant/<id>#telegram` for tenants without it configured. Telegram is fiddly (BotFather + numeric user IDs) and shouldn't gate first-chat.
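
The "sequential, skip silently" rules above reduce to computing the step list first and deriving everything else from it. A minimal sketch, assuming the step ids named above and hypothetical function names:

```python
OAUTH_PROVIDERS = {"openai-codex", "anthropic-oauth"}

def post_provision_steps(provider, stripe_configured):
    """Steps actually rendered after the user submits the wizard."""
    steps = ["provisioning"]
    if provider in OAUTH_PROVIDERS:
        steps.append("providerAuth")  # API-key providers skip this
    if stripe_configured:
        steps.append("billing")       # skipped when Stripe is not configured
    steps.append("ready")
    return steps

def progress(steps, current):
    """Fraction complete, computed from the rendered steps, not a static count."""
    return (steps.index(current) + 1) / len(steps)

def step_after_checkout(checkout_param):
    """Map the Stripe return query param to the step the wizard resumes on."""
    return "ready" if checkout_param == "ok" else "billing"
```

With an API-key provider and no Stripe, the list collapses to `provisioning` then `ready`, and the progress bar follows without any special casing.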

## Tenant settings panel (Astral — landed 2026-05-22, commit 82887f3)

Route: `/account/tenant/<tenantId>`. Tabs: provider, telegram, danger zone.

Endpoints (all gated by `requireTenantOwner` which allows either the owning account OR any admin):

- `GET /api/tenant/:tenantId/settings` — masked snapshot (provider id + model + has-api-key + masked-fingerprint, telegram-configured boolean, entitlement status). DO NOT return raw token or user-ID list.
- `POST /api/tenant/:tenantId/provider` — change provider (writes new `config.yaml` + `.env`, restarts container). Returns `providerAuthRequired: true` if OAuth-based so UI prompts re-auth next.
- `POST /api/tenant/:tenantId/provider/key` — update API key only. Validates current provider is api-key-based.
- `POST /api/tenant/:tenantId/telegram` — write or replace bot config. Calls `lookupTelegramBotUsername(token)` to verify the token works before saving.
- `DELETE /api/tenant/:tenantId/telegram` — wipe the three `TELEGRAM_*` keys from .env, restart container.
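
For the masked `GET` snapshot, one way to expose "has a key" plus a fingerprint without leaking the key is a truncated one-way digest. The field names and the 8-character SHA-256 prefix below are assumptions for illustration; the real endpoint's response shape may differ.

```python
import hashlib

def settings_snapshot(provider, model, api_key, telegram_token, entitlement_status):
    """Build a response body that never contains raw secrets."""
    fingerprint = None
    if api_key:
        # Short one-way digest so the UI can tell keys apart without seeing them.
        fingerprint = hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:8]
    return {
        "provider": provider,
        "model": model,
        "hasApiKey": bool(api_key),
        "keyFingerprint": fingerprint,
        "telegramConfigured": bool(telegram_token),  # boolean only, no token or IDs
        "entitlementStatus": entitlement_status,
    }
```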

Key helpers:

- `mergeEnvFile(existing, updates, removeKeys)` preserves comments and key ordering, deduplicates keys.
- `readTenantEnv(tenantId)` / `writeTenantEnvAndRestart(tenantId, newContent)` use base64 + python heredoc + atomic temp+rename for safe writes with 0600 perms.
- Container restart: `docker restart astral-tenant-<id>` (faster than rm+run) UNLESS you're changing env vars — then full recreate.
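
The merge behavior described for `mergeEnvFile` can be sketched in Python. This is an illustration of the stated contract (preserve comments and ordering, deduplicate keys), not the real helper. Two choices here are assumptions: the first occurrence of a duplicated key keeps its position, and a removal wins if the same key also appears in the updates.

```python
def merge_env_file(existing, updates, remove_keys=()):
    """Merge KEY=VALUE updates into .env text, keeping comments and order."""
    removed = set(remove_keys)
    pending = {k: v for k, v in updates.items() if k not in removed}
    seen = set()
    out = []
    for line in existing.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            out.append(line)  # comments and blank lines pass through untouched
            continue
        key = stripped.split("=", 1)[0].strip()
        if key in removed or key in seen:
            continue  # drop removed keys and later duplicates
        seen.add(key)
        if key in pending:
            out.append(f"{key}={pending.pop(key)}")  # update in place
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in pending.items())  # new keys go last
    return "\n".join(out) + "\n"
```

Updating in place rather than appending keeps diffs of a tenant's `.env` small, which helps when comparing before and after a settings change.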

OAuth re-auth from settings panel just calls the existing `/api/provider/device-auth` + `/api/provider/complete` (Codex) and `/api/provider/anthropic/start` + `/api/provider/anthropic/complete` (Anthropic) endpoints — they already work for existing tenants, not just new ones. `installCodexCredential` and `installAnthropicOauthCredential` overwrite the auth.json credential, which is the correct behavior for re-auth.

## Chat UI patterns (Astral — landed 2026-05-22, commits dfe72d1 + edcb711)

Alex's explicit preferences for the chat surface:

- **Assistant messages: full-width, no bubble.** ChatGPT-style. Plain text/markdown rendering with `white-space: pre-wrap`. CSS class targets: `.chat-bubble.assistant:not(.audio):not(.error):not(.typing)` gets `width:100%`, transparent background, no border/shadow, minimal padding.
- **User messages: keep the bubble.** Right-aligned, max-width ~75%. Audio messages also bubble (they have the player UI).
- **Error/system messages: subtle styled box.** Not full-width, not full bubble.
- **Voice lifecycle feedback (4 phases, not one)**: `uploading…` (pre-fetch, client) → `transcribing…` (server `transcribing` event) → `sent ✓` (server `transcribed` event) → `received ✓` (server `done` event). Drive this via SSE events, not artificial timers.
- **Tool-call streaming via SSE.** Both `/api/chat/message` and `/api/chat/voice` support SSE when `Accept: text/event-stream`. Server drops the `-Q` (quiet) flag on `hermes chat`, spawns via `sshStream()` (line-buffered stdout), parses verbose output via `parseHermesProgressLine()`, and emits events `transcribing | transcribed | start | tool_call | progress | text | done | error`. JSON contract preserved when Accept header is absent — backward compatible. Client shows progress events as small italic gray lines below the assistant response (Telegram-style activity log).
- **Sidebar layout on /chat**: Return-to-admin button lives INSIDE the chat sidebar as a header item for admins (not as a floating fixed pill). Floating `ReturnToAdminPill` self-hides on `/chat` to avoid double-rendering. Account + Settings buttons live next to each other in `.sidebar-action-links` for easy access.
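
The four-phase voice feedback is a small state mapping driven by event names. A minimal sketch, assuming the client receives the event names listed above in order; the `voice_status` helper is hypothetical, and handling of `error` is left out.

```python
# Label shown before any server event arrives, then one label per server event.
VOICE_LABELS = {
    None: "uploading…",
    "transcribing": "transcribing…",
    "transcribed": "sent ✓",
    "done": "received ✓",
}

def voice_status(events):
    """Return the label to display after applying SSE event names in order."""
    label = VOICE_LABELS[None]
    for name in events:
        # Events such as start, tool_call, progress, and text do not change the label.
        if name in VOICE_LABELS:
            label = VOICE_LABELS[name]
    return label
```

Because the label changes only on server events, a slow transcription keeps showing `transcribing…` for as long as it actually takes, with no timers involved.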

## References

- `references/codex-auth-shape-fix.md` — the codyguy/cody1 "No Codex credentials stored" root cause and dual-write fix.
- `references/provider-config-cheatsheet.md` — quick reference for the `providerConfig()` shape per provider (envName, config block, post-create requirements).
- `references/admin-role-overhaul-astral.md` — file-by-file checklist of the role-based admin migration in Astral, useful when porting to Spawn.
- `references/kb-env-var-in-docker-run.md` — root cause and fix for "agent claims to save to KB but file never appears on host" (the `ASTRAL_KB_ROOT` `.env`-vs-`docker run` lesson).
- `references/typeform-wizard-principles.md` — the principles Alex explicitly approved for the Astral onboarding rewrite. Apply to any future wizard in either app.