---
name: multi-provider-api-resilience
description: Build resilient API integrations with automatic provider fallback. Covers LLM providers (Anthropic→OpenRouter), TTS providers (ElevenLabs→OpenAI), and image/video gen APIs. Patterns for rate limits, quota exhaustion, auth errors, and idempotent retries.
version: 1.0.0
tags: [api, resilience, fallback, openrouter, elevenlabs, openai, anthropic, tts, llm]
metadata:
  hermes:
    tags: [api, resilience, fallback, openrouter, elevenlabs, openai, anthropic, tts, llm]
---

# Multi-Provider API Resilience

Patterns for building API integrations that automatically fall back between providers when the primary fails.

See `references/openai-subscription-vs-api-key-audio.md` for a concrete case where subscription/device auth powered text chat but voice transcription still used an API-key capability path and needed sanitized quota handling.

## When to Use

- App calls external APIs (LLM, TTS, image gen, video gen) that can fail
- User has multiple API keys for similar services
- Rate limits, quota exhaustion, or auth errors are likely
- Long-running pipelines where partial failure wastes money

## Pattern 1: LLM Fallback (Anthropic → OpenRouter)

### Key Pitfalls Discovered

1. **OpenRouter is NOT Anthropic-compatible** — it uses the OpenAI chat completions format (`/v1/chat/completions`), NOT Anthropic's messages API. You CANNOT just point the Anthropic SDK at OpenRouter's base URL — it will 404.

2. **OpenRouter model IDs differ from Anthropic** — e.g., `claude-sonnet-4-20250514` on Anthropic is `anthropic/claude-sonnet-4.6` on OpenRouter. Use the OpenRouter models endpoint to find correct IDs: `curl -s https://openrouter.ai/api/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data'] if 'sonnet' in m['id']]"`

3. **Pass through max_tokens** — If the OpenRouter fallback hardcodes `max_tokens: 4096` but the caller requested 8192, structured JSON responses get truncated. Always use `params.max_tokens || 8192` in the fallback.

4. **OpenRouter app headers matter** — Include `HTTP-Referer` and `X-Title` on OpenRouter requests. Some app deployments work without them during ad hoc curls but fail or lose attribution/routing in production. For Alex's VPS apps, set `HTTP-Referer` to the public app URL and `X-Title` to the app name.

5. **Fall back on ANY error, not just 429** — Anthropic/OpenAI/Venice keys can fail with 401 (invalid/expired), 402 (payment/credits), 429 (rate limit), 500 (server error), etc. If you have a fallback available, use it for all errors.

### Implementation (Node.js)

```javascript
import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

async function callClaude(params) {
  try {
    return await anthropic.messages.create(params)
  } catch (err) {
    if (process.env.OPENROUTER_API_KEY) {
      console.log(`Anthropic error (${err?.status}). Falling back to OpenRouter...`)
      return await callOpenRouter(params)
    }
    throw err
  }
}

// OpenRouter uses OpenAI format — must convert manually
async function callOpenRouter(params) {
  const messages = []
  if (params.system) messages.push({ role: 'system', content: params.system })
  for (const msg of params.messages) messages.push({ role: msg.role, content: msg.content })

  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
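    headers: {
      'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
      // App attribution headers (pitfall 4): replace with your public app URL and app name
      'HTTP-Referer': 'https://your-app.com',
      'X-Title': 'Your App Name',
    },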
    body: JSON.stringify({
      model: 'anthropic/claude-sonnet-4.6',  // OpenRouter model ID format
      messages,
      max_tokens: params.max_tokens || 8192,
    })
  })

  if (!res.ok) throw new Error(`OpenRouter error ${res.status}: ${await res.text()}`)
  const data = await res.json()
  // Normalize to Anthropic response shape
  return { content: [{ type: 'text', text: data.choices?.[0]?.message?.content || '' }] }
}
```
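
The implementation above hardcodes a single OpenRouter model ID. When callers request different Anthropic models, a lookup table can translate the ID before the fallback request is built. The following is a minimal sketch, not a definitive implementation: `toOpenRouterModel` and `buildOpenRouterBody` are hypothetical helper names, and the only mapping listed is the pair quoted in pitfall 2. Any further entries would need to be checked against the OpenRouter models endpoint.

```javascript
// Hypothetical lookup table: Anthropic model ID -> OpenRouter model ID.
// Only the pair quoted in pitfall 2 is listed; verify others against /api/v1/models.
const OPENROUTER_MODEL_IDS = new Map([
  ['claude-sonnet-4-20250514', 'anthropic/claude-sonnet-4.6'],
])
const DEFAULT_OPENROUTER_MODEL = 'anthropic/claude-sonnet-4.6'

function toOpenRouterModel(anthropicModel) {
  return OPENROUTER_MODEL_IDS.get(anthropicModel) ?? DEFAULT_OPENROUTER_MODEL
}

// Build the OpenRouter request body from Anthropic-style params,
// passing the caller's max_tokens through instead of hardcoding it.
function buildOpenRouterBody(params) {
  const messages = []
  if (params.system) messages.push({ role: 'system', content: params.system })
  for (const msg of params.messages) messages.push({ role: msg.role, content: msg.content })
  return {
    model: toOpenRouterModel(params.model),
    messages,
    max_tokens: params.max_tokens || 8192,
  }
}
```

Keeping the body construction in a pure function makes the conversion testable without a network call.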

## Pattern 2: TTS Fallback (ElevenLabs → OpenAI)

### Key Pitfalls

1. **ElevenLabs burns credits fast** — a 2-minute story can need 188+ credits. Check quota BEFORE starting a batch.
2. **ElevenLabs quota errors are 401 not 429** — the error is `{"status":"quota_exceeded"}` with HTTP 401, not a rate limit.
3. **Voice ID mapping** — when switching providers, map voice IDs between them (ElevenLabs voice IDs are opaque alphanumeric strings; OpenAI uses names like 'onyx', 'nova', 'echo').

### OpenAI TTS Voices (cheaper alternative)
| Voice | Character |
|-------|-----------|
| onyx | Deep authoritative male (narrator) |
| echo | Younger male |
| fable | British, older feel |
| nova | Young energetic female |
| shimmer | Mature female |
| alloy | Neutral, slightly older female |
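
The voice mapping and the 401 quota check from the pitfalls can be sketched as two small helpers. This is an illustration under assumptions: the ElevenLabs voice IDs are placeholders, the helper names are hypothetical, and the exact shape of the error body is assumed (the status is checked both at the top level and under a `detail` key).

```javascript
// Hypothetical mapping from ElevenLabs voice IDs to OpenAI voice names.
// The keys are placeholders; substitute the voice IDs your project uses.
const VOICE_MAP = new Map([
  ['ELEVENLABS_NARRATOR_VOICE_ID', 'onyx'],
  ['ELEVENLABS_HEROINE_VOICE_ID', 'nova'],
])
const DEFAULT_OPENAI_VOICE = 'onyx'

function toOpenAiVoice(elevenLabsVoiceId) {
  return VOICE_MAP.get(elevenLabsVoiceId) ?? DEFAULT_OPENAI_VOICE
}

// Treat HTTP 401 with status "quota_exceeded" as quota exhaustion, not bad auth.
// Assumption: the status sits at the top level of the body or under `detail`.
function isElevenLabsQuotaError(httpStatus, body) {
  const status = body?.status ?? body?.detail?.status
  return httpStatus === 401 && status === 'quota_exceeded'
}
```

Separating quota exhaustion from a genuinely invalid key lets the caller switch providers for the rest of the batch instead of retrying ElevenLabs on every segment.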

### Implementation
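```javascript
import fs from 'node:fs'

// `text` and `outputPath` are supplied by the caller
const res = await fetch('https://api.openai.com/v1/audio/speech', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'tts-1-hd',
    input: text,
    voice: 'onyx',  // or echo, fable, nova, shimmer, alloy
    response_format: 'mp3',
  })
})
// Fail loudly instead of writing an error payload to disk as if it were audio
if (!res.ok) throw new Error(`OpenAI TTS error ${res.status}: ${await res.text()}`)
const buffer = Buffer.from(await res.arrayBuffer())
fs.writeFileSync(outputPath, buffer)
```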

## Pattern 3: Streaming/SSE Provider Failures

For routes that stream LLM output to the browser (Server-Sent Events, chunked responses), do not rely on normal Express error middleware after headers have been sent. Once the response is streaming, an upstream provider failure can otherwise leave the UI blank or waiting forever.

Server pattern:
```javascript
try {
  await streamProviderOutput(req, res)
} catch (error) {
  if (res.headersSent) {
    const msg = (error.message || 'Generation failed').replace(/\n/g, ' ')
    res.write(`data: PROVIDER_ERROR:${msg}\n\n`)
    res.write('data: END_OF_ALL_MESSAGES\n\n')
    return res.end()
  }
  next(error)
}
```

Frontend pattern:
```javascript
const source = new EventSource(url)
source.onmessage = (event) => {
  if (event.data.startsWith('PROVIDER_ERROR:')) {
    showUserVisibleError(event.data.replace('PROVIDER_ERROR:', '').trim())
    source.close()
    return
  }
  if (event.data === 'END_OF_ALL_MESSAGES') source.close()
}
source.onerror = () => {
  showUserVisibleError('The stream disconnected. Please try again.')
  source.close()
}
```

Build stream URLs with `URLSearchParams`; raw interpolation breaks names/locations/prompts containing spaces, ampersands, or slashes.
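
A minimal sketch of that URL construction follows. The route and parameter names are hypothetical; only `URLSearchParams` itself is a standard API.

```javascript
// Hypothetical route and parameter names, for illustration only.
function buildStreamUrl(basePath, params) {
  const query = new URLSearchParams(params)
  return `${basePath}?${query.toString()}`
}

const url = buildStreamUrl('/api/stream', {
  name: 'Tom & Jerry',
  location: 'New York/Brooklyn',
  prompt: 'a quiet street at dusk',
})
// `url` can now be passed to `new EventSource(url)`; spaces, ampersands,
// and slashes in the values are percent-encoded rather than splitting the query.
```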

## Pattern 4: Idempotent Pipeline Segments

When a pipeline processes items sequentially and can fail mid-batch (e.g., generating audio for 40 segments), make it resumable:

```javascript
import fs from 'node:fs'

let skipped = 0
let generated = 0

for (const seg of segments) {
  // Skip if this segment already has valid output
  // (the size check rejects empty or truncated files)
  if (fs.existsSync(seg.outputPath) && fs.statSync(seg.outputPath).size > 1000) {
    console.log(`Skipping segment ${seg.id} (already exists)`)
    skipped++
    continue
  }
  // Generate only what's missing
  await generateSegment(seg)
  generated++
}
```

This prevents wasting money re-generating items that succeeded before the failure.

## Pattern 5: PM2 and dotenv

**Critical pitfall:** PM2 runs the app with the working directory of the shell that invoked `pm2 start`, NOT the directory that contains the script. If you run `pm2 start /path/to/server/index.js` from a different directory, `dotenv/config` will look for `.env` in that directory instead of the app's.

**Fix:** Always specify `--cwd`:
```bash
pm2 start /path/to/server/index.js --name app-name --cwd /path/to/app
```

Verify with: `pm2 show app-name | grep "exec cwd"`

## Pattern 6: Replicate SDK FileOutput

The Replicate Node.js SDK (`replicate.run()`) returns a `FileOutput` object, NOT a string URL. Calling `.url()` on it returns a `URL` object, NOT a string. SQLite and other storage that expects strings will throw: `SQLite3 can only bind numbers, strings, bigints, buffers, and null`.

```javascript
// Normalize a FileOutput (ReadableStream with a .url() method) to a plain string
function toUrlString(output) {
  if (typeof output === 'string') return output
  if (typeof output?.url === 'function') {
    const url = output.url()
    return typeof url === 'string' ? url : url.href || String(url)
  }
  return String(output)
}

const output = await replicate.run('black-forest-labs/flux-2-pro', { input })
const imageUrl = toUrlString(output)
```

**Always extract `.href`** from the URL object to get a plain string before storing in a database.

## Pattern 7: Background Processing with Progress Polling

For bulk API operations (e.g., generating 60 images), respond immediately and process in background. The frontend polls a progress endpoint:

```javascript
// Server: respond immediately, process in background
app.post('/api/projects/:id/generate-all', (req, res) => {
  // Assumes an `items` table with `project_id` and `result_url` columns
  const items = db.prepare('SELECT * FROM items WHERE project_id = ?').all(req.params.id)
  res.json({ message: 'Generating...', total: items.length })

  // Background IIFE: errors are caught per item so one failure doesn't stop the batch
  ;(async () => {
    for (const item of items) {
      try { await generateItem(item) }
      catch (err) { console.error(`Failed ${item.id}:`, err.message) }
    }
  })()
})

// Progress endpoint
app.get('/api/projects/:id/progress', (req, res) => {
  const done = db.prepare('SELECT COUNT(result_url) as done FROM items WHERE project_id = ?').get(req.params.id)
  res.json(done)
})
```

Frontend polls every 5 seconds during generation and stops when done === total.
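
The polling loop can be sketched as follows. `pollUntilDone` is a hypothetical helper and `fetchProgress` is a caller-supplied callback that resolves to the progress endpoint's JSON; the 5-second default matches the interval described above.

```javascript
// Poll `fetchProgress` until done reaches total. `fetchProgress` is a hypothetical
// callback that resolves to the progress endpoint's JSON, e.g. { done: 12 }.
async function pollUntilDone(fetchProgress, total, { intervalMs = 5000, onProgress = () => {} } = {}) {
  for (;;) {
    const { done } = await fetchProgress()
    onProgress(done, total)
    if (done >= total) return done
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}
```

In a browser, the callback might be `() => fetch('/api/projects/' + id + '/progress').then(r => r.json())`. Note that this sketch never terminates if an item fails permanently, so a real client would also need a timeout or a failed-item count.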

## Pattern 8: OpenAI/Codex subscription image generation with fallback

Some apps should use Alex's OpenAI/ChatGPT subscription auth for image generation/editing instead of an OpenAI platform API key. In Hermes Agent this is the `openai-codex` image-gen provider: it calls the ChatGPT/Codex backend with a normal chat model hosting the `image_generation` tool (`gpt-image-2`). This is capability-specific auth: do not assume `OPENAI_API_KEY` is available or desired.

Implementation rules:
- Distinguish provider/model IDs such as `openai-codex/gpt-image-2` from FAL model IDs.
- For image edits, pass reference images as `input_image` message content to the Codex Responses stream; for FAL fallbacks, upload local files and send `image_urls`/`image_url` in that model's schema.
- Keep the UI model picker separate from the default: default can be OpenAI/Codex while allowing FAL or high-quality GPT Image variants.
- On OpenAI/Codex failure, fall back to a known reference-capable image edit model (for Hermes Creative currently `fal-ai/nano-banana-2/edit`) and persist both `requested_model` and actual provider/model.
- Do not spend image-generation credits for smoke tests unless the user approves; verify auth/config/build/model catalogs first.

## Pattern 9: Image Generation with Multi-Reference Consistency

When generating sequential frames (first frame → last frame for video), pass the first frame as a reference image for the last frame to maintain visual consistency:

```javascript
const refImages = [...characterRefs, ...setRefs]

// For last frame, add first frame as highest-priority reference
if (frameType === 'last' && shot.first_frame_url) {
  refImages.unshift(shot.first_frame_url)
}

const imageUrl = await generateImageFlux(prompt, refImages, { aspect_ratio: '16:9' })
```

FLUX.2 Pro supports up to 8 reference images. Priority order matters — put the most important reference first.
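
Because the list is capped, it helps to order, de-duplicate, and trim references in one place so that low-priority images are the ones dropped. A minimal sketch, assuming a hypothetical `buildReferenceList` helper and the 8-image limit stated above:

```javascript
const MAX_REFERENCE_IMAGES = 8  // limit stated above for FLUX.2 Pro

// Order references by priority, drop empty values and duplicates, then cap the list.
function buildReferenceList(priorityRefs, otherRefs, max = MAX_REFERENCE_IMAGES) {
  const ordered = [...priorityRefs, ...otherRefs].filter(Boolean)
  return [...new Set(ordered)].slice(0, max)
}
```

For a last frame, the call might look like `buildReferenceList([shot.first_frame_url], [...characterRefs, ...setRefs])`, which keeps the first frame at the front even when the combined list is too long.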

## Pattern 10: Rate-Limited Parallel Queue (Video/GPU Workloads)

For expensive, slow API calls (video generation ~$0.07 each, ~45s), use an in-memory queue with configurable concurrency instead of sequential processing or unlimited parallelism:

```javascript
const MAX_CONCURRENT = 3
let activeJobs = 0
const queue = []

function enqueueJob(id) {
  if (queue.some(j => j.id === id)) return // no duplicates
  queue.push({ id })
  processQueue()
}

async function processQueue() {
  while (queue.length > 0 && activeJobs < MAX_CONCURRENT) {
    const job = queue.shift()
    activeJobs++
    processJob(job.id)
      .catch(err => console.error(`Failed ${job.id}:`, err.message))
      .finally(() => { activeJobs--; processQueue() })
  }
}
```

Key design points:
- **Skip complete items** — only queue pending/failed, never overwrite successful results
- **Mark status in DB** — pending → generating → complete/failed (UI polls this)
- **Fire-and-forget with finally** — each job runs independently, slot freed on completion or failure
- **No duplicates** — check queue before adding
- **Cost estimation in UI** — show `${pendingCount} × $0.07 = $X.XX` before starting
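
The cost line shown before starting can be produced by a small formatter. This is a sketch with hypothetical helper names; the $0.07 default is the per-clip figure quoted above and should be replaced with the provider's actual price.

```javascript
// Only pending or failed items count toward the estimate; completed ones are skipped.
function countPending(items) {
  return items.filter(item => item.status === 'pending' || item.status === 'failed').length
}

// Format a pre-run cost estimate for display in the UI.
function formatCostEstimate(pendingCount, unitCostUsd = 0.07) {
  const total = pendingCount * unitCostUsd
  return `${pendingCount} × $${unitCostUsd.toFixed(2)} = $${total.toFixed(2)}`
}
```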

## Pattern 11: LLM JSON Repair for Truncated Responses

When LLMs generate large structured JSON (especially via fallback providers that may have lower effective token limits), responses can get truncated mid-string. Add a repair function:

```javascript
function safeParseJSON(text) {
  try {
    return JSON.parse(text)
  } catch (e) {
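    console.log('JSON parse failed, attempting repair...')
    // Repair heuristic: cut back to the last complete object and close the enclosing
    // array. This only works when the top-level value is an array of objects.
    let fixed = text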
    const lastCloseBrace = fixed.lastIndexOf('}')
    if (lastCloseBrace > 0) {
      fixed = fixed.substring(0, lastCloseBrace + 1)
      if (!fixed.trim().endsWith(']')) fixed = fixed + ']'
    }
    try { return JSON.parse(fixed) }
    catch (e2) { throw new Error(`Invalid JSON from LLM: ${e.message}`) }
  }
}
```

Also: when an LLM needs to return timestamps that match audio segments, compute them from segment indices as a fallback. LLMs frequently omit or miscalculate timestamp fields even when instructed.
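
One way to compute that fallback is from cumulative segment durations. The sketch below makes several assumptions that are not in the original: each LLM item carries a `segment_index`, timestamps live in `start`/`end` fields measured in seconds, and `segmentDurations[i]` holds the measured length of audio segment `i`.

```javascript
// Fill in missing or invalid timestamps from known segment durations.
// Field names (`segment_index`, `start`, `end`) are hypothetical.
function fillTimestamps(items, segmentDurations) {
  // offsets[i] is the start time of segment i; offsets[i + 1] is its end time
  const offsets = [0]
  for (const d of segmentDurations) offsets.push(offsets[offsets.length - 1] + d)

  return items.map(item => {
    const i = item.segment_index
    const valid = Number.isFinite(item.start) && Number.isFinite(item.end) && item.end > item.start
    const indexOk = Number.isInteger(i) && i >= 0 && i < segmentDurations.length
    if (valid || !indexOk) return item  // keep good timestamps; leave unmappable items alone
    return { ...item, start: offsets[i], end: offsets[i + 1] }
  })
}
```

Items whose timestamps are already plausible are left untouched, so the computed values only replace what the model got wrong or left out.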

## Pattern 12: ffmpeg Assembly for AI-Generated Video Clips

AI video models (Wan 2.2, etc.) output clips with inconsistent formats. Normalize before concatenating:

```bash
# Normalize to 720p, 24fps, h264, no audio
ffmpeg -y -i input.mp4 \
  -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" \
  -r 24 -c:v libx264 -preset fast -crf 23 -an -movflags +faststart output.mp4

# Concat with list file
ffmpeg -y -f concat -safe 0 -i concat.txt -c:v libx264 -preset fast -crf 23 video_only.mp4

# Merge video + audio (use -shortest if durations don't match)
ffmpeg -y -i video_only.mp4 -i narration.mp3 -c:v copy -c:a aac -b:a 192k -shortest final.mp4
```
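
The concat step reads clip paths from `concat.txt`. A minimal sketch of generating that file's contents from Node is shown below; `buildConcatList` is a hypothetical helper, and it relies on the concat demuxer's `file '<path>'` line format, where a single quote inside a path is written as `'\''`.

```javascript
// Build the contents of concat.txt for ffmpeg's concat demuxer.
// Each line is `file '<path>'`; a single quote inside a path is written as '\''.
function buildConcatList(clipPaths) {
  return clipPaths
    .map(p => `file '${p.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n'
}
```

Write the returned string to `concat.txt` with the normalized clips listed in playback order.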

Cache downloaded clips locally so re-exports don't re-download from expiring Replicate URLs.

## Pattern 13: Image Generation Safety Filter Cascade

AI image generation models have multi-layer safety filtering that can block legitimate creative content (mythology, dramatic scenes, historical violence). Different providers have different filter strictness AND different filter architecture:

### The Three Layers (discovered empirically)
1. **Provider-level output filter** (e.g., Replicate) — runs AFTER image generation, classifies the output image. Cannot be bypassed by prompt changes. Character names from mythology (Zahhak, Ahriman) can cause the MODEL to generate imagery that triggers this even with perfectly clean prompts.
2. **Provider-level input filter** (e.g., fal.ai) — runs BEFORE generation on the prompt text. `enable_safety_checker: false` only disables the OUTPUT filter. The INPUT filter is separate and cannot be disabled. Blocks certain name+context combinations.
3. **Model-level generation** — some models (Qwen) have no safety filtering at all on fal.ai.

### Implementation: Three-Tier Cascade
```javascript
async function generateWithFallback(prompt, refs, options) {
  // Tier 1: Replicate (fast, cheap) — output filter can block
  try {
    return await generateImageFlux(prompt, refs, { safety_tolerance: 5, ...options })
  } catch (err) {
    if (!isSafetyError(err)) throw err
  }

  // Tier 2: fal.ai FLUX (output filter disabled) — input filter can still block
  try {
    return await generateImageFluxFal(prompt, refs, options)
    // Uses enable_safety_checker: false, safety_tolerance: 5
  } catch (err) {
    if (!isSafetyError(err)) throw err
  }

  // Tier 3: Qwen on fal.ai (no safety filter at all) — last resort
  return await generateImageQwenFal(prompt, options)
  // fal-ai/qwen-image-2/text-to-image — no safety params needed
}

function isSafetyError(err) {
  const msg = err?.message || ''
  return msg.includes('flagged as sensitive') || msg.includes('safety') ||
         msg.includes('NSFW') || msg.includes('content_policy_violation')
}
```

### Key Lessons
- **Never retry the same provider** — if Replicate's output filter blocks a prompt at max safety_tolerance, it will block every retry. Switch providers.
- **Prompt rewriting helps but isn't sufficient** — the output classifier flags the generated IMAGE, not the prompt. A perfectly clean prompt about "a king in ceremonial robes" can still produce an image the classifier rejects.
- **Log which provider succeeded** — essential for debugging. "⚡ FLUX blocked, trying Qwen..." tells you the filter pattern.
- **Save prompts BEFORE generation** — if you only save the prompt on success, retry loops can't find failed items to rewrite. This was a critical bug: retry queried `WHERE prompt IS NOT NULL` but prompt was only saved in the same UPDATE as the result URL.
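
The last lesson can be sketched as a write-then-generate sequence. `store` here is a hypothetical persistence layer with an `update(id, fields)` method standing in for a database UPDATE, and the status values are illustrative.

```javascript
// `store` is a hypothetical persistence layer; `generate` is the provider call.
async function generateAndRecord(store, item, prompt, generate) {
  // Persist the prompt first so a failed item can still be found and rewritten later.
  store.update(item.id, { prompt, status: 'generating' })
  try {
    const resultUrl = await generate(prompt)
    store.update(item.id, { result_url: resultUrl, status: 'complete' })
    return resultUrl
  } catch (err) {
    // The prompt stays on the row, so a retry query on `prompt IS NOT NULL` finds it.
    store.update(item.id, { status: 'failed', error: err.message })
    return null
  }
}
```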

## Pattern 14: Auth Domain Separation and Public Error Sanitization

Some products have multiple authentication domains that look similar from the UI but are not interchangeable at the API layer. Example: a tenant may use `openai-codex` / ChatGPT subscription device auth for text chat, while voice transcription still calls OpenAI's platform `/audio/transcriptions` endpoint with an API key. If transcription falls back to a control-plane API key, quota/billing errors belong to that fallback key, not the tenant's subscription.

Implementation rules:
- Trace provider credentials per capability (`chat`, `transcription`, `vision`, `embeddings`) instead of assuming one tenant auth method covers all provider APIs.
- Name fallback variables by capability, e.g. `TRANSCRIPTION_OPENAI_KEY`, so logs/config make the boundary obvious.
- Do not expose raw provider quota/billing text to end users. Convert `insufficient_quota`, billing-plan errors, 402/429 quota text, and provider docs URLs into a product-safe message such as `Voice transcription is temporarily unavailable. Text chat still works.`
- Keep detailed provider errors in server logs with secrets redacted, and add a regression test that asserts public JSON does not include `insufficient_quota`, provider docs URLs, API keys, or billing internals.

Node helper pattern:
```javascript
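function isOpenAiQuotaError(err) {
  // Stringify the response body: interpolating an object directly yields "[object Object]"
  const data = err?.response?.data
  const body = typeof data === 'string' ? data : JSON.stringify(data ?? '')
  const text = `${err?.status || ''} ${err?.code || ''} ${err?.message || ''} ${body}`.toLowerCase()
  return text.includes('insufficient_quota') ||
    text.includes('quota') ||
    text.includes('billing') ||
    text.includes('usage limits')
}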

function publicVoiceError(err) {
  if (isOpenAiQuotaError(err)) {
    return 'Voice transcription is temporarily unavailable. Text chat still works.'
  }
  return 'Voice transcription failed. Please try again or send text instead.'
}
```
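
The regression check from the implementation rules can be sketched as a scanner over the public response body. The helper name is hypothetical, and the API-key pattern and docs-URL host are assumptions about what leaked values look like; adjust them to the providers in use.

```javascript
// Patterns that must never appear in a public error payload.
// The key pattern and docs URL host are assumptions, not provider guarantees.
const FORBIDDEN_PUBLIC_PATTERNS = [
  /insufficient_quota/i,
  /billing/i,
  /sk-[A-Za-z0-9_-]{8,}/,               // API-key-like token
  /https?:\/\/platform\.openai\.com/i,  // provider docs URL
]

// Returns the patterns that matched; an empty array means the body is safe to send.
function findPublicLeaks(publicBody) {
  const text = JSON.stringify(publicBody)
  return FORBIDDEN_PUBLIC_PATTERNS.filter(pattern => pattern.test(text)).map(String)
}
```

A regression test would call the transcription route with a simulated quota failure and assert that `findPublicLeaks(responseJson)` is empty.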

## Pattern 15: Quota-Aware Degraded Modes Must Be Honest

A quota fallback can change more than the provider name: it may change identity, background, session duration, model quality, supported commands, or whether the result is conversational at all. Do not normalize away differences that the UI/user must know.

Implementation rules:
- Check quota/credits before expensive session or batch creation, but still handle the actual create/start error because balances can change between preflight and use.
- Define an explicit acceptable degraded contract. A documented no-credit sandbox may be appropriate for development continuity; a static image is not an equivalent fallback for an interactive avatar.
- Return safe runtime metadata with the normalized result, e.g. `provider`, `mode`, `sandbox`, `presenter_name`, and effective duration/capabilities.
- Make the frontend label and render the actual fallback. Never display the primary identity or claim “speaking” when a different sandbox presenter is active or when no provider event occurred.
- Verify the fallback end to end independently: token/create, media/data output, user-visible behavior, teardown, and zero/expected credit consumption.
- Automatically restore the primary when quota returns, while keeping both paths covered by contract tests.
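
A sketch of how the preflight check, the guarded start, and the runtime metadata might fit together is shown below. `primary` and `sandbox` are hypothetical provider adapters exposing `name`, `presenterName`, `hasCredits()`, and `start()`; none of these names come from a real SDK.

```javascript
// Try the primary provider, fall back to a sandbox, and report what actually ran.
async function startSession(primary, sandbox) {
  try {
    // Preflight check, but still guard start(): balances can change in between.
    if (await primary.hasCredits()) {
      const session = await primary.start()
      return {
        session,
        meta: { provider: primary.name, mode: 'full', sandbox: false, presenter_name: primary.presenterName },
      }
    }
  } catch (err) {
    console.error(`Primary provider failed: ${err.message}`)
  }
  const session = await sandbox.start()
  return {
    session,
    meta: { provider: sandbox.name, mode: 'degraded', sandbox: true, presenter_name: sandbox.presenterName },
  }
}

// Frontend label derived from the metadata, never from the requested provider.
function sessionLabel(meta) {
  return meta.sandbox ? `${meta.presenter_name} (sandbox mode)` : meta.presenter_name
}
```

Because the label is built from `meta`, the UI cannot show the primary presenter's identity while the sandbox is active.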

For realtime-avatar quota, identity, and opaque-background handling, see `video-media-production-operations/references/realtime-avatar-web-compositing.md`.

## Checklist

- [ ] Capability-specific auth path is documented (`chat` vs `transcription` vs `vision`, etc.)
- [ ] Primary provider call wrapped in try/catch
- [ ] Fall back on any error (not just specific status codes)
- [ ] Response normalized to primary provider's shape
- [ ] Provider-specific model IDs mapped correctly
- [ ] OpenRouter requests include app attribution headers when used
- [ ] Streaming/SSE routes send explicit provider-error events after headers are sent
- [ ] Frontend stream clients show user-visible errors and close disconnected streams
- [ ] Stream URLs are composed with `URLSearchParams`
- [ ] Voice/resource IDs mapped between providers
- [ ] Pipeline segments are idempotent (skip existing outputs)
- [ ] PM2 started with correct --cwd for dotenv
- [ ] Costs estimated before batch operations