Multi-Provider API Resilience

Patterns for building API integrations that automatically fall back between providers when the primary fails.

See references/openai-subscription-vs-api-key-audio.md for a concrete case where subscription/device auth powered text chat but voice transcription still used an API-key capability path and needed sanitized quota handling.

When to Use

App calls external APIs (LLM, TTS, image gen, video gen) that can fail
User has multiple API keys for similar services
Rate limits, quota exhaustion, or auth errors are likely
Long-running pipelines where partial failure wastes money

Pattern 1: LLM Fallback (Anthropic → OpenRouter)

Key Pitfalls Discovered

OpenRouter is NOT Anthropic-compatible — it uses the OpenAI chat completions format (/v1/chat/completions), NOT Anthropic's messages API. You CANNOT just point the Anthropic SDK at OpenRouter's base URL — it will 404.
OpenRouter model IDs differ from Anthropic — e.g., claude-sonnet-4-20250514 on Anthropic is anthropic/claude-sonnet-4.6 on OpenRouter. Use the OpenRouter models endpoint to find correct IDs: curl -s https://openrouter.ai/api/v1/models | python3 -c "import sys,json; [print(m['id']) for m in json.load(sys.stdin)['data'] if 'sonnet' in m['id']]"
Pass through max_tokens — If the OpenRouter fallback hardcodes max_tokens: 4096 but the caller requested 8192, structured JSON responses get truncated. Always use params.max_tokens || 8192 in the fallback.
OpenRouter app headers matter — Include HTTP-Referer and X-Title on OpenRouter requests. Some app deployments work without them during ad hoc curls but fail or lose attribution/routing in production. For Alex's VPS apps, set HTTP-Referer to the public app URL and X-Title to the app name.
Fall back on ANY error, not just 429: Anthropic/OpenAI/Venice keys can fail with 401 (invalid/expired), 402 (payment/credits), 429 (rate limit), 500 (server error), etc. If you have a fallback available, use it for all errors.

Implementation (Node.js)

import Anthropic from '@anthropic-ai/sdk'
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

async function callClaude(params) {
  try {
    return await anthropic.messages.create(params)
  } catch (err) {
    if (process.env.OPENROUTER_API_KEY) {
      console.log(`Anthropic error (${err?.status}). Falling back to OpenRouter...`)
      return await callOpenRouter(params)
    }
    throw err
  }
}

// OpenRouter uses OpenAI format — must convert manually
async function callOpenRouter(params) {
  const messages = []
  if (params.system) messages.push({ role: 'system', content: params.system })
  for (const msg of params.messages) messages.push({ role: msg.role, content: msg.content })

  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
      // App attribution headers: use your public app URL and app name
      'HTTP-Referer': 'https://your-app.com',
      'X-Title': 'Your App Name',
    },
    body: JSON.stringify({
      model: 'anthropic/claude-sonnet-4.6',  // OpenRouter model ID format
      messages,
      max_tokens: params.max_tokens || 8192,
    })
  })

  if (!res.ok) throw new Error(`OpenRouter error ${res.status}: ${await res.text()}`)
  const data = await res.json()
  // Normalize to Anthropic response shape
  return { content: [{ type: 'text', text: data.choices?.[0]?.message?.content || '' }] }
}
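
The conversion above copies msg.content through unchanged, which is fine when it is a plain string. Anthropic-style messages can also carry content as an array of blocks, which the OpenAI chat format does not accept in that shape. The following is a minimal sketch, assuming only text blocks need to survive the fallback; flattenAnthropicContent and toOpenAiMessages are hypothetical helper names, not part of either SDK.

```javascript
// Hypothetical helper: collapse Anthropic-style content into a plain string
// for an OpenAI-format chat message. Non-text blocks (images, tool calls) are
// dropped here, which is an assumption; handle them explicitly if you use them.
function flattenAnthropicContent(content) {
  if (typeof content === 'string') return content
  if (!Array.isArray(content)) return ''
  return content
    .filter((block) => block && block.type === 'text' && typeof block.text === 'string')
    .map((block) => block.text)
    .join('\n')
}

// Build the OpenAI-format message list from Anthropic-style params.
function toOpenAiMessages(params) {
  const messages = []
  if (params.system) {
    messages.push({ role: 'system', content: flattenAnthropicContent(params.system) })
  }
  for (const msg of params.messages) {
    messages.push({ role: msg.role, content: flattenAnthropicContent(msg.content) })
  }
  return messages
}
```

If this approach fits, callOpenRouter could build its messages array with toOpenAiMessages(params) instead of copying content directly.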

Pattern 2: TTS Fallback (ElevenLabs → OpenAI)

Key Pitfalls

ElevenLabs burns credits fast — a 2-minute story can need 188+ credits. Check quota BEFORE starting a batch.
ElevenLabs quota errors are 401 not 429 — the error is {"status":"quota_exceeded"} with HTTP 401, not a rate limit.
Voice ID mapping: when switching providers, map voice IDs between them (ElevenLabs voice IDs are opaque alphanumeric strings, OpenAI uses names like 'onyx', 'nova', 'echo').
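
The last two pitfalls can be sketched as a pair of small helpers. This is an illustration under stated assumptions: the map keys are placeholders rather than real ElevenLabs IDs, and the helper names toOpenAiVoice and isElevenLabsQuotaError are hypothetical.

```javascript
// Hypothetical mapping from ElevenLabs voice IDs to OpenAI voice names.
// The keys are placeholders; substitute the IDs from your own ElevenLabs account.
const VOICE_MAP = new Map([
  ['ELEVENLABS_NARRATOR_VOICE_ID', 'onyx'],
  ['ELEVENLABS_HEROINE_VOICE_ID', 'nova'],
])
const DEFAULT_OPENAI_VOICE = 'onyx'

// Unknown IDs fall back to a default voice instead of failing the whole batch.
function toOpenAiVoice(elevenLabsVoiceId) {
  return VOICE_MAP.get(elevenLabsVoiceId) || DEFAULT_OPENAI_VOICE
}

// Treat HTTP 401 with a quota_exceeded body as quota exhaustion, not bad credentials.
// bodyText is the raw response body; the exact JSON nesting is not assumed.
function isElevenLabsQuotaError(status, bodyText) {
  return status === 401 && String(bodyText || '').includes('quota_exceeded')
}
```

Checking the body text as well as the status keeps a genuinely invalid key (also 401) from being mistaken for an exhausted quota.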

OpenAI TTS Voices (cheaper alternative)

Voice	Character
onyx	Deep authoritative male (narrator)
echo	Younger male
fable	British, older feel
nova	Young energetic female
shimmer	Mature female
alloy	Neutral, slightly older female

Implementation

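import fs from 'node:fs'

// `text` and `outputPath` are supplied by the caller
const res = await fetch('https://api.openai.com/v1/audio/speech', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'tts-1-hd',
    input: text,
    voice: 'onyx',  // or echo, fable, nova, shimmer, alloy
    response_format: 'mp3',
  })
})
// Fail loudly instead of writing an error body to disk as if it were audio
if (!res.ok) throw new Error(`OpenAI TTS error ${res.status}: ${await res.text()}`)
const buffer = Buffer.from(await res.arrayBuffer())
fs.writeFileSync(outputPath, buffer)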

Pattern 3: Streaming/SSE Provider Failures

For routes that stream LLM output to the browser (Server-Sent Events, chunked responses), do not rely on normal Express error middleware after headers have been sent. Once the response is streaming, an upstream provider failure can otherwise leave the UI blank or waiting forever.

Server pattern:

try {
  await streamProviderOutput(req, res)
} catch (error) {
  if (res.headersSent) {
    const msg = (error.message || 'Generation failed').replace(/\n/g, ' ')
    res.write(`data: PROVIDER_ERROR:${msg}\n\n`)
    res.write('data: END_OF_ALL_MESSAGES\n\n')
    return res.end()
  }
  next(error)
}

Frontend pattern:

const source = new EventSource(url)
source.onmessage = (event) => {
  if (event.data.startsWith('PROVIDER_ERROR:')) {
    showUserVisibleError(event.data.replace('PROVIDER_ERROR:', '').trim())
    source.close()
    return
  }
  if (event.data === 'END_OF_ALL_MESSAGES') source.close()
}
source.onerror = () => {
  showUserVisibleError('The stream disconnected. Please try again.')
  source.close()
}

Build stream URLs with URLSearchParams; raw interpolation breaks names/locations/prompts containing spaces, ampersands, or slashes.
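
A minimal sketch of that URL construction follows. The /api/stream path, the parameter names, and the buildStreamUrl helper are hypothetical and only illustrate the encoding behavior.

```javascript
// Build a stream URL with URLSearchParams so every value is percent-encoded.
// The path and parameter names passed in below are hypothetical.
function buildStreamUrl(basePath, params) {
  const query = new URLSearchParams()
  for (const [key, value] of Object.entries(params)) {
    // Skip missing values rather than sending the literal string "undefined"
    if (value !== undefined && value !== null) query.set(key, String(value))
  }
  return `${basePath}?${query.toString()}`
}

const url = buildStreamUrl('/api/stream', {
  name: 'Tom & Jerry',
  location: 'New York/Brooklyn',
})
```

The ampersand and slash inside the values are encoded, so the server reads back exactly two parameters with their original text, and the result can be passed straight to new EventSource(url).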

Pattern 4: Idempotent Pipeline Segments

When a pipeline processes items sequentially and can fail mid-batch (e.g., generating audio for 40 segments), make it resumable:

import fs from 'node:fs'

let skipped = 0
let generated = 0

for (const seg of segments) {
  // Skip if this segment already has valid output (the size check rejects empty or truncated files)
  if (fs.existsSync(seg.outputPath) && fs.statSync(seg.outputPath).size > 1000) {
    console.log(`Skipping segment ${seg.id} (already exists)`)
    skipped++
    continue
  }
  // Generate only what's missing
  await generateSegment(seg)
  generated++
}
console.log(`Done: ${generated} generated, ${skipped} skipped`)

This prevents wasting money re-generating items that succeeded before the failure.

Pattern 5: PM2 and dotenv

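Critical pitfall: dotenv/config resolves .env relative to the process's working directory, not the script's location, and PM2 uses the directory where pm2 start was invoked as that working directory. If you run pm2 start /path/to/app/server/index.js from a different directory, dotenv/config will look for .env in the wrong place.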

Fix: Always specify --cwd:

pm2 start /path/to/server/index.js --name app-name --cwd /path/to/app

Verify with: pm2 show app-name | grep "exec cwd"

Pattern 6: Replicate SDK FileOutput

The Replicate Node.js SDK (replicate.run()) returns a FileOutput object, NOT a string URL. Calling .url() on it returns a URL object, NOT a string. SQLite and other storage that expects strings will throw: SQLite3 can only bind numbers, strings, bigints, buffers, and null.

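// replicate.run() returns a FileOutput (a ReadableStream with a .url() method)
const output = await replicate.run('black-forest-labs/flux-2-pro', { input })
const imageUrl = toUrlString(output)

// Convert whatever the SDK returned into a plain string that is safe to store
function toUrlString(output) {
  if (typeof output === 'string') return output
  if (typeof output?.url === 'function') {
    const url = output.url()
    return typeof url === 'string' ? url : url.href || String(url)
  }
  return String(output)
}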

Always extract .href from the URL object to get a plain string before storing in a database.

Pattern 7: Background Processing with Progress Polling

For bulk API operations (e.g., generating 60 images), respond immediately and process in background. The frontend polls a progress endpoint:

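// Server: respond immediately, process in background
app.post('/api/projects/:id/generate-all', (req, res) => {
  // Assumes items live in the same items table the progress endpoint reads
  const items = db.prepare('SELECT * FROM items WHERE project_id = ?').all(req.params.id)
  res.json({ message: 'Generating...', total: items.length })

  // Background IIFE: per-item try/catch so one failure does not stop the batch
  ;(async () => {
    for (const item of items) {
      if (item.result_url) continue // already generated
      try { await generateItem(item) }
      catch (err) { console.error(`Failed ${item.id}:`, err.message) }
    }
  })()
})

// Progress endpoint: COUNT(result_url) counts only rows with a non-NULL result
app.get('/api/projects/:id/progress', (req, res) => {
  const progress = db
    .prepare('SELECT COUNT(result_url) AS done, COUNT(*) AS total FROM items WHERE project_id = ?')
    .get(req.params.id)
  res.json(progress)
})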

Frontend polls every 5 seconds during generation and stops when done === total.
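
The polling loop can be sketched as below. This is a hedged illustration: pollUntilDone and its options are hypothetical names, and getProgress stands in for any async function that resolves to { done, total }, such as a fetch() wrapper around the progress endpoint. The maxPolls cap is an added assumption so that items which fail, and therefore never gain a result, cannot keep the client polling forever.

```javascript
// Hypothetical polling helper. getProgress is any async function that resolves
// to { done, total }, e.g. a fetch() wrapper around the progress endpoint.
async function pollUntilDone(getProgress, options = {}) {
  const { intervalMs = 5000, maxPolls = 240, onUpdate = () => {} } = options
  for (let i = 0; i < maxPolls; i++) {
    const progress = await getProgress()
    onUpdate(progress) // let the UI render "done / total"
    if (progress.total > 0 && progress.done >= progress.total) {
      return { finished: true, ...progress }
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs))
  }
  // Give up after maxPolls so failed items cannot spin the loop forever
  return { finished: false }
}
```

A caller might pass () => fetch(`/api/projects/${id}/progress`).then((r) => r.json()) as getProgress and show an error state when finished is false.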

Pattern 8: OpenAI/Codex subscription image generation with fallback

Some apps should use Alex's OpenAI/ChatGPT subscription auth for image generation/editing instead of an OpenAI platform API key. In Hermes Agent this is the openai-codex image-gen provider: it calls the ChatGPT/Codex backend with a normal chat model hosting the image_generation tool (gpt-image-2). This is capability-specific auth: do not assume OPENAI_API_KEY is available or desired.

Implementation rules:
- Distinguish provider/model IDs such as openai-codex/gpt-image-2 from FAL model IDs.
- For image edits, pass reference images as input_image message content to the Codex Responses stream; for FAL fallbacks, upload local files and send image_urls/image_url in that model's schema.
- Keep the UI model picker separate from the default: the default can be OpenAI/Codex while still allowing FAL or high-quality GPT Image variants.
- On OpenAI/Codex failure, fall back to a known reference-capable image edit model (for Hermes Creative currently fal-ai/nano-banana-2/edit) and persist both requested_model and the actual provider/model.
- Do not spend image-generation credits for smoke tests unless the user approves; verify auth/config/build/model catalogs first.
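
The rule about persisting both the requested and the actual model can be sketched as a thin wrapper. This is an illustration only: generateWithRecordedFallback is a hypothetical name, and generate stands in for whatever function calls a given provider/model and returns its result.

```javascript
// Hypothetical wrapper: try the requested model, fall back once, and report
// which model actually produced the result so both can be persisted.
// generate(model) is an assumed async function that calls the provider.
async function generateWithRecordedFallback(requestedModel, fallbackModel, generate) {
  try {
    const result = await generate(requestedModel)
    return { result, requested_model: requestedModel, actual_model: requestedModel, fell_back: false }
  } catch (err) {
    // Keep the primary error for server logs; do not show it to end users
    const primaryError = String(err?.message || err)
    const result = await generate(fallbackModel)
    return {
      result,
      requested_model: requestedModel,
      actual_model: fallbackModel,
      fell_back: true,
      primary_error: primaryError,
    }
  }
}
```

Called with 'openai-codex/gpt-image-2' and 'fal-ai/nano-banana-2/edit', the returned requested_model and actual_model can be written to the same row as the image, so the UI and later debugging both see which provider really ran.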

Pattern 9: Image Generation with Multi-Reference Consistency

When generating sequential frames (first frame → last frame for video), pass the first frame as a reference image for the last frame to maintain visual consistency:

const refImages = [...characterRefs, ...setRefs]

// For last frame, add first frame as highest-priority reference
if (frameType === 'last' && shot.first_frame_url) {
  refImages.unshift(shot.first_frame_url)
}

const imageUrl = await generateImageFlux(prompt, refImages, { aspect_ratio: '16:9' })

FLUX.2 Pro supports up to 8 reference images. Priority order matters — put the most important reference first.

Pattern 10: Rate-Limited Parallel Queue (Video/GPU Workloads)

For expensive, slow API calls (video generation ~$0.07 each, ~45s), use an in-memory queue with configurable concurrency instead of sequential processing or unlimited parallelism:

const MAX_CONCURRENT = 3
let activeJobs = 0
const queue = []
const inFlight = new Set() // ids that are queued or currently running

function enqueueJob(id) {
  if (inFlight.has(id)) return // no duplicates, including jobs already running
  inFlight.add(id)
  queue.push({ id })
  processQueue()
}

function processQueue() {
  while (queue.length > 0 && activeJobs < MAX_CONCURRENT) {
    const job = queue.shift()
    activeJobs++
    // Fire-and-forget: the slot is freed on completion or failure
    processJob(job.id)
      .catch(err => console.error(`Failed ${job.id}:`, err.message))
      .finally(() => { activeJobs--; inFlight.delete(job.id); processQueue() })
  }
}

Key design points:
- Skip complete items: only queue pending/failed, never overwrite successful results
- Mark status in DB: pending → generating → complete/failed (UI polls this)
- Fire-and-forget with finally: each job runs independently, slot freed on completion or failure
- No duplicates: check queue before adding
- Cost estimation in UI: show ${pendingCount} × $0.07 = $X.XX before starting
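
The cost-estimation point can be sketched as a small formatter. The function name formatCostEstimate is hypothetical, and the $0.07 default simply reuses the per-clip figure quoted above; substitute your provider's real price.

```javascript
// Hypothetical formatter for the pre-run cost estimate shown in the UI.
// unitCostUsd defaults to the ~$0.07 per-clip figure used in this pattern.
function formatCostEstimate(pendingCount, unitCostUsd = 0.07) {
  // Round to whole cents so floating-point noise never reaches the label
  const totalCents = Math.round(pendingCount * unitCostUsd * 100)
  const total = (totalCents / 100).toFixed(2)
  return `${pendingCount} × $${unitCostUsd.toFixed(2)} = $${total}`
}
```

For example, formatCostEstimate(12) yields the string "12 × $0.07 = $0.84", which can be shown in a confirmation dialog before any job is enqueued.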

Pattern 11: LLM JSON Repair for Truncated Responses

When LLMs generate large structured JSON (especially via fallback providers that may have lower effective token limits), responses can get truncated mid-string. Add a repair function:

function safeParseJSON(text) {
  try {
    return JSON.parse(text)
  } catch (e) {
    console.log('JSON parse failed, attempting repair...')
    // The repair only covers a truncated array of objects: cut back to the
    // last closing brace and close the array. Anything else is rethrown.
    let fixed = String(text).trim()
    const lastCloseBrace = fixed.lastIndexOf('}')
    if (fixed.startsWith('[') && lastCloseBrace > 0) {
      fixed = fixed.substring(0, lastCloseBrace + 1) + ']'
    }
    try { return JSON.parse(fixed) }
    catch (e2) { throw new Error(`Invalid JSON from LLM: ${e.message}`) }
  }
}

Also: when the LLM needs to return timestamps that match audio segments, compute them from segment indices as a fallback. LLMs frequently omit or miscalculate timestamp fields even when instructed.
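
One way to sketch that timestamp fallback, assuming the duration of each audio segment is already known and that each LLM item lines up with a segment by position (or by a segment_index field). Both assumptions, and the function and field names, are hypothetical.

```javascript
// Hypothetical fallback: derive start/end times from known segment durations.
function timestampsFromSegments(durationsSec) {
  let cursor = 0
  return durationsSec.map((duration, index) => {
    const start = cursor
    cursor += duration
    return { index, start, end: cursor }
  })
}

// Keep LLM timestamps only when they are usable; otherwise substitute the
// computed ones. Items are matched by segment_index, else by array position.
function withFallbackTimestamps(llmItems, durationsSec) {
  const computed = timestampsFromSegments(durationsSec)
  return llmItems.map((item, i) => {
    const valid =
      Number.isFinite(item.start) && Number.isFinite(item.end) && item.end > item.start
    const seg = computed[item.segment_index ?? i]
    if (valid || !seg) return item
    return { ...item, start: seg.start, end: seg.end }
  })
}
```

Items whose timestamps are missing or inverted get the computed range for their segment, while valid LLM timestamps are left untouched.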

Pattern 12: ffmpeg Assembly for AI-Generated Video Clips

AI video models (Wan 2.2, etc.) output clips with inconsistent formats. Normalize before concatenating:

# Normalize to 720p, 24fps, h264, no audio
ffmpeg -y -i input.mp4 \
  -vf "scale=1280:720:force_original_aspect_ratio=decrease,pad=1280:720:(ow-iw)/2:(oh-ih)/2" \
  -r 24 -c:v libx264 -preset fast -crf 23 -an -movflags +faststart output.mp4

# Concat with list file
ffmpeg -y -f concat -safe 0 -i concat.txt -c:v libx264 -preset fast -crf 23 video_only.mp4

# Merge video + audio (use -shortest if durations don't match)
ffmpeg -y -i video_only.mp4 -i narration.mp3 -c:v copy -c:a aac -b:a 192k -shortest final.mp4

Cache downloaded clips locally so re-exports don't re-download from expiring Replicate URLs.
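
The concat.txt list file used above is not shown, so here is a hedged sketch of generating it from Node. buildConcatList is a hypothetical helper; it assumes the list uses one file '<path>' directive per line and escapes embedded single quotes by closing the quote, adding an escaped quote, and reopening.

```javascript
// Hypothetical helper: build the list-file text for ffmpeg's concat demuxer
// (the file passed as: -f concat -safe 0 -i concat.txt).
function buildConcatList(clipPaths) {
  return clipPaths
    // Escape single quotes as '\'' so paths like it's.mp4 stay valid
    .map((clipPath) => `file '${clipPath.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n'
}
```

Writing the returned string to concat.txt with the locally cached clip paths, in playback order, gives the concat step a stable input that does not depend on expiring remote URLs.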

Pattern 13: Image Generation Safety Filter Cascade

AI image generation models have multi-layer safety filtering that can block legitimate creative content (mythology, dramatic scenes, historical violence). Different providers have different filter strictness AND different filter architecture:

The Three Layers (discovered empirically)

Provider-level output filter (e.g., Replicate) — runs AFTER image generation, classifies the output image. Cannot be bypassed by prompt changes. Character names from mythology (Zahhak, Ahriman) can cause the MODEL to generate imagery that triggers this even with perfectly clean prompts.
Provider-level input filter (e.g., fal.ai) — runs BEFORE generation on the prompt text. enable_safety_checker: false only disables the OUTPUT filter. The INPUT filter is separate and cannot be disabled. Blocks certain name+context combinations.
Model-level generation — some models (Qwen) have no safety filtering at all on fal.ai.

Implementation: Three-Tier Cascade

async function generateWithFallback(prompt, refs, options = {}) {
  // Tier 1: Replicate (fast, cheap). Its output filter can block.
  try {
    return await generateImageFlux(prompt, refs, { safety_tolerance: 5, ...options })
  } catch (err) {
    if (!isSafetyError(err)) throw err
    console.warn('Replicate FLUX blocked by safety filter, trying fal.ai FLUX...')
  }

  // Tier 2: fal.ai FLUX (output filter disabled). Its input filter can still block.
  try {
    // Uses enable_safety_checker: false, safety_tolerance: 5
    return await generateImageFluxFal(prompt, refs, options)
  } catch (err) {
    if (!isSafetyError(err)) throw err
    console.warn('fal.ai FLUX blocked by safety filter, trying Qwen...')
  }

  // Tier 3: Qwen on fal.ai (no safety filter at all), last resort
  // fal-ai/qwen-image-2/text-to-image needs no safety params
  return await generateImageQwenFal(prompt, options)
}

function isSafetyError(err) {
  const msg = err?.message || ''
  return msg.includes('flagged as sensitive') || msg.includes('safety') ||
         msg.includes('NSFW') || msg.includes('content_policy_violation')
}

Key Lessons

Never retry the same provider — if Replicate's output filter blocks a prompt at max safety_tolerance, it will block every retry. Switch providers.
Prompt rewriting helps but isn't sufficient — the output classifier flags the generated IMAGE, not the prompt. A perfectly clean prompt about "a king in ceremonial robes" can still produce an image the classifier rejects.
Log which provider succeeded — essential for debugging. "⚡ FLUX blocked, trying Qwen..." tells you the filter pattern.
Save prompts BEFORE generation — if you only save the prompt on success, retry loops can't find failed items to rewrite. This was a critical bug: retry queried WHERE prompt IS NOT NULL but prompt was only saved in the same UPDATE as the result URL.

Pattern 14: Auth Domain Separation and Public Error Sanitization

Some products have multiple authentication domains that look similar from the UI but are not interchangeable at the API layer. Example: a tenant may use openai-codex / ChatGPT subscription device auth for text chat, while voice transcription still calls OpenAI's platform /audio/transcriptions endpoint with an API key. If transcription falls back to a control-plane API key, quota/billing errors belong to that fallback key, not the tenant's subscription.

Implementation rules:
- Trace provider credentials per capability (chat, transcription, vision, embeddings) instead of assuming one tenant auth method covers all provider APIs.
- Name fallback variables by capability, e.g. TRANSCRIPTION_OPENAI_KEY, so logs/config make the boundary obvious.
- Do not expose raw provider quota/billing text to end users. Convert insufficient_quota, billing-plan errors, 402/429 quota text, and provider docs URLs into a product-safe message such as Voice transcription is temporarily unavailable. Text chat still works.
- Keep detailed provider errors in server logs with secrets redacted, and add a regression test that asserts public JSON does not include insufficient_quota, provider docs URLs, API keys, or billing internals.

Node helper pattern:

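function isOpenAiQuotaError(err) {
  // response.data may be a parsed object; stringify it so it does not collapse to "[object Object]"
  const data = err?.response?.data
  const dataText = typeof data === 'string' ? data : JSON.stringify(data ?? '')
  const text = `${err?.status || ''} ${err?.code || ''} ${err?.message || ''} ${dataText}`.toLowerCase()
  return text.includes('insufficient_quota') ||
    text.includes('quota') ||
    text.includes('billing') ||
    text.includes('usage limits')
}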

function publicVoiceError(err) {
  if (isOpenAiQuotaError(err)) {
    return 'Voice transcription is temporarily unavailable. Text chat still works.'
  }
  return 'Voice transcription failed. Please try again or send text instead.'
}
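
The regression test mentioned in the implementation rules could be built on a check like the one below. This is a sketch under stated assumptions: findPublicLeaks is a hypothetical helper, and the pattern list (including the sk- key shape and the docs hostname) is illustrative rather than exhaustive.

```javascript
// Hypothetical leak check for public error payloads. The patterns are
// illustrative assumptions; extend them for your providers.
const FORBIDDEN_PUBLIC_PATTERNS = [
  /insufficient_quota/i,
  /billing/i,
  /platform\.openai\.com/i, // provider docs URLs
  /sk-[A-Za-z0-9_-]{8,}/,   // API-key-shaped strings
]

// Returns the patterns (as strings) that matched the serialized payload.
// An empty array means the payload is safe to return to end users.
function findPublicLeaks(payload) {
  const text = JSON.stringify(payload)
  return FORBIDDEN_PUBLIC_PATTERNS.filter((re) => re.test(text)).map(String)
}
```

A regression test can then call the real error handler with a simulated quota failure and assert that findPublicLeaks(responseBody) returns an empty array.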

Pattern 15: Quota-Aware Degraded Modes Must Be Honest

A quota fallback can change more than the provider name: it may change identity, background, session duration, model quality, supported commands, or whether the result is conversational at all. Do not normalize away differences that the UI/user must know.

Implementation rules:
- Check quota/credits before expensive session or batch creation, but still handle the actual create/start error because balances can change between preflight and use.
- Define an explicit acceptable degraded contract. A documented no-credit sandbox may be appropriate for development continuity; a static image is not an equivalent fallback for an interactive avatar.
- Return safe runtime metadata with the normalized result, e.g. provider, mode, sandbox, presenter_name, and effective duration/capabilities.
- Make the frontend label and render the actual fallback. Never display the primary identity or claim “speaking” when a different sandbox presenter is active or when no provider event occurred.
- Verify the fallback end to end independently: token/create, media/data output, user-visible behavior, teardown, and zero/expected credit consumption.
- Automatically restore the primary when quota returns, while keeping both paths covered by contract tests.
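
The runtime-metadata and labeling rules can be sketched as follows. The field names follow the examples in the list above, but the exact shape, the mode values, and both function names are hypothetical.

```javascript
// Hypothetical shape for the safe runtime metadata returned with a normalized result.
function buildRuntimeMetadata({ provider, sandbox = false, presenterName, maxDurationSec }) {
  return {
    provider,
    mode: sandbox ? 'degraded' : 'primary', // assumed mode values
    sandbox,
    presenter_name: presenterName,
    max_duration_sec: maxDurationSec,
  }
}

// The UI label names the presenter that is actually active, never the primary identity.
function presenterLabel(meta) {
  return meta.sandbox
    ? `${meta.presenter_name} (sandbox fallback)`
    : meta.presenter_name
}
```

Because the label is derived from the metadata of the session that actually started, the frontend cannot keep showing the primary presenter's name after a quota fallback.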

For realtime-avatar quota, identity, and opaque-background handling, see video-media-production-operations/references/realtime-avatar-web-compositing.md.

Checklist

[ ] Capability-specific auth path is documented (chat vs transcription vs vision, etc.)
[ ] Primary provider call wrapped in try/catch
[ ] Fallback on any error (not just specific status codes)
[ ] Response normalized to primary provider's shape
[ ] Provider-specific model IDs mapped correctly
[ ] OpenRouter requests include app attribution headers when used
[ ] Streaming/SSE routes send explicit provider-error events after headers are sent
[ ] Frontend stream clients show user-visible errors and close disconnected streams
[ ] Stream URLs are composed with URLSearchParams
[ ] Voice/resource IDs mapped between providers
[ ] Pipeline segments are idempotent (skip existing outputs)
[ ] PM2 started with correct --cwd for dotenv
[ ] Costs estimated before batch operations