---
name: video-story-yolo-pipeline
category: software-development
description: Video Story YOLO 10-step pipeline patterns, resilience strategies, and iOS PWA considerations
---

# Video Story YOLO Pipeline

10-step automated video generation pipeline: Voices → Script → Audio → Scenes → Shots → Refs → Frames → Videos → Lip Sync → Export

## Key Patterns

### Progress Bar Feedback
- API returns `yoloStep` (camelCase) in status endpoint
- Frontend polls `/api/projects/:id/status` every 3s when running
- Show live message feed (last 5 messages with timestamps)
- Log verbose messages at each sub-step (e.g., "Scene 3/5: 'The Storm' — analyzing...")
- Include percentage progress for long-running steps (frames, videos)
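
The feed and percentage bullets above can be sketched as two small pure helpers. This is a minimal illustration: `pushMessage`, `formatProgress`, and the message shape are hypothetical names, not functions from the app.

```javascript
// Hypothetical helpers; names and the { text, at } shape are assumptions.
const MAX_FEED = 5;

// Append a timestamped message and keep only the newest MAX_FEED entries.
function pushMessage(feed, text, now = new Date()) {
  return [...feed, { text, at: now.toISOString() }].slice(-MAX_FEED);
}

// Build a verbose sub-step message with a rounded percentage.
function formatProgress(label, done, total) {
  const pct = total > 0 ? Math.round((done / total) * 100) : 0;
  return `${label} ${done}/${total} (${pct}%)`;
}
```

The server could call `formatProgress("Frame", i, total)` inside long-running steps, while the frontend applies `pushMessage` to each polled message.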

### LLM Call Resilience
- Per-scene retry (3x) for breakdown-shots endpoint
- If one scene fails, log but continue to next scenes
- Pipeline-level retry (3x) for entire steps
- After retries, validate ALL scenes/shots exist before proceeding
- If validation fails, stop with clear error listing what's missing
- Scene breakdowns run 3 at a time (parallel batches via Promise.all)
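
A minimal sketch of the per-scene retry, batch-of-3 parallelism, and final validation described above. `withRetry`, `breakdownAllScenes`, and the `breakdownScene` callback are hypothetical names; the real endpoint call would be passed in.

```javascript
// Retry an async function up to `attempts` times; rethrow the last error.
async function withRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Process scenes in parallel batches; a failed scene is logged, not fatal.
async function breakdownAllScenes(scenes, breakdownScene, batchSize = 3) {
  const failed = [];
  for (let i = 0; i < scenes.length; i += batchSize) {
    const batch = scenes.slice(i, i + batchSize);
    await Promise.all(
      batch.map(async (scene) => {
        try {
          await withRetry(() => breakdownScene(scene));
        } catch (err) {
          console.warn(`Scene "${scene.title}" failed: ${err.message}`);
          failed.push(scene.title);
        }
      })
    );
  }
  // Validation: stop with a clear error listing what is missing.
  if (failed.length > 0) {
    throw new Error(`Missing shot breakdowns for: ${failed.join(", ")}`);
  }
}
```

A pipeline-level retry can wrap the whole step the same way, for example `withRetry(() => breakdownAllScenes(scenes, breakdownScene))`.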

### NSFW / Safety Filter Handling (CRITICAL LESSONS)

**Safety tolerance settings:**
- `images.js` generateImageFlux(): `safety_tolerance: 5` (MUST be 5, max permissive for Replicate FLUX.2 Pro, range 1-5)
- `advanced-images.js` generateAdvanced(): default fallback `5` for models with safety_tolerance param
- fal.ai models: `enable_safety_checker: false` disables output filter but NOT input checker

**The 3-tier provider cascade (learned through trial and error):**
1. **Replicate FLUX.2 Pro** (safety_tolerance: 5): Even at max, OUTPUT images get flagged by post-generation classifier. Mythological/dramatic characters (Zahhak, Ahriman) consistently trigger it regardless of prompt content.
2. **fal.ai FLUX.2 Pro** (`enable_safety_checker: false`): Disables output filter. But has separate INPUT prompt content checker that CANNOT be disabled. Blocks some names/themes at the prompt level.
3. **fal.ai Qwen** (last tier): prefer EDIT mode with refs for frame fallbacks, not text-to-image. `fal-ai/qwen-image-2/pro/edit` preserves identity/style better than a text-only fallback. Use text-to-image only as the last resort when there are no usable refs.

**Important production finding:** Qwen Pro Edit currently errors at 4 refs in production with `Maximum 3 reference images allowed`. Treat the practical cap as **3 refs**, not 4, and trim refs by priority.

**Why retries MUST use fal.ai/Qwen, NOT Replicate:**
Replicate's safety filter runs on the OUTPUT: the model generates an image and THEN the image gets flagged, so rewriting the prompt alone is not enough. Retrying the same provider is pointless; switch providers.

**Prompt rewriting with Haiku (`rewriteFlaggedPrompt()` in `llm.js`):**
- Uses Claude 3.5 Haiku via OpenRouter (cheap ~$0.001, fast)
- Analyzes flagged prompts, rewrites preserving visual intent
- Still worth doing — cleaner prompts reduce both input AND output flagging
- Fallback: basic "Safe for work, family friendly, " prefix if Haiku unavailable
- Direct Anthropic API key may be invalid — OpenRouter path is primary

**CRITICAL BUG (fixed): reference_image_prompt must be saved BEFORE generation, not only on success.**
- Root cause: prompt only saved in same UPDATE as URL (on success). When gen failed, prompt stayed NULL. Retry loop queried `WHERE reference_image_prompt IS NOT NULL` → found nothing → skipped.
- Fix: save prompt BEFORE calling generateImageFlux(). Guard with `WHERE reference_image_prompt IS NULL`.
- Retry loop also rebuilds prompt from entity data if somehow still NULL.
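
The ordering fix can be sketched as follows. This is an illustration under assumptions: `db.run(sql, params)`, the `entities` table, the `reference_image_url` column, and the `generate` callback are stand-ins, and only `reference_image_prompt` is taken from the notes above.

```javascript
// Hypothetical sketch: `db.run` and `generate` are injected stand-ins, and the
// table/URL column names are assumptions.
async function generateReference(db, entity, prompt, generate) {
  // Persist the prompt first so a failed generation still leaves it queryable.
  await db.run(
    "UPDATE entities SET reference_image_prompt = ? WHERE id = ? AND reference_image_prompt IS NULL",
    [prompt, entity.id]
  );
  try {
    const url = await generate(prompt);
    await db.run("UPDATE entities SET reference_image_url = ? WHERE id = ?", [url, entity.id]);
    return url;
  } catch (err) {
    // Leave the URL NULL; the retry loop finds the row by its NULL URL.
    return null;
  }
}
```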

**YOLO retry flow (step 6 refs and step 7 frames):**
1. First pass: Replicate FLUX.2 Pro with safety_tolerance: 5 (15 parallel)
2. Poll for completion with stall detection (30s)
3. Find all entities/shots with NULL image URLs (do not also filter on `reference_image_prompt IS NOT NULL`; that filter caused the skipped-retry bug above)
4. Rewrite prompts with Haiku AI (currently sequential — should be parallelized)
5. Regenerate via fallback cascade with DB freshness checks:
   a. Check DB — skip if URL already set by background task
   b. fal.ai FLUX (`generateImageFluxFal()`) with safety checker off
   c. For **entity refs**, Qwen text-to-image is acceptable as last resort when identity refs don't exist.
   d. For **shot frames**, prefer **Qwen edit with refs**, not Qwen text-only.
6. Up to 2 retry cycles
7. Validation gate: throw an error that lists the failed entity names and tells the user to edit them in the Analysis/Shots tab
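
Step 5 of the flow above (freshness check, then provider cascade) might look roughly like this. `regenerateWithCascade`, the `providers` list shape, and `getUrl` are hypothetical; the real code would pass wrappers around `generateImageFluxFal()` and the Qwen generator.

```javascript
// Hypothetical sketch: `providers` is an ordered list of { name, generate }.
// `getUrl` re-reads the DB so a background task's result is not overwritten.
async function regenerateWithCascade(item, providers, getUrl) {
  const existing = await getUrl(item.id);
  if (existing) return { url: existing, via: "background" };

  for (const provider of providers) {
    try {
      const url = await provider.generate(item);
      return { url, via: provider.name };
    } catch (err) {
      console.warn(`${provider.name} blocked "${item.name}", trying next provider`);
    }
  }
  // Nothing worked; the caller's validation gate reports this item.
  return { url: null, via: null };
}
```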

**Critical frame-consistency lesson:** frame retry code must use the SAME ranked reference bundle as the primary generation path. Do not use a weaker retry path.
- Build a shared helper that ranks refs in this order:
  1. continuity ref (when generating a last frame, this is the shot's first frame)
  2. visible character refs
  3. set ref
  4. prop refs
- Then trim by provider limits (for Qwen edit, use top 3 refs only).
- Never let a last-frame retry fall back to text-only generation if continuity/character refs exist.
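
One way to express the shared ranking helper, as a sketch. `rankRefs` and its input shape are assumptions; the priority order and the default limit of 3 follow the list above.

```javascript
// Hypothetical helper; the input shape (URL strings per slot) is an assumption.
// Order: continuity ref, visible character refs, set ref, prop refs.
function rankRefs({ continuity, characters = [], set, props = [] }, limit = 3) {
  const ranked = [continuity, ...characters, set, ...props].filter(Boolean);
  // Drop duplicate URLs while keeping the first (highest-priority) occurrence.
  return [...new Set(ranked)].slice(0, limit);
}
```

Both the primary frame path and the retry path would call the same helper, so a retry never sees a weaker bundle than the first attempt.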

**Activity log messages during retry:**
```
⚠️ 3 references failed (likely flagged as sensitive): Zahhak, Fereydun, Ahriman
🔄 Retry 1/2: Rewriting prompts with AI to avoid safety filters...
  ✏️ Rewrote prompt for "Zahhak"
🔄 Regenerating 3 reference images via fal.ai (safety checker disabled)...
  ⚡ fal.ai FLUX blocked "Zahhak", trying Qwen...
  ✓ Generated "Zahhak" via Qwen (fallback)
```

**Upstream safety — reduce flags at prompt generation time:**
- breakdownShots() system prompt includes safety filter compliance instructions
- analyzeStory() appearance field warns about safety filters
- buildCharacterPrompt/buildSetPrompt/buildPropPrompt prepend "Safe for all audiences."

### Validation Gates (pipeline MUST NOT proceed with missing assets)
- After refs (step 6→7): block if ANY character or set refs are NULL
- After frames (step 7→8): block if ANY shots missing first_frame_url
- Error names specific entities and tells user WHERE to fix them
- Old behavior was to "move on" after a stall, which silently broke downstream steps
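
The two gates could be sketched like this. The function names, the row shapes (`type`, `name`, `id`), the `reference_image_url` column, and the exact hint wording are assumptions; `first_frame_url` is taken from the bullets above.

```javascript
// Hypothetical sketch; row shapes and hint wording are assumptions.
function assertRefsReady(entities) {
  const missing = entities
    .filter((e) => (e.type === "character" || e.type === "set") && !e.reference_image_url)
    .map((e) => e.name);
  if (missing.length > 0) {
    throw new Error(
      `Missing reference images for: ${missing.join(", ")}. Edit them in the Analysis tab.`
    );
  }
}

function assertFramesReady(shots) {
  const missing = shots.filter((s) => !s.first_frame_url).map((s) => `shot ${s.id}`);
  if (missing.length > 0) {
    throw new Error(`Missing first frames for: ${missing.join(", ")}. Edit them in the Shots tab.`);
  }
}
```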

### Image Generation Settings (per-project)
- DB columns on `projects`:
  - `aspect_ratio` (TEXT, default `16:9`)
  - `reference_image_model` (TEXT, default `qwen`)
  - `frame_image_model` (TEXT, default `qwen`)
  - legacy `image_mode` may still exist for backward compatibility, but new code should prefer the explicit reference/frame settings.
- Home/create-project flow can set `aspect_ratio` up front (`16:9` or `9:16`).
- Analysis tab should expose three separate controls:
  - Aspect ratio
  - Reference model
  - Frame model
- **Qwen is the default for BOTH references and frames.**
- Supported project-level model registry currently includes:
  - `qwen`
  - `flux`
  - `qwen-pro`
  - `nano-banana`
  - `nano-banana-2`
  - `flux-kontext`
  - `flux-edit`
  - `reve-fast`
- **Qwen mode (default):** No safety filter issues. Uses `fal-ai/qwen-image-2/text-to-image` for refs without guide images and `fal-ai/qwen-image-2/pro/edit` for reference-driven generation.
- **FLUX mode:** Replicate FLUX.2 Pro with `safety_tolerance: 5` and existing fallback paths.
- Advanced project-level models route through `generateAdvanced()`.
- PuLID (face/likeness) generation stays the same regardless of selected reference model.
- Use a shared server-side registry/helper for model metadata and ref limits so refs/frames/manual routes/automation stay in sync.
- Aspect ratio should be passed through all image-generation codepaths rather than hardcoding `16:9`.
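
A possible shape for the shared registry helper, as a sketch only. `resolveImageSettings` and `REF_LIMITS` are hypothetical names; the model list and defaults come from the bullets above, and only the Qwen edit cap of 3 refs is recorded because it is the only limit these notes document.

```javascript
// Hypothetical registry helper; names are assumptions.
const DEFAULT_MODEL = "qwen";
const SUPPORTED_MODELS = [
  "qwen", "flux", "qwen-pro", "nano-banana",
  "nano-banana-2", "flux-kontext", "flux-edit", "reve-fast",
];
// Only the Qwen edit cap is documented in these notes; other limits are unset.
const REF_LIMITS = { qwen: 3 };

// Resolve model, aspect ratio, and ref limit for a project row.
function resolveImageSettings(project, kind) {
  const requested =
    kind === "frame" ? project.frame_image_model : project.reference_image_model;
  const model = SUPPORTED_MODELS.includes(requested) ? requested : DEFAULT_MODEL;
  return {
    model,
    aspectRatio: project.aspect_ratio || "16:9",
    maxRefs: REF_LIMITS[model] ?? null, // null means "limit not recorded here"
  };
}
```

Routing refs, frames, manual routes, and automation through one function like this keeps the defaults and limits from drifting apart.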

### Reference Image Generation
- **Character prompts**: Include full appearance details from entity
- **Set prompts**: "Empty scene with no people, no characters, no figures — location only"
- **Prop prompts**: "Object only — no people, no characters, no hands, no figures"
- **Frame generation**: Collect refs via a shared ranked-reference helper, not ad-hoc arrays sprinkled across routes.
- Frame prompts should include explicit identity anchors, e.g. preserve the exact same character identity, costume, face/fur/hair/colors, and continue from the provided continuity frame when generating a last frame.
- If prompt rewriting is needed for safety, keep identity-control language separate from the rewriteable descriptive/action block so the model doesn't lose the "same character / same scene" anchors.
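
Keeping the identity anchors out of the rewriteable block might look like the sketch below. The anchor wording, `buildFramePrompt`, and the `rewrite` callback are assumptions; in the app the callback would presumably be `rewriteFlaggedPrompt()`.

```javascript
// Hypothetical sketch: anchor wording and the `rewrite` callback are assumptions.
const IDENTITY_ANCHOR =
  "Preserve the exact same character identity, costume, face, hair and colors as the reference images.";
const CONTINUITY_ANCHOR = "Continue from the provided continuity frame.";

// Only `description` goes through the safety rewrite; anchors are attached verbatim.
async function buildFramePrompt(description, { isLastFrame = false, rewrite = null } = {}) {
  const body = rewrite ? await rewrite(description) : description;
  const anchors = isLastFrame ? [IDENTITY_ANCHOR, CONTINUITY_ANCHOR] : [IDENTITY_ANCHOR];
  return [...anchors, body].join(" ");
}
```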

### iOS PWA Considerations
- Downloads: use `navigator.share()` with a blob, not `<a download>`
- Service worker: `skipWaiting: true` + `clientsClaim: true` for instant updates
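
A minimal sketch of the download path using the Web Share API. `shareOrDownload` and the `fallback` callback are hypothetical; `nav` is injectable so the logic can be exercised outside a browser.

```javascript
// Sketch for browser code. `nav` defaults to the global navigator; the
// `fallback` callback (for example an <a download> click) is an assumption.
async function shareOrDownload(file, fallback, nav = globalThis.navigator) {
  if (nav && typeof nav.canShare === "function" && nav.canShare({ files: [file] })) {
    await nav.share({ files: [file], title: file.name });
    return "shared";
  }
  fallback(file);
  return "downloaded";
}
```

In the browser the file would typically be built from the fetched blob, for example `new File([blob], "story.mp4", { type: "video/mp4" })`, and the call has to originate from a user gesture such as a button tap.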

### Lip Sync Step (Step 9)
- Only processes dialogue shots (`segment_type = 'dialogue'` with a `character_id`)
- Extracts per-shot audio from full narration via ffmpeg
- Uses Replicate predictions API with VERSION HASH (not model name!)
  - Kling: version `8311467f...` ($0.014/sec)
  - Sync2: version `3190ef7d...` ($0.05/sec)
- **BUG FIX:** Replicate /v1/predictions requires `version:` not `model:` field
- **Default model should be Sync2, not Kling.** Kling is still useful as an opt-in/secondary model, but raw generated WAN clips are often too small or too short for Kling.
- **Normalize video BEFORE lip sync.** Raw generated clips are commonly `848x480` / `864x480`, and Kling rejects height `< 512`. Export normalization happens too late.
- The normalization step should use the PROJECT aspect ratio dimensions, not a hardcoded landscape size:
  - landscape project → `1280x720`
  - portrait project → `720x1280`
- **Route by constraints before calling Kling.** If a dialogue shot is under ~2s or the normalized clip still doesn't meet Kling constraints, use Sync2 directly.
- **If Kling is explicitly used and fails** for size / invalid-input / temporary-service errors, automatically retry that shot with Sync2.
- Non-blocking: lip sync failures must not prevent export or cause shot loss. Export should continue using `video_url` when `lipsync_url` is missing.
- Export uses `COALESCE(lipsync_url, video_url)`
- YOLO step name: 'lipsync', icon: 👄
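
The routing and normalization rules above can be sketched as pure functions. All function names are hypothetical; the 512 px and ~2 s thresholds and the `1280x720` / `720x1280` sizes come from the bullets above, and the request body reflects the `version` requirement of Replicate's `/v1/predictions`.

```javascript
// Hypothetical sketch; thresholds come from the notes above, names are assumptions.
const KLING_MIN_HEIGHT = 512;
const KLING_MIN_SECONDS = 2;

// Pick normalization dimensions from the project aspect ratio.
function normalizedSize(aspectRatio) {
  return aspectRatio === "9:16"
    ? { width: 720, height: 1280 }
    : { width: 1280, height: 720 };
}

// Use Kling only when it was requested and the clip meets its constraints.
function chooseLipsyncModel({ requested = "sync2", durationSec, height }) {
  if (requested !== "kling") return "sync2";
  if (durationSec < KLING_MIN_SECONDS || height < KLING_MIN_HEIGHT) return "sync2";
  return "kling";
}

// The /v1/predictions body is keyed by `version`, not `model`.
function buildPredictionBody(versionHash, input) {
  return { version: versionHash, input };
}
```

If a Kling call still fails at runtime, the same shot can be retried with `"sync2"` before marking it failed.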

#### Critical production debugging lesson: “stuck on lipsync” may actually mean post-lipsync finalization is hung
When users report YOLO is stuck in the lip-sync phase, do NOT assume the external lipsync model is still running.

Observed production pattern:
1. PM2/out log shows:
   - `[Lipsync] Prediction created: <id>`
   - but never shows `[Lipsync] Saved to S3:` or `[Lipsync] Complete:`
2. DB still shows `lipsync_status='generating'` for that shot
3. The next dialogue shot stays `pending`
4. Direct check of the Replicate prediction shows `status: succeeded`

That means the bottleneck is likely **after Replicate succeeds**, inside app-side result handling.

Most likely hang point in current code:
- `server/lipsync.js` waits for prediction success, then calls `downloadAndUpload(outputUrl, key)`
- `server/storage.js` does a raw `fetch(remoteUrl)` + `await res.arrayBuffer()` with **no timeout**
- if the fetch/download/upload path stalls, the shot remains `generating` forever and the sequential lipsync queue never advances
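
One possible mitigation is to bound the download with an `AbortController`. This is a sketch, not the current `server/storage.js` code: `fetchBufferWithTimeout`, the 60 s default, and the injectable `fetchFn` are assumptions.

```javascript
// Hypothetical sketch: abort the download if it exceeds `ms` milliseconds.
async function fetchBufferWithTimeout(url, ms = 60000, fetchFn = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(
    () => controller.abort(new Error(`Timed out after ${ms}ms`)),
    ms
  );
  try {
    const res = await fetchFn(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`Download failed: HTTP ${res.status}`);
    // The signal also covers reading the body, not just the initial response.
    return Buffer.from(await res.arrayBuffer());
  } finally {
    clearTimeout(timer);
  }
}
```

On a timeout the caller would mark the shot as failed rather than leaving it `generating`, so the sequential queue can advance.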

Read-only investigation workflow that worked:
- inspect PM2 logs for the last `[Lipsync]` lines
- query `/api/projects/:id/lipsync-status` or DB shot rows to see `complete/generating/pending`
- if a prediction ID is present, query the Replicate prediction directly using the configured token
- compare Replicate status vs DB status:
  - Replicate `succeeded` + DB still `generating` => app-side post-processing/finalization hang
  - Replicate still `starting/processing` => model/provider slowness
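
The comparison rule above is small enough to encode directly, which makes it easy to reuse in a debug script. `diagnoseLipsyncStall` and its return labels are hypothetical.

```javascript
// Hypothetical helper: classify a stall from the two observed statuses.
function diagnoseLipsyncStall(replicateStatus, dbStatus) {
  if (replicateStatus === "succeeded" && dbStatus === "generating") {
    return "app-side finalization hang";
  }
  if (replicateStatus === "starting" || replicateStatus === "processing") {
    return "provider still running";
  }
  return "inconclusive";
}
```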

Useful cues:
- `lipsyncAll()` is sequential by design, so one stuck shot blocks all later dialogue shots
- project `yolo_step='lipsync'` can persist even after the provider finished if local finalization never completes
- local temp artifacts in `uploads/<project>/lipsync/` can prove audio extraction + video normalization already happened before the stall

### Frontend Navigation Resilience
- Server sets project.status = 'generating'/'analyzing' during long ops
- Frontend detects server status (not just local state) to show progress
- `const serverGenerating = project.status === 'generating'`
- `const generating = localGenerating || serverGenerating`
- Polls every 3s while server status indicates in-progress
- Process continues server-side regardless of frontend navigation
- Progress indicator at TOP of component (above project info)
- Message: "You can navigate away — generation continues in the background"

### Hermes automation API
- For Telegram/Hermes orchestration, add authenticated local endpoints:
  - `POST /api/hermes/projects/create-and-run`
  - `GET /api/hermes/projects/:id/status`
  - `GET /api/hermes/projects/:id/export`
- Auth pattern can be lightweight for local VPS use: `x-hermes-token` header validated against `VIDEO_STORY_HERMES_TOKEN`, then `VIDEO_STORY_PIN`, then app PIN fallback if needed.
- `create-and-run` should accept:
  - `prompt`
  - `duration_target`
  - `genre`
  - `style`
  - `aspect_ratio`
  - `reference_image_model`
  - `frame_image_model`
  - `auto_yolo`
- `status` should return:
  - project row
  - current yolo step
  - recent activity logs
  - export readiness + URLs
- This API is the cleanest bridge for Hermes skills/Telegram automation without browser-driving the app.
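
The token fallback order could be sketched as below. `expectedHermesToken` and `isHermesAuthorized` are hypothetical names, and `APP_PIN` is a placeholder for the app PIN fallback, whose real source these notes do not specify.

```javascript
// Hypothetical sketch: `APP_PIN` stands in for the unspecified app PIN fallback.
function expectedHermesToken(env) {
  return env.VIDEO_STORY_HERMES_TOKEN || env.VIDEO_STORY_PIN || env.APP_PIN || null;
}

// Node lowercases incoming header names, so `x-hermes-token` is looked up as-is.
function isHermesAuthorized(headers, env) {
  const expected = expectedHermesToken(env);
  const provided = headers["x-hermes-token"];
  return Boolean(expected) && typeof provided === "string" && provided === expected;
}
```

For anything beyond a local VPS, a constant-time comparison would be a safer choice than `===`.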

### Cost Estimation
- Progressive: rough estimates pre-story, exact post-shots
- All modes include 👄 Lip sync (~30% of shots estimated as dialogue)
- Post-shots mode uses exact estimateLipsyncCost() + subtracts generated assets
- Pricing: FLUX $0.05/img, WAN $0.07/video, TTS $0.03/1K chars, LLM $0.04/call, Lipsync $0.04/shot
- Qwen fallback is actually cheaper than FLUX ($0.035 vs $0.05)
- Qwen Pro Edit (with refs): $0.075/img
- LLM call estimate of $0.04 underestimates large calls (analyzeStory ~$0.12). Scene breakdowns are PER-SCENE (not flat).
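
A rough pre-story estimate using the unit prices listed above, as a sketch. `estimateCost` and its input shape are hypothetical, and it assumes one video per shot; the post-shots path would use the exact `estimateLipsyncCost()` instead of the 30% ratio.

```javascript
// Rough estimate from the listed unit prices; assumes one video per shot.
const PRICES = { image: 0.05, video: 0.07, ttsPer1kChars: 0.03, llmCall: 0.04, lipsyncShot: 0.04 };
const DIALOGUE_RATIO = 0.3; // ~30% of shots estimated as dialogue

function estimateCost({ images, shots, ttsChars, llmCalls }) {
  const lipsyncShots = Math.round(shots * DIALOGUE_RATIO);
  const total =
    images * PRICES.image +
    shots * PRICES.video +
    (ttsChars / 1000) * PRICES.ttsPer1kChars +
    llmCalls * PRICES.llmCall +
    lipsyncShots * PRICES.lipsyncShot;
  return Math.round(total * 100) / 100; // round to cents
}
```

For example, 20 images, 10 shots, 2,000 TTS characters, and 5 LLM calls come to 1.00 + 0.70 + 0.06 + 0.20 + 0.12 = $2.08.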

## Debugging Checklist
1. Progress bar stuck? Check for a field-name mismatch: the status API returns `yoloStep` (camelCase), while the DB column is `yolo_step`
2. NSFW flags? safety_tolerance=5 is max for Replicate. If still failing → OUTPUT filtering → fal.ai/Qwen fallback
3. Retry loop not executing? Check reference_image_prompt saved BEFORE generation
4. Lipsync failing with "version is required"? Use version hash, not model name
5. Process "stops" on navigation? Frontend losing state — check server status polling
6. Double regeneration? Background parallelMap still running — add DB freshness check
7. Stale UI? SW cache — close all tabs, reopen PWA

## File Locations
- Pipeline (YOLO logic): `server/index.js` (search "YOLO MODE")
- LLM + prompt rewriting: `server/llm.js` (`rewriteFlaggedPrompt()`)
- Images: `server/images.js` (`generateImageFlux`, `generateImageFluxFal`, `generateImageQwenFal`)
- Advanced images: `server/advanced-images.js`
- Lip sync: `server/lipsync.js` (version hashes in LIPSYNC_MODELS)
- Video: `server/video.js`
- Export: `server/export.js`
- Cost estimation: `server/index.js` (search "COST ESTIMATION")
- Frontend: `src/components/StoryPhase.jsx`, `AnalysisPhase.jsx`, `YoloStatus.jsx`