---
name: video-story-yolo-pipeline
category: software-development
description: Video Story YOLO 10-step pipeline patterns, resilience strategies, and iOS PWA considerations
---

# Video Story YOLO Pipeline

10-step automated video generation pipeline: Voices → Script → Audio → Scenes → Shots → Refs → Frames → Videos → Lip Sync → Export

## Key Patterns

### Progress Bar Feedback

- API returns `yoloStep` (camelCase) in status endpoint
- Frontend polls `/api/projects/:id/status` every 3s when running
- Show live message feed (last 5 messages with timestamps)
- Log verbose messages at each sub-step (e.g., "Scene 3/5: 'The Storm' — analyzing...")
- Include percentage progress for long-running steps (frames, videos)

### LLM Call Resilience

- Per-scene retry (3x) for breakdown-shots endpoint
- If one scene fails, log but continue to next scenes
- Pipeline-level retry (3x) for entire steps
- After retries, validate ALL scenes/shots exist before proceeding
- If validation fails, stop with clear error listing what's missing
- Scene breakdowns run 3 at a time (parallel batches via Promise.all)

### NSFW / Safety Filter Handling (CRITICAL LESSONS)

**Safety tolerance settings:**

- `images.js` generateImageFlux(): `safety_tolerance: 5` (MUST be 5, max permissive for Replicate FLUX.2 Pro, range 1-5)
- `advanced-images.js` generateAdvanced(): default fallback `5` for models with safety_tolerance param
- fal.ai models: `enable_safety_checker: false` disables output filter but NOT input checker

**The 3-tier provider cascade (learned through trial and error):**

1. **Replicate FLUX.2 Pro** (safety_tolerance: 5): Even at max, OUTPUT images get flagged by post-generation classifier. Mythological/dramatic characters (Zahhak, Ahriman) consistently trigger it regardless of prompt content.
2. **fal.ai FLUX.2 Pro** (`enable_safety_checker: false`): Disables output filter. But has separate INPUT prompt content checker that CANNOT be disabled. Blocks some names/themes at the prompt level.
3. **Qwen should prefer EDIT mode with refs for frame fallbacks, not text-to-image.** `fal-ai/qwen-image-2/pro/edit` preserves identity/style better than text-only fallback. Use text-to-image only as the last resort when there are truly no usable refs.

**Important production finding:** Qwen Pro Edit currently errors at 4 refs in production with `Maximum 3 reference images allowed`. Treat the practical cap as **3 refs**, not 4, and trim refs by priority.

**Why retries MUST use fal.ai/Qwen, NOT Replicate:** Replicate's safety filter is on the OUTPUT. Rewriting the prompt doesn't help — the model generates an image and THEN it gets flagged. Retrying the same provider is pointless. Must switch providers.

**Prompt rewriting with Haiku (`rewriteFlaggedPrompt()` in `llm.js`):**

- Uses Claude 3.5 Haiku via OpenRouter (cheap ~$0.001, fast)
- Analyzes flagged prompts, rewrites preserving visual intent
- Still worth doing — cleaner prompts reduce both input AND output flagging
- Fallback: basic "Safe for work, family friendly, " prefix if Haiku unavailable
- Direct Anthropic API key may be invalid — OpenRouter path is primary

**CRITICAL BUG (fixed): reference_image_prompt must be saved BEFORE generation, not only on success.**

- Root cause: prompt only saved in same UPDATE as URL (on success). When gen failed, prompt stayed NULL. Retry loop queried `WHERE reference_image_prompt IS NOT NULL` → found nothing → skipped.
- Fix: save prompt BEFORE calling generateImageFlux(). Guard with `WHERE reference_image_prompt IS NULL`.
- Retry loop also rebuilds prompt from entity data if somehow still NULL.

**YOLO retry flow (step 6 refs and step 7 frames):**

1. First pass: Replicate FLUX.2 Pro with safety_tolerance: 5 (15 parallel)
2. Poll for completion with stall detection (30s)
3. Find all entities/shots with NULL image URLs (no IS NOT NULL filter!)
4. Rewrite prompts with Haiku AI (currently sequential — should be parallelized)
5. Regenerate via fallback cascade with DB freshness checks:
   a. Check DB — skip if URL already set by background task
   b. fal.ai FLUX (`generateImageFluxFal()`) with safety checker off
   c. For **entity refs**, Qwen text-to-image is acceptable as last resort when identity refs don't exist.
   d. For **shot frames**, prefer **Qwen edit with refs**, not Qwen text-only.
6. Up to 2 retry cycles
7. Validation gate: throw with entity names + "edit in Analysis/Shots tab"

**Critical frame-consistency lesson:** frame retry code must use the SAME ranked reference bundle as the primary generation path. Do not use a weaker retry path.

- Build a shared helper that ranks refs in this order:
  1. continuity ref (for last frame, the first frame)
  2. visible character refs
  3. set ref
  4. prop refs
- Then trim by provider limits (for Qwen edit, use top 3 refs only).
- Never let a last-frame retry fall back to text-only generation if continuity/character refs exist.

**Activity log messages during retry:**

```
⚠️ 3 references failed (likely flagged as sensitive): Zahhak, Fereydun, Ahriman
🔄 Retry 1/2: Rewriting prompts with AI to avoid safety filters...
✏️ Rewrote prompt for "Zahhak"
🔄 Regenerating 3 reference images via fal.ai (safety checker disabled)...
⚡ fal.ai FLUX blocked "Zahhak", trying Qwen...
✓ Generated "Zahhak" via Qwen (fallback)
```

**Upstream safety — reduce flags at prompt generation time:**

- breakdownShots() system prompt includes safety filter compliance instructions
- analyzeStory() appearance field warns about safety filters
- buildCharacterPrompt/buildSetPrompt/buildPropPrompt prepend "Safe for all audiences."
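
The ranked reference bundle from the frame-consistency lesson above could be sketched roughly as follows. This is a minimal illustration, not the project's actual helper: the function names `buildRankedRefs` and `chooseQwenEndpoint`, the input field names, and the `REF_LIMITS` table are all hypothetical. Only the priority order, the 3-ref cap for Qwen edit, and the two fal.ai endpoint IDs come from these notes.

```javascript
// Hypothetical limits table. The only value taken from the notes is the
// practical cap of 3 refs for Qwen Pro Edit.
const REF_LIMITS = { "qwen-edit": 3 };

// Rank refs as: continuity, visible characters, set, props.
// Field names here are assumptions made for illustration.
function buildRankedRefs({ continuityRef, characterRefs = [], setRef, propRefs = [] }, provider) {
  const ranked = [continuityRef, ...characterRefs, setRef, ...propRefs].filter(Boolean);
  // Drop duplicate URLs, keeping the highest-priority occurrence.
  const unique = [...new Set(ranked)];
  // Providers without a known cap keep every ref.
  const limit = REF_LIMITS[provider] ?? unique.length;
  return unique.slice(0, limit);
}

// Text-to-image is only chosen when no usable refs remain.
function chooseQwenEndpoint(refs) {
  return refs.length > 0
    ? "fal-ai/qwen-image-2/pro/edit"
    : "fal-ai/qwen-image-2/text-to-image";
}
```

If the primary path and the retry path both call the same helper, a last-frame retry cannot silently end up with a weaker bundle than the first attempt had.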

### Validation Gates (pipeline MUST NOT proceed with missing assets)

- After refs (step 6→7): block if ANY character or set refs are NULL
- After frames (step 7→8): block if ANY shots missing first_frame_url
- Error names specific entities and tells user WHERE to fix them
- Old behavior was "moving on" after stall — silently broke downstream steps

### Image Generation Settings (per-project)

- DB columns on `projects`:
  - `aspect_ratio` (TEXT, default `16:9`)
  - `reference_image_model` (TEXT, default `qwen`)
  - `frame_image_model` (TEXT, default `qwen`)
  - legacy `image_mode` may still exist for backward compatibility, but new code should prefer the explicit reference/frame settings.
- Home/create-project flow can set `aspect_ratio` up front (`16:9` or `9:16`).
- Analysis tab should expose three separate controls:
  - Aspect ratio
  - Reference model
  - Frame model
- **Qwen is the default for BOTH references and frames.**
- Supported project-level model registry currently includes:
  - `qwen`
  - `flux`
  - `qwen-pro`
  - `nano-banana`
  - `nano-banana-2`
  - `flux-kontext`
  - `flux-edit`
  - `reve-fast`
- **Qwen mode (default):** No safety filter issues. Uses `fal-ai/qwen-image-2/text-to-image` for refs without guide images and `fal-ai/qwen-image-2/pro/edit` for reference-driven generation.
- **FLUX mode:** Replicate FLUX.2 Pro with `safety_tolerance: 5` and existing fallback paths.
- Advanced project-level models route through `generateAdvanced()`.
- PuLID (face/likeness) generation stays the same regardless of selected reference model.
- Use a shared server-side registry/helper for model metadata and ref limits so refs/frames/manual routes/automation stay in sync.
- Aspect ratio should be passed through all image-generation codepaths rather than hardcoding `16:9`.
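
One way to keep these per-project settings in sync across routes is a single resolver that every codepath calls. The sketch below is an assumption-heavy illustration: `resolveImageSettings`, `routeFor`, and the returned object shape are hypothetical, and it assumes that every model other than `qwen` and `flux` counts as "advanced" and that unknown values fall back to the documented defaults. The column names, defaults, and model keys are taken from the list above.

```javascript
// Model keys and defaults come from the notes; everything else is hypothetical.
const IMAGE_MODEL_KEYS = [
  "qwen", "flux", "qwen-pro", "nano-banana",
  "nano-banana-2", "flux-kontext", "flux-edit", "reve-fast",
];
const ASPECT_RATIOS = ["16:9", "9:16"];
const DEFAULTS = {
  aspect_ratio: "16:9",
  reference_image_model: "qwen",
  frame_image_model: "qwen",
};

// Assumption: anything other than qwen/flux goes through generateAdvanced().
function routeFor(modelKey) {
  if (modelKey === "qwen") return "qwen";
  if (modelKey === "flux") return "flux";
  return "advanced";
}

// Normalize a `projects` row into settings that every image codepath can share.
function resolveImageSettings(project = {}) {
  const pick = (value, allowed, fallback) => (allowed.includes(value) ? value : fallback);
  const aspectRatio = pick(project.aspect_ratio, ASPECT_RATIOS, DEFAULTS.aspect_ratio);
  const referenceModel = pick(project.reference_image_model, IMAGE_MODEL_KEYS, DEFAULTS.reference_image_model);
  const frameModel = pick(project.frame_image_model, IMAGE_MODEL_KEYS, DEFAULTS.frame_image_model);
  return {
    aspectRatio,
    referenceModel,
    frameModel,
    referenceRoute: routeFor(referenceModel),
    frameRoute: routeFor(frameModel),
  };
}
```

Calling the resolver once per request and passing its result down avoids the hardcoded `16:9` problem, because no generator needs to guess the aspect ratio on its own.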

### Reference Image Generation

- **Character prompts**: Include full appearance details from entity
- **Set prompts**: "Empty scene with no people, no characters, no figures — location only"
- **Prop prompts**: "Object only — no people, no characters, no hands, no figures"
- **Frame generation**: Collect refs via a shared ranked-reference helper, not ad-hoc arrays sprinkled across routes.
- Frame prompts should include explicit identity anchors, e.g. preserve the exact same character identity, costume, face/fur/hair/colors, and continue from the provided continuity frame when generating a last frame.
- If prompt rewriting is needed for safety, keep identity-control language separate from the rewriteable descriptive/action block so the model doesn't lose the "same character / same scene" anchors.

### iOS PWA Considerations

- Downloads: use navigator.share() with blob, not ``
- Service worker: `skipWaiting: true` + `clientsClaim: true` for instant updates

### Lip Sync Step (Step 9)

- Only processes dialogue shots (segment_type = 'dialogue' with character_id)
- Extracts per-shot audio from full narration via ffmpeg
- Uses Replicate predictions API with VERSION HASH (not model name!)
  - Kling: version `8311467f...` ($0.014/sec)
  - Sync2: version `3190ef7d...` ($0.05/sec)
- **BUG FIX:** Replicate /v1/predictions requires `version:` not `model:` field
- **Default model should be Sync2, not Kling.** Kling is still useful as an opt-in/secondary model, but raw generated WAN clips are often too small or too short for Kling.
- **Normalize video BEFORE lip sync.** Raw generated clips are commonly `848x480` / `864x480`, and Kling rejects height `< 512`. Export normalization happens too late.
- The normalization step should use the PROJECT aspect ratio dimensions, not a hardcoded landscape size:
  - landscape project → `1280x720`
  - portrait project → `720x1280`
- **Route by constraints before calling Kling.** If a dialogue shot is under ~2s or the normalized clip still doesn't meet Kling constraints, use Sync2 directly.
- **If Kling is explicitly used and fails** for size / invalid-input / temporary-service errors, automatically retry that shot with Sync2.
- Non-blocking: lip sync failures must not prevent export or cause shot loss. Export should continue using `video_url` when `lipsync_url` is missing.
- Export uses COALESCE(lipsync_url, video_url)
- YOLO step name: 'lipsync', icon: 👄

#### Critical production debugging lesson: “stuck on lipsync” may actually mean post-lipsync finalization is hung

When users report YOLO is stuck in the lip-sync phase, do NOT assume the external lipsync model is still running.

Observed production pattern:

1. PM2/out log shows:
   - `[Lipsync] Prediction created: `
   - but never shows `[Lipsync] Saved to S3:` or `[Lipsync] Complete:`
2. DB still shows `lipsync_status='generating'` for that shot
3. The next dialogue shot stays `pending`
4. Direct check of the Replicate prediction shows `status: succeeded`

That means the bottleneck is likely **after Replicate succeeds**, inside app-side result handling.
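
The comparison behind this diagnosis can be written as a small pure function, which may be handy in a read-only debug script. This is only a sketch: the name `classifyLipsyncStall` and the returned labels are hypothetical. The DB values (`generating`, `pending`, `complete`) are the ones used in these notes, and `starting`, `processing`, and `succeeded` are Replicate prediction statuses.

```javascript
// Hypothetical helper: compare the shot's DB status with the Replicate
// prediction status to decide where a "stuck" lip sync is actually stuck.
function classifyLipsyncStall(dbStatus, replicateStatus) {
  // Only a shot still marked as generating can be the one blocking the queue.
  if (dbStatus !== "generating") return "not-stalled";
  // Provider finished but the app never finalized: look at download/upload code.
  if (replicateStatus === "succeeded") return "app-side-finalization-hang";
  // Provider is still working: slowness is on the model/provider side.
  if (replicateStatus === "starting" || replicateStatus === "processing") {
    return "provider-still-running";
  }
  // failed, canceled, or a missing prediction need manual inspection.
  return "provider-failed-or-unknown";
}
```

Keeping the check free of network and database calls means the statuses can be pasted in by hand from the logs and the Replicate dashboard.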

Most likely hang point in current code:

- `server/lipsync.js` waits for prediction success, then calls `downloadAndUpload(outputUrl, key)`
- `server/storage.js` does a raw `fetch(remoteUrl)` + `await res.arrayBuffer()` with **no timeout**
- if the fetch/download/upload path stalls, the shot remains `generating` forever and the sequential lipsync queue never advances

Read-only investigation workflow that worked:

- inspect PM2 logs for the last `[Lipsync]` lines
- query `/api/projects/:id/lipsync-status` or DB shot rows to see `complete/generating/pending`
- if a prediction ID is present, query the Replicate prediction directly using the configured token
- compare Replicate status vs DB status:
  - Replicate `succeeded` + DB still `generating` => app-side post-processing/finalization hang
  - Replicate still `starting/processing` => model/provider slowness

Useful cues:

- `lipsyncAll()` is sequential by design, so one stuck shot blocks all later dialogue shots
- project `yolo_step='lipsync'` can persist even after the provider finished if local finalization never completes
- local temp artifacts in `uploads//lipsync/` can prove audio extraction + video normalization already happened before the stall

### Frontend Navigation Resilience

- Server sets project.status = 'generating'/'analyzing' during long ops
- Frontend detects server status (not just local state) to show progress
- `const serverGenerating = project.status === 'generating'`
- `const generating = localGenerating || serverGenerating`
- Polls every 3s while server status indicates in-progress
- Process continues server-side regardless of frontend navigation
- Progress indicator at TOP of component (above project info)
- Message: "You can navigate away — generation continues in the background"

### Hermes automation API

- For Telegram/Hermes orchestration, add authenticated local endpoints:
  - `POST /api/hermes/projects/create-and-run`
  - `GET /api/hermes/projects/:id/status`
  - `GET /api/hermes/projects/:id/export`
- Auth pattern can be lightweight for local VPS use: `x-hermes-token` header validated against `VIDEO_STORY_HERMES_TOKEN`, then `VIDEO_STORY_PIN`, then app PIN fallback if needed.
- `create-and-run` should accept:
  - `prompt`
  - `duration_target`
  - `genre`
  - `style`
  - `aspect_ratio`
  - `reference_image_model`
  - `frame_image_model`
  - `auto_yolo`
- `status` should return:
  - project row
  - current yolo step
  - recent activity logs
  - export readiness + URLs
- This API is the cleanest bridge for Hermes skills/Telegram automation without browser-driving the app.

### Cost Estimation

- Progressive: rough estimates pre-story, exact post-shots
- All modes include 👄 Lip sync (~30% of shots estimated as dialogue)
- Post-shots mode uses exact estimateLipsyncCost() + subtracts generated assets
- Pricing: FLUX $0.05/img, WAN $0.07/video, TTS $0.03/1K chars, LLM $0.04/call, Lipsync $0.04/shot
- Qwen fallback is actually cheaper than FLUX ($0.035 vs $0.05)
- Qwen Pro Edit (with refs): $0.075/img
- LLM call estimate of $0.04 underestimates large calls (analyzeStory ~$0.12). Scene breakdowns are PER-SCENE (not flat).

## Debugging Checklist

1. Progress bar stuck? Check yoloStep field name mismatch
2. NSFW flags? safety_tolerance=5 is max for Replicate. If still failing → OUTPUT filtering → fal.ai/Qwen fallback
3. Retry loop not executing? Check reference_image_prompt saved BEFORE generation
4. Lipsync failing with "version is required"? Use version hash, not model name
5. Process "stops" on navigation? Frontend losing state — check server status polling
6. Double regeneration? Background parallelMap still running — add DB freshness check
7. Stale UI? SW cache — close all tabs, reopen PWA

## File Locations

- Pipeline (YOLO logic): `server/index.js` (search "YOLO MODE")
- LLM + prompt rewriting: `server/llm.js` (`rewriteFlaggedPrompt()`)
- Images: `server/images.js` (`generateImageFlux`, `generateImageFluxFal`, `generateImageQwenFal`)
- Advanced images: `server/advanced-images.js`
- Lip sync: `server/lipsync.js` (version hashes in LIPSYNC_MODELS)
- Video: `server/video.js`
- Export: `server/export.js`
- Cost estimation: `server/index.js` (search "COST ESTIMATION")
- Frontend: `src/components/StoryPhase.jsx`, `AnalysisPhase.jsx`, `YoloStatus.jsx`