Video Story YOLO Pipeline

10-step automated video generation pipeline: Voices → Script → Audio → Scenes → Shots → Refs → Frames → Videos → Lip Sync → Export

Key Patterns

Progress Bar Feedback

API returns yoloStep (camelCase) in status endpoint
Frontend polls /api/projects/:id/status every 3s when running
Show live message feed (last 5 messages with timestamps)
Log verbose messages at each sub-step (e.g., "Scene 3/5: 'The Storm' — analyzing...")
Include percentage progress for long-running steps (frames, videos)
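The feed and polling behavior listed above could be sketched roughly as follows. This is a minimal sketch: `pushMessage`, `percentDone`, and `startPolling` are hypothetical helper names, and the message shape (`text`, `at`) is an assumption, not the app's confirmed structure.

```javascript
// Hypothetical helpers; names and the message shape are illustrative assumptions.
const POLL_INTERVAL_MS = 3000; // poll every 3s while the pipeline is running
const MAX_FEED_MESSAGES = 5;   // the live feed shows only the last 5 messages

// Append a timestamped message and keep only the newest entries.
function pushMessage(feed, text, now = new Date()) {
  const entry = { text, at: now.toISOString() };
  return [...feed, entry].slice(-MAX_FEED_MESSAGES);
}

// Percentage for long-running steps such as frames or videos.
function percentDone(done, total) {
  if (total <= 0) return 0;
  return Math.min(100, Math.round((done / total) * 100));
}

// Poll a status function on an interval; returns a function that stops polling.
// `fetchStatus` is expected to resolve to the status payload (which carries yoloStep).
function startPolling(fetchStatus, onUpdate, intervalMs = POLL_INTERVAL_MS) {
  const timer = setInterval(async () => {
    try {
      onUpdate(await fetchStatus());
    } catch (err) {
      console.error('status poll failed', err);
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Returning a stop function from `startPolling` makes it easy to cancel polling once the server no longer reports a running step.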

LLM Call Resilience

Per-scene retry (3x) for breakdown-shots endpoint
If one scene fails, log but continue to next scenes
Pipeline-level retry (3x) for entire steps
After retries, validate ALL scenes/shots exist before proceeding
If validation fails, stop with clear error listing what's missing
Scene breakdowns run 3 at a time (parallel batches via Promise.all)
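The per-scene retry, batching, and validation rules above might look roughly like this. It is a sketch under assumptions: `withRetry` and `breakdownAllScenes` are hypothetical names, `breakdownScene` stands in for the call to the breakdown-shots endpoint, and the pipeline-level retry around the whole step is not shown.

```javascript
// Retry an async function up to `attempts` times; rethrow the last error.
async function withRetry(fn, attempts = 3) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn(i);
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Process scenes in parallel batches of `batchSize`.
// A failed scene is logged and recorded, but later scenes still run.
async function breakdownAllScenes(scenes, breakdownScene, batchSize = 3) {
  const failed = [];
  for (let start = 0; start < scenes.length; start += batchSize) {
    const batch = scenes.slice(start, start + batchSize);
    await Promise.all(batch.map(async (scene) => {
      try {
        await withRetry(() => breakdownScene(scene), 3);
      } catch (err) {
        console.warn(`Scene "${scene.title}" failed after retries: ${err.message}`);
        failed.push(scene.title);
      }
    }));
  }
  // Validation gate: stop with a clear error listing what is missing.
  if (failed.length > 0) {
    throw new Error(`Missing shot breakdowns for: ${failed.join(', ')}`);
  }
}
```

Catching inside each `map` callback keeps one rejected scene from short-circuiting `Promise.all` for the rest of its batch.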

NSFW / Safety Filter Handling (CRITICAL LESSONS)

Safety tolerance settings:
- images.js generateImageFlux(): safety_tolerance: 5 (MUST be 5, the most permissive value for Replicate FLUX.2 Pro; range 1-5)
- advanced-images.js generateAdvanced(): default fallback of 5 for models with a safety_tolerance param
- fal.ai models: enable_safety_checker: false disables the output filter but NOT the input checker

The 3-tier provider cascade (learned through trial and error):
1. Replicate FLUX.2 Pro (safety_tolerance: 5): Even at max, OUTPUT images get flagged by the post-generation classifier. Mythological/dramatic characters (Zahhak, Ahriman) consistently trigger it regardless of prompt content.
2. fal.ai FLUX.2 Pro (enable_safety_checker: false): Disables the output filter, but has a separate INPUT prompt content checker that CANNOT be disabled. It blocks some names/themes at the prompt level.
3. Qwen: prefer EDIT mode with refs for frame fallbacks, not text-to-image. fal-ai/qwen-image-2/pro/edit preserves identity/style better than the text-only fallback. Use text-to-image only as the last resort when there are truly no usable refs.

Important production finding: Qwen Pro Edit currently errors at 4 refs in production with "Maximum 3 reference images allowed". Treat the practical cap as 3 refs, not 4, and trim refs by priority.

Why retries MUST use fal.ai/Qwen, NOT Replicate: Replicate's safety filter is on the OUTPUT. Rewriting the prompt doesn't help — the model generates an image and THEN it gets flagged. Retrying the same provider is pointless. Must switch providers.

Prompt rewriting with Haiku (rewriteFlaggedPrompt() in llm.js):
- Uses Claude 3.5 Haiku via OpenRouter (cheap ~$0.001, fast)
- Analyzes flagged prompts, rewrites preserving visual intent
- Still worth doing: cleaner prompts reduce both input AND output flagging
- Fallback: basic "Safe for work, family friendly, " prefix if Haiku unavailable
- Direct Anthropic API key may be invalid; OpenRouter path is primary

CRITICAL BUG (fixed): reference_image_prompt must be saved BEFORE generation, not only on success.
- Root cause: the prompt was only saved in the same UPDATE as the URL (on success). When generation failed, the prompt stayed NULL. The retry loop queried WHERE reference_image_prompt IS NOT NULL → found nothing → skipped.
- Fix: save the prompt BEFORE calling generateImageFlux(). Guard with WHERE reference_image_prompt IS NULL.
- The retry loop also rebuilds the prompt from entity data if it is somehow still NULL.
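The ordering that the fix depends on can be illustrated with a small sketch. Everything about the database layer here is an assumption: the `db.run(sql, params)` interface, the `entities` table name, and the `reference_image_url` column are placeholders for whatever the app actually uses.

```javascript
// Sketch only: `db.run`, the `entities` table, and `reference_image_url`
// are hypothetical stand-ins for the app's real database layer and schema.
async function generateReference(db, entity, prompt, generateImage) {
  // 1. Persist the prompt first, so a failed generation still leaves it
  //    behind for the retry loop. The IS NULL guard avoids overwriting
  //    a prompt that was already saved (for example a rewritten one).
  await db.run(
    'UPDATE entities SET reference_image_prompt = ? WHERE id = ? AND reference_image_prompt IS NULL',
    [prompt, entity.id]
  );

  // 2. Generate. If this throws, the prompt is already stored.
  const url = await generateImage(prompt);

  // 3. Only the URL is written on success.
  await db.run('UPDATE entities SET reference_image_url = ? WHERE id = ?', [url, entity.id]);
  return url;
}
```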

YOLO retry flow (step 6 refs and step 7 frames):
1. First pass: Replicate FLUX.2 Pro with safety_tolerance: 5 (15 parallel)
2. Poll for completion with stall detection (30s)
3. Find all entities/shots with NULL image URLs (no IS NOT NULL filter!)
4. Rewrite prompts with Haiku AI (currently sequential; should be parallelized)
5. Regenerate via fallback cascade with DB freshness checks:
   a. Check DB: skip if URL already set by background task
   b. fal.ai FLUX (generateImageFluxFal()) with safety checker off
   c. For entity refs, Qwen text-to-image is acceptable as last resort when identity refs don't exist
   d. For shot frames, prefer Qwen edit with refs, not Qwen text-only
6. Up to 2 retry cycles
7. Validation gate: throw with entity names + "edit in Analysis/Shots tab"
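Step 5 of the flow above (freshness check, then providers in order) could be sketched as follows. The function name `regenerateWithFallback`, the `{ name, generate }` provider shape, and the `getCurrentUrl` callback are all assumptions made for illustration; in the real code the providers would wrap functions such as generateImageFluxFal().

```javascript
// Sketch: try each fallback provider in order, but skip all work if a
// background task has already stored a URL for this item.
async function regenerateWithFallback(item, providers, getCurrentUrl) {
  const existing = await getCurrentUrl(item.id); // DB freshness check
  if (existing) return { url: existing, provider: 'existing' };

  const errors = [];
  for (const provider of providers) {
    try {
      const url = await provider.generate(item);
      return { url, provider: provider.name };
    } catch (err) {
      // Switch providers instead of retrying the one that just failed.
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed for "${item.name}" (${errors.join('; ')})`);
}
```

Collecting the per-provider errors gives the validation gate something specific to report when every tier fails.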

Critical frame-consistency lesson: frame retry code must use the SAME ranked reference bundle as the primary generation path. Do not use a weaker retry path.
- Build a shared helper that ranks refs in this order:
  1. continuity ref (for last frame, the first frame)
  2. visible character refs
  3. set ref
  4. prop refs
- Then trim by provider limits (for Qwen edit, use top 3 refs only).
- Never let a last-frame retry fall back to text-only generation if continuity/character refs exist.
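A shared ranking helper of the kind described above might look like this. The name `buildRankedRefs` and the input field names are hypothetical; the ordering and the cap of 3 for Qwen edit come from the notes above.

```javascript
// Sketch of a shared ranked-reference helper (names are illustrative).
// Order: continuity ref, visible character refs, set ref, prop refs.
function buildRankedRefs({ continuityRef, characterRefs = [], setRef, propRefs = [] }, maxRefs = 3) {
  const ranked = [continuityRef, ...characterRefs, setRef, ...propRefs]
    .filter(Boolean); // drop refs that do not exist
  const unique = [...new Set(ranked)]; // avoid sending the same URL twice
  return unique.slice(0, maxRefs); // trim to the provider limit
}

// Example: a last-frame request with two characters, a set, and a prop.
const refs = buildRankedRefs({
  continuityRef: 'first-frame.png',
  characterRefs: ['hero.png', 'villain.png'],
  setRef: 'castle.png',
  propRefs: ['sword.png'],
});
console.log(refs); // [ 'first-frame.png', 'hero.png', 'villain.png' ]
```

Calling the same helper from both the primary path and the retry path is what keeps the retry from quietly using a weaker bundle.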

Activity log messages during retry:

⚠️ 3 references failed (likely flagged as sensitive): Zahhak, Fereydun, Ahriman
🔄 Retry 1/2: Rewriting prompts with AI to avoid safety filters...
  ✏️ Rewrote prompt for "Zahhak"
🔄 Regenerating 3 reference images via fal.ai (safety checker disabled)...
  ⚡ fal.ai FLUX blocked "Zahhak", trying Qwen...
  ✓ Generated "Zahhak" via Qwen (fallback)

Upstream safety (reduce flags at prompt generation time):
- breakdownShots() system prompt includes safety filter compliance instructions
- analyzeStory() appearance field warns about safety filters
- buildCharacterPrompt/buildSetPrompt/buildPropPrompt prepend "Safe for all audiences."

Validation Gates (pipeline MUST NOT proceed with missing assets)

After refs (step 6→7): block if ANY character or set refs are NULL
After frames (step 7→8): block if ANY shots missing first_frame_url
Error names specific entities and tells user WHERE to fix them
Old behavior was "moving on" after stall — silently broke downstream steps
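The two gates above could be written as small assertion helpers. This is a sketch: the function names, the `type` and `name` fields, and the `reference_image_url` column are assumptions; `first_frame_url` is the column named above.

```javascript
// Gate after refs (step 6 -> 7): every character and set needs a reference image.
// `type`, `name`, and `reference_image_url` are assumed field names.
function assertRefsReady(entities) {
  const missing = entities
    .filter((e) => (e.type === 'character' || e.type === 'set') && !e.reference_image_url)
    .map((e) => e.name);
  if (missing.length > 0) {
    throw new Error(
      `Missing reference images for: ${missing.join(', ')}. Edit them in the Analysis tab.`
    );
  }
}

// Gate after frames (step 7 -> 8): every shot needs a first frame.
function assertFramesReady(shots) {
  const missing = shots.filter((s) => !s.first_frame_url).map((s) => s.name);
  if (missing.length > 0) {
    throw new Error(
      `Missing first frames for shots: ${missing.join(', ')}. Edit them in the Shots tab.`
    );
  }
}
```

Throwing here, rather than logging and moving on, is what prevents the silent downstream breakage described above.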

Image Generation Settings (per-project)

DB columns on projects:
aspect_ratio (TEXT, default 16:9)
reference_image_model (TEXT, default qwen)
frame_image_model (TEXT, default qwen)
legacy image_mode may still exist for backward compatibility, but new code should prefer the explicit reference/frame settings.
Home/create-project flow can set aspect_ratio up front (16:9 or 9:16).
Analysis tab should expose three separate controls:
Aspect ratio
Reference model
Frame model
Qwen is the default for BOTH references and frames.
Supported project-level model registry currently includes:
qwen
flux
qwen-pro
nano-banana
nano-banana-2
flux-kontext
flux-edit
reve-fast
Qwen mode (default): No safety filter issues. Uses fal-ai/qwen-image-2/text-to-image for refs without guide images and fal-ai/qwen-image-2/pro/edit for reference-driven generation.
FLUX mode: Replicate FLUX.2 Pro with safety_tolerance: 5 and existing fallback paths.
Advanced project-level models route through generateAdvanced().
PuLID (face/likeness) generation stays the same regardless of selected reference model.
Use a shared server-side registry/helper for model metadata and ref limits so refs/frames/manual routes/automation stay in sync.
Aspect ratio should be passed through all image-generation codepaths rather than hardcoding 16:9.
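One way to keep routes and automation in sync is a single resolver for the per-project settings. The sketch below uses the defaults and the model list given above; the name `resolveImageSettings` and the choice to fall back to the default on an unknown value (instead of rejecting it) are assumptions.

```javascript
// Defaults and supported values taken from the settings listed above.
const IMAGE_DEFAULTS = {
  aspect_ratio: '16:9',
  reference_image_model: 'qwen',
  frame_image_model: 'qwen',
};
const SUPPORTED_RATIOS = ['16:9', '9:16'];
const SUPPORTED_MODELS = [
  'qwen', 'flux', 'qwen-pro', 'nano-banana',
  'nano-banana-2', 'flux-kontext', 'flux-edit', 'reve-fast',
];

// Hypothetical helper: resolve a project row into validated settings,
// falling back to the defaults when a column is empty or unknown.
function resolveImageSettings(project) {
  const pick = (value, allowed, fallback) => (allowed.includes(value) ? value : fallback);
  return {
    aspectRatio: pick(project.aspect_ratio, SUPPORTED_RATIOS, IMAGE_DEFAULTS.aspect_ratio),
    referenceModel: pick(project.reference_image_model, SUPPORTED_MODELS, IMAGE_DEFAULTS.reference_image_model),
    frameModel: pick(project.frame_image_model, SUPPORTED_MODELS, IMAGE_DEFAULTS.frame_image_model),
  };
}
```

Every image-generation codepath can then take `aspectRatio` from this result instead of hardcoding 16:9.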

Reference Image Generation

Character prompts: Include full appearance details from entity
Set prompts: "Empty scene with no people, no characters, no figures — location only"
Prop prompts: "Object only — no people, no characters, no hands, no figures"
Frame generation: Collect refs via a shared ranked-reference helper, not ad-hoc arrays sprinkled across routes.
Frame prompts should include explicit identity anchors, e.g. preserve the exact same character identity, costume, face/fur/hair/colors, and continue from the provided continuity frame when generating a last frame.
If prompt rewriting is needed for safety, keep identity-control language separate from the rewriteable descriptive/action block so the model doesn't lose the "same character / same scene" anchors.
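Keeping the identity anchor out of the rewriteable block can be as simple as composing the prompt from two parts. In this sketch the anchor wording, `composeFramePrompt`, and `rewriteFramePrompt` are illustrative assumptions; `rewrite` stands in for a function such as rewriteFlaggedPrompt().

```javascript
// Illustrative anchor text; the real wording lives in the app's prompt builders.
const IDENTITY_ANCHOR =
  'Preserve the exact same character identity, costume, face, hair, and colors as in the reference images.';

// Compose the final prompt: fixed anchor first, descriptive block second.
function composeFramePrompt(description, anchor = IDENTITY_ANCHOR) {
  return `${anchor}\n\n${description.trim()}`;
}

// Rewrite only the descriptive/action block, then re-attach the untouched anchor.
async function rewriteFramePrompt(description, rewrite, anchor = IDENTITY_ANCHOR) {
  const safeDescription = await rewrite(description);
  return composeFramePrompt(safeDescription, anchor);
}
```

Because the rewriter never sees the anchor, it cannot paraphrase away the "same character / same scene" instructions.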

iOS PWA Considerations

Downloads: use navigator.share() with blob, not <a download>
Service worker: skipWaiting: true + clientsClaim: true for instant updates

Lip Sync Step (Step 9)

Only processes dialogue shots (segment_type = 'dialogue' with character_id)
Extracts per-shot audio from full narration via ffmpeg
Uses Replicate predictions API with VERSION HASH (not model name!)
Kling: version 8311467f... ($0.014/sec)
Sync2: version 3190ef7d... ($0.05/sec)
BUG FIX: Replicate /v1/predictions requires version: not model: field
Default model should be Sync2, not Kling. Kling is still useful as an opt-in/secondary model, but raw generated WAN clips are often too small or too short for Kling.
Normalize video BEFORE lip sync. Raw generated clips are commonly 848x480 / 864x480, and Kling rejects height < 512. Export normalization happens too late.
The normalization step should use the PROJECT aspect ratio dimensions, not a hardcoded landscape size:
landscape project → 1280x720
portrait project → 720x1280
Route by constraints before calling Kling. If a dialogue shot is under ~2s or the normalized clip still doesn't meet Kling constraints, use Sync2 directly.
If Kling is explicitly used and fails for size / invalid-input / temporary-service errors, automatically retry that shot with Sync2.
Non-blocking: lip sync failures must not prevent export or cause shot loss. Export should continue using video_url when lipsync_url is missing.
Export uses COALESCE(lipsync_url, video_url)
YOLO step name: 'lipsync', icon: 👄
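The routing and normalization rules above could be sketched as pure helpers. The function names are hypothetical, the 2s threshold is the approximate figure from the notes, and the version hash is left as a parameter because the hashes above are truncated. The automatic Sync2 retry after a Kling failure is not shown.

```javascript
const KLING_MIN_HEIGHT = 512;   // Kling rejects height < 512
const KLING_MIN_DURATION_S = 2; // approximate threshold from the notes above

// Normalization target derived from the PROJECT aspect ratio.
function normalizedSize(aspectRatio) {
  return aspectRatio === '9:16'
    ? { width: 720, height: 1280 }
    : { width: 1280, height: 720 };
}

// Sync2 is the default; Kling is used only when requested and the clip qualifies.
function pickLipsyncModel({ requested = 'sync2', durationSec, height }) {
  if (requested !== 'kling') return 'sync2';
  const fitsKling = durationSec >= KLING_MIN_DURATION_S && height >= KLING_MIN_HEIGHT;
  return fitsKling ? 'kling' : 'sync2';
}

// Replicate's /v1/predictions body takes `version` (a hash), not `model`.
function buildPredictionBody(versionHash, input) {
  return { version: versionHash, input };
}
```

Deciding the model before the API call avoids spending a request on a clip that Kling would reject anyway.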

Critical production debugging lesson: “stuck on lipsync” may actually mean post-lipsync finalization is hung

When users report YOLO is stuck in the lip-sync phase, do NOT assume the external lipsync model is still running.

Observed production pattern:
1. PM2/out log shows:
   - [Lipsync] Prediction created: <id>
   - but never shows [Lipsync] Saved to S3: or [Lipsync] Complete:
2. DB still shows lipsync_status='generating' for that shot
3. The next dialogue shot stays pending
4. Direct check of the Replicate prediction shows status: succeeded

That means the bottleneck is likely after Replicate succeeds, inside app-side result handling.

Most likely hang point in current code:
- server/lipsync.js waits for prediction success, then calls downloadAndUpload(outputUrl, key)
- server/storage.js does a raw fetch(remoteUrl) + await res.arrayBuffer() with no timeout
- if the fetch/download/upload path stalls, the shot remains generating forever and the sequential lipsync queue never advances
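A bounded download is one possible mitigation for the hang described above. This is a sketch, not the current storage.js code: `downloadWithTimeout` is a hypothetical name, and the injectable `fetchImpl` parameter exists only to make the helper testable. It assumes a runtime with a global `fetch` and `AbortController` (Node 18+).

```javascript
// Download a remote file, aborting if the whole operation exceeds `timeoutMs`.
// The timer stays armed until the body has been fully read, so a stalled
// body read is aborted as well as a stalled connection.
async function downloadWithTimeout(url, timeoutMs, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`Download failed: HTTP ${res.status}`);
    return Buffer.from(await res.arrayBuffer());
  } finally {
    clearTimeout(timer);
  }
}
```

With a timeout in place, a stalled download rejects, so the shot can be marked failed and the sequential queue can move on instead of staying in generating forever.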

Read-only investigation workflow that worked:
- inspect PM2 logs for the last [Lipsync] lines
- query /api/projects/:id/lipsync-status or DB shot rows to see complete/generating/pending
- if a prediction ID is present, query the Replicate prediction directly using the configured token
- compare Replicate status vs DB status:
  - Replicate succeeded + DB still generating => app-side post-processing/finalization hang
  - Replicate still starting/processing => model/provider slowness
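The final comparison in that workflow can be captured as a tiny decision helper. `diagnoseLipsyncStall` and its return labels are hypothetical; the Replicate status strings are the ones its predictions API reports.

```javascript
// Compare the provider's prediction status with the shot's DB status.
function diagnoseLipsyncStall(replicateStatus, dbStatus) {
  if (dbStatus !== 'generating') return 'not-stalled';
  if (replicateStatus === 'succeeded') return 'app-side-finalization-hang';
  if (replicateStatus === 'starting' || replicateStatus === 'processing') {
    return 'provider-slow';
  }
  // failed, canceled, or anything unexpected while the DB still says generating
  return 'provider-failed-or-unknown';
}

console.log(diagnoseLipsyncStall('succeeded', 'generating')); // app-side-finalization-hang
```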

Useful cues:
- lipsyncAll() is sequential by design, so one stuck shot blocks all later dialogue shots
- project yolo_step='lipsync' can persist even after the provider finished if local finalization never completes
- local temp artifacts in uploads/<project>/lipsync/ can prove audio extraction + video normalization already happened before the stall

Server sets project.status = 'generating'/'analyzing' during long ops
Frontend detects server status (not just local state) to show progress
const serverGenerating = project.status === 'generating'
const generating = localGenerating || serverGenerating
Polls every 3s while server status indicates in-progress
Process continues server-side regardless of frontend navigation
Progress indicator at TOP of component (above project info)
Message: "You can navigate away — generation continues in the background"

Hermes automation API

For Telegram/Hermes orchestration, add authenticated local endpoints:
POST /api/hermes/projects/create-and-run
GET /api/hermes/projects/:id/status
GET /api/hermes/projects/:id/export
Auth pattern can be lightweight for local VPS use: x-hermes-token header validated against VIDEO_STORY_HERMES_TOKEN, then VIDEO_STORY_PIN, then app PIN fallback if needed.
create-and-run should accept:
prompt
duration_target
genre
style
aspect_ratio
reference_image_model
frame_image_model
auto_yolo
status should return:
project row
current yolo step
recent activity logs
export readiness + URLs
This API is the cleanest bridge for Hermes skills/Telegram automation without browser-driving the app.
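The token check described above might be sketched like this. `expectedHermesToken` and `isHermesAuthorized` are hypothetical names, and `APP_PIN` is a placeholder for the app PIN fallback, whose real variable name is not given here. Header keys are assumed to be lowercased, as Node's HTTP layer does.

```javascript
// Resolve the expected token in priority order.
// APP_PIN is a hypothetical name for the app PIN fallback.
function expectedHermesToken(env) {
  return env.VIDEO_STORY_HERMES_TOKEN || env.VIDEO_STORY_PIN || env.APP_PIN || null;
}

// Validate the x-hermes-token header against the resolved token.
function isHermesAuthorized(headers, env) {
  const expected = expectedHermesToken(env);
  const provided = headers['x-hermes-token'];
  return Boolean(expected) && typeof provided === 'string' && provided === expected;
}
```

Rejecting when no token is configured at all avoids accidentally leaving the endpoints open. A plain string comparison is used for brevity; a constant-time comparison would be a reasonable hardening step if the endpoints were ever exposed beyond a local VPS.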

Cost Estimation

Progressive: rough estimates pre-story, exact post-shots
All modes include 👄 Lip sync (~30% of shots estimated as dialogue)
Post-shots mode uses exact estimateLipsyncCost() + subtracts generated assets
Pricing: FLUX $0.05/img, WAN $0.07/video, TTS $0.03/1K chars, LLM $0.04/call, Lipsync $0.04/shot
Qwen fallback is actually cheaper than FLUX ($0.035 vs $0.05)
Qwen Pro Edit (with refs): $0.075/img
LLM call estimate of $0.04 underestimates large calls (analyzeStory ~$0.12). Scene breakdowns are PER-SCENE (not flat).
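As a worked example of the rough pre-shots estimate, the per-unit prices above can be combined as below. `roughEstimate` is a hypothetical helper and does not reproduce the app's estimateLipsyncCost(); it applies the flat $0.04 LLM figure, which the note above says underestimates large calls.

```javascript
// Per-unit prices from the pricing list above (USD).
const PRICES = {
  fluxImage: 0.05,
  wanVideo: 0.07,
  ttsPer1kChars: 0.03,
  llmCall: 0.04,
  lipsyncShot: 0.04,
};
const DIALOGUE_SHARE = 0.3; // ~30% of shots estimated as dialogue

// Hypothetical rough estimate; totals are rounded to cents.
function roughEstimate({ images, videos, ttsChars, llmCalls, shots }) {
  const dialogueShots = Math.round(shots * DIALOGUE_SHARE);
  const total =
    images * PRICES.fluxImage +
    videos * PRICES.wanVideo +
    (ttsChars / 1000) * PRICES.ttsPer1kChars +
    llmCalls * PRICES.llmCall +
    dialogueShots * PRICES.lipsyncShot;
  return { dialogueShots, total: Math.round(total * 100) / 100 };
}

// 10 images, 10 videos, 2,000 TTS chars, 5 LLM calls, 10 shots:
// 0.50 + 0.70 + 0.06 + 0.20 + (3 x 0.04 = 0.12) = 1.58
console.log(roughEstimate({ images: 10, videos: 10, ttsChars: 2000, llmCalls: 5, shots: 10 }));
```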

Debugging Checklist

Progress bar stuck? Check yoloStep field name mismatch
NSFW flags? safety_tolerance=5 is max for Replicate. If still failing → OUTPUT filtering → fal.ai/Qwen fallback
Retry loop not executing? Check reference_image_prompt saved BEFORE generation
Lipsync failing with "version is required"? Use version hash, not model name
Process "stops" on navigation? Frontend losing state — check server status polling
Double regeneration? Background parallelMap still running — add DB freshness check
Stale UI? SW cache — close all tabs, reopen PWA

File Locations

Pipeline (YOLO logic): server/index.js (search "YOLO MODE")
LLM + prompt rewriting: server/llm.js (rewriteFlaggedPrompt())
Images: server/images.js (generateImageFlux, generateImageFluxFal, generateImageQwenFal)
Advanced images: server/advanced-images.js
Lip sync: server/lipsync.js (version hashes in LIPSYNC_MODELS)
Video: server/video.js
Export: server/export.js
Cost estimation: server/index.js (search "COST ESTIMATION")
Frontend: src/components/StoryPhase.jsx, AnalysisPhase.jsx, YoloStatus.jsx