---
name: ai-video-story-pipeline
description: "Architecture for Alex's AI video story pipeline (video-story app) — YOLO pipeline, reference image conventions, PWA patterns, and LLM integration."
tags: [video-story, pwa, replicate, flux, openrouter]
---

# AI Video Story Pipeline

## App Overview
- **URL**: video-story.apps.poofc.com
- **Path**: /home/avalon/apps/video-story
- **Stack**: React + Vite PWA (frontend), Express + SQLite (backend)
- **PM2**: video-story process

## LLM Configuration
- **Primary**: OpenRouter (routes to Anthropic, better rate limits)
- **Fallback**: Direct Anthropic API
- Never use Anthropic as primary — causes 429 rate limits during batch operations

## YOLO Pipeline (10 steps)
1. **Voices** — Assign voices to characters
2. **Script** — Generate script from story
3. **Audio** — Generate audio narration
4. **Scenes** — Break story into scenes
5. **Shots** — Break each scene into shots (per-scene retry, 3 attempts each)
6. **Refs** — Generate reference images for characters, sets, props
7. **Frames** — Generate frame images for each shot (uses FLUX.2)
8. **Videos** — Generate video clips from frames (uses WAN 2.2 via Replicate)
9. **Lip Sync** — Lip sync dialogue shots (Kling via Replicate, ~$0.014/sec)
10. **Export** — Assemble final video (prefers lip-synced clips when available)

## Critical Conventions

### Story Generation
- Story prompts MUST exclude character appearance details (hair, clothing, build, age, skin)
- Appearance is handled separately in the Analysis step's `appearance` field
- This prevents conflicts when user edits appearance in detail panel

### Reference Images
- **Set references**: Must exclude all people/characters — empty environment only
- **Prop references**: Must exclude all people/characters — object only
- **Character references**: Character portrait with appearance details; PuLID used when guide images exist
- Frame generation uses ALL reference types: character + set + prop images (via scene_props relationship)
- FLUX.2 supports up to 8 reference images per generation

### Guide Images (multi-upload system)
- `guide_images` table stores multiple guide photos per entity (entity_type, entity_id, image_url, sort_order)
- Legacy `guide_image_url` column on characters/sets/props stays in sync (first guide image)
- API: POST/GET `/:entityType/:id/guide-images`, DELETE `/guide-images/:guideId`
- Entity list endpoints return `guide_images` array attached to each entity via `attachGuideImages()` helper
- All generate endpoints (standard, advanced, YOLO bulk) query `getGuideImageUrls()` from guide_images table
- Characters with guides use PuLID (identity-preserving, 1 ref); sets/props pass all guides as FLUX refs
- Model maxRefs vary by provider reality, not preference: PuLID=1, Reve/Qwen/Kontext=1, Nano Banana=4, FLUX.2 Edit=9, Nano Banana 2/Pro=14. Qwen Pro Edit currently must be capped at 3 refs because Fal returned 422 “Maximum 3 reference images allowed”; do not raise it without live docs/testing.
- Guide images beyond model capacity are visually faded and not sent to API
- CRITICAL regeneration rule: advanced regeneration must use guide images + explicit extra refs only; do NOT silently fall back to the existing AI-generated `reference_image_url` unless the user explicitly requests an "edit existing reference" behavior

### AI Image Editing / Regeneration UX Parity
- Treat AI image generation/editing/regeneration as long-running jobs with persistent user-visible state, not as button-local spinners. If a user refreshes or reopens the PWA, active and recent jobs should rehydrate from SQLite/server status.
- Users must be able to run multiple image jobs in parallel. Do not globally disable all generate/edit buttons while one job runs; show a compact activity/log panel with all active jobs, statuses, errors, and completions.
- Every uploaded, generated, and edited image should automatically enter the image-analysis/enrichment path. If analysis cannot run, show a clear “Needs setup”/provider issue rather than silently omitting analysis or pretending a generic caption is enough.
- Editing/regenerating from an existing image must actually send the source/guide/reference image(s) to the provider. Before changing prompts, inspect the provider payload and model reference caps (`image_url` vs `image_urls`, PuLID single ref, Qwen caps, etc.).
- UI feedback should happen where the user is looking: on the placeholder/preview or active card, with logs collapsible nearby. Button placement should support the flow: prompt/model controls below the preview, not detached from the image being edited.
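
The reference caps above (active guides sent to the provider, overflow guides faded and withheld) can be sketched as a small pure function. This is a minimal illustration, not the app's actual code: the `MAX_REFS` keys and the `splitGuidesByCapacity` name are hypothetical, and only the numeric caps come from the Guide Images notes.

```javascript
// Reference caps as listed under Guide Images. The key names are
// hypothetical labels for this sketch, not the app's real model identifiers.
const MAX_REFS = {
  pulid: 1,
  reve: 1,
  qwen: 1,
  kontext: 1,
  'qwen-pro-edit': 3,
  'nano-banana': 4,
  'flux-2-edit': 9,
  'nano-banana-2': 14,
};

// Split an entity's guide images into the ones that fit the selected model
// ("active", sent to the provider) and the rest ("overflow", shown faded).
function splitGuidesByCapacity(guides, modelKey) {
  // Assumption: an unknown model falls back to the strictest cap of 1.
  const cap = MAX_REFS[modelKey] ?? 1;
  // Copy before sorting so the caller's array is not mutated.
  const sorted = [...guides].sort((a, b) => a.sort_order - b.sort_order);
  return {
    active: sorted.slice(0, cap),
    overflow: sorted.slice(cap),
  };
}

// Usage: four guides against the 3-reference Qwen Pro Edit cap.
const exampleGuides = [
  { image_url: '/g/c.png', sort_order: 2 },
  { image_url: '/g/a.png', sort_order: 0 },
  { image_url: '/g/b.png', sort_order: 1 },
  { image_url: '/g/d.png', sort_order: 3 },
];
const exampleSplit = splitGuidesByCapacity(exampleGuides, 'qwen-pro-edit');
console.log(exampleSplit.active.map((g) => g.image_url));
console.log(exampleSplit.overflow.map((g) => g.image_url));
```

Keeping the split in one helper means the detail panel and the generate endpoints could share the same notion of which guides are actually sent, which is the payload check the last section asks for before touching prompts.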

### Analysis Phase UI (single-column cards)
- Single-column layout (not grid) — gives each entity a full-width hero image
- Each card: hero reference image, entity type pill, status badge (Reference/Guides added/No images)
- Guide image thumbnail strip with inline + button directly on cards (no drawer needed to upload)
- Footer action: "Set Up & Generate" or "Edit & Regenerate" opens detail panel
- `fileInputRefs` use a ref object keyed by `${type}_${id}` for per-card file inputs

### Detail Panel (layout order — no accordion/advanced section)
1. **Guide Images** (top, most prominent) — multi-upload grid with always-visible delete badges, big empty-state CTA
   - Shows active vs overflow guides based on selected model's maxRefs
   - Overflow guides: faded, greyscale, "unused" overlay
   - Warning message with suggestion to switch models for more refs
2. **Current Reference Image** — display only (if exists), with both "upload your own" and **clear current reference** actions
   - Clearing the current reference should set `reference_image_url = NULL` without deleting guide images
   - Endpoint pattern: `DELETE /api/:entityType/:id/reference-image`
3. **Detail Fields** — name, description, appearance, personality etc.
4. **Model + Generate** (bottom of form) — model dropdown + single generate button
   - Model selector label: always "Default (FLUX.2 Pro)" — never expose PuLID to user (implementation detail)
   - Generate button shows: model name, price, guide count
   - NO separate prompt textarea in the default detail flow — prompt built server-side from entity structured fields
   - `generate-advanced` endpoint builds prompt via `buildCharacterPrompt/buildSetPrompt/buildPropPrompt` when empty prompt sent
   - IMPORTANT: disable generate while entity save is in flight (`saving`) so users cannot save and immediately regenerate against stale DB state

### UX Principle: No Duplicate Controls
- NEVER put duplicate prompts, guide images, or settings inside an "advanced" accordion
- "Advanced" should only ADD controls (e.g. model selector) on top of existing UI
- Alex explicitly rejected the pattern of an accordion that duplicated the prompt and guide images
- The model selector is just a dropdown near the generate button, not a separate section

## Landing Page / Account Panel
- Home page now has a top-right `⚙️ Account` button above the `Video Story` title
- The account view is a right-side drawer, following the same `fixed inset-0 z-50 flex justify-end` pattern as ActivityPanel / DetailPanel
- Account data comes from `GET /api/account/overview`
- `server/account-status.js` is the source of truth for:
  - provider credit/account summaries
  - grouped model inventory by platform
- This endpoint should report reality, not aspirational balances. If a provider does not expose credit balance with the current key/API, surface that clearly in `note`

## Provider Billing Reality
- **fal.ai** billing balance is available via `GET https://api.fal.ai/v1/account/billing?expand=credits`
  - BUT fal.ai only allows this for **ADMIN keys**. Standard generation keys return 403 with an insufficient-permissions message
  - Therefore, if the app uses a non-admin `FAL_KEY`, the UI must show balance unavailable and explain that billing reads require an ADMIN key
- **Replicate** `GET https://api.replicate.com/v1/account` returns account identity (e.g. username) but does **not** expose remaining credit balance in the currently used API flow
  - Therefore, the app should show Replicate account identity if available, but mark credit balance unavailable instead of faking a number
- `server/account-status.js` imports `dotenv/config` directly so standalone scripts/tests loading it still see `.env`

## PWA Patterns
- Service Worker: skipWaiting + clientsClaim for immediate updates on deploy
- registerType: 'autoUpdate' in Vite PWA config
- iOS downloads: Use navigator.share() with blob — `<a download>` fails on iOS PWA
- Desktop downloads: Blob URL + programmatic click
- Server has /api/projects/:id/download-video endpoint with Content-Disposition: attachment
- Close button safe area: Use env(safe-area-inset-top) for phone status bar clearance
- Min tap targets: 44x44px for mobile

## YOLO Progress & Live Updates
- Status endpoint returns `yoloStep` (camelCase), recent logs array
- Frontend polls every 3s during active runs
- Animated spinner + live activity feed with timestamps
- Verbose server-side logging for each sub-step (per-scene, per-frame progress)
- **refreshKey pattern**: ProjectPage increments `refreshKey` every 5s during YOLO. ALL phase components (Analysis, Timeline, Images, Video, Export) must accept `refreshKey` prop and include it in their `useEffect` fetch dependency array: `useEffect(() => { fetchData() }, [project.id, refreshKey])`
- Bug history: originally only AnalysisPhase had refreshKey — Timeline/Images/Video/Export were stale during YOLO until user switched tabs. Fixed by passing refreshKey to every phase.

## Audio-Video Sync (FIXED)
- Audio is the master timeline — shots derive durations from script_segments, not LLM
- `breakdownShots()` LLM decides cinematography; code computes `duration_ms` from segment timestamps
- Shot durations use "full slot" timing: from segment's start_time_ms to next segment's start_time_ms (includes 150ms silence gaps)
- Durations rounded to 0.5s for Replicate (WAN 2.2 accepts floats 0.5–10)
- Export trims each clip to exact target duration with ffmpeg `-t` flag
- Final merge uses audio duration as target: `-t ${audDuration}` (not -shortest)
- `actual_video_duration_ms` stored after generation for drift monitoring
- Venice AI rejected as provider: only supports integer-second enums, pipeline needs fractional durations

## Advanced Image Models (fal.ai)
- Most models use `image_urls` (array) for reference images, including Qwen standard Edit; kontext/reve are the exceptions (see the param mapping below)
- Bug history: Qwen Edit was sending `image_url` (single string) which caused 422. ALL Qwen variants need array.
- Model param mapping in `advanced-images.js`: kontext/reve use `image_url` (single), everything else uses `image_urls` (array)
- When no prompt provided to `generate-advanced`, server auto-builds from entity fields using same functions as default generate
- IMPORTANT nuance: if the default detail-panel flow sends an empty prompt for advanced generation, project style is still injected indirectly because the server rebuilds via `buildCharacterPrompt/buildSetPrompt/buildPropPrompt`
- If you want truly manual/no-style-injection advanced prompting, the client must send an explicit full prompt instead of `prompt: ''`

## Lip Sync (dialogue shots)
- **Module**: `server/lipsync.js`
- **Model**: Kling Lip Sync via Replicate (`kwaivgi/kling-lip-sync`) at $0.014/sec
- **Fallback**: Sync Lipsync 2 (`sync/lipsync-2`) at $0.05/sec
- **How it works**: Post-processing step — takes existing video clip + extracted audio segment, produces lip-synced version
- **Pipeline integration**: Step 9 in YOLO (between Videos and Export). Non-blocking — errors don't stop export.
- **Audio extraction**: `ffmpeg -ss {start} -t {duration}` from full narration to get per-shot dialogue audio
- **DB**: `lipsync_url` and `lipsync_status` columns on shots table
- **Export**: `COALESCE(lipsync_url, video_url)` — prefers lip-synced version when available
- **Only dialogue shots**: Identified by `segment_type = 'dialogue'` on the shot's linked script_segment
- **API endpoints**:
  - `GET /api/projects/:id/lipsync-status` — status + cost estimate
  - `POST /api/projects/:id/lipsync-all` — batch process (runs in background, logs progress)
  - `POST /api/shots/:id/lipsync` — single shot (synchronous)
- **Kling constraints**: video 2-10 sec, 720p-1080p — matches our shot lengths perfectly
- **Cost**: ~$0.31 for 7 dialogue shots (22 sec total). Very cheap addition to pipeline.

## Pitfalls
- Video generation queue is in-memory but shot status is SQLite-backed. After process restart, stale `video_status='generating'` rows can survive with no active job. `getQueueStatus(projectId)` should reset those rows to `pending` when `activeJobs===0` and `queue.length===0`, and `/video-status` should expose active/waiting/halted/maxConcurrent so the UI can explain what is actually running. See `references/video-queue-refresh-and-qwen-pro-ref-cap-2026-05-18.md`.
- PWA cache: After deploy, users may need to close+reopen PWA for SW update
- Safari vs PWA: Separate caches — Safari browser cache ≠ PWA SW cache
- Shot breakdown: Must retry per-scene (not fail entire step on one scene failure)
- YOLO must validate ALL scenes have shots before proceeding to refs
- OpenRouter 429s: Already handled with fallback, but batch operations need throttling
- Steps nav: horizontal scrollable top bar (not bottom tab bar) — user preference
- **Mobile touch**: hover-only UI (opacity-0 group-hover:opacity-100) doesn't work on touch devices. Delete buttons on guide images must be always-visible (red circle badge at -top-1 -right-1), not hover overlays.
- **No duplicate controls in "advanced" sections**: Alex explicitly rejected accordion patterns that duplicate prompt/guide-images inside an advanced panel. Advanced = just a model selector dropdown, reusing existing data.
- **Prompt staleness after entity edits**: if a user edits prompt-driving fields (appearance, set description, prop description, mood, etc.), previously auto-generated `reference_image_prompt` values can become stale. On save, invalidate auto-generated prompts (but preserve `[ADVANCED:...]` and `[CUSTOM_UPLOAD]` prompts) so later regeneration rebuilds from current structured fields.
- **Frame/reference reset UX matters**: when users clear a reference image or frame, only null the generated asset URL (`reference_image_url`, `first_frame_url`, `last_frame_url`). Do not delete guide images or prompts. Clearing is for resetting generation state, not deleting upstream inputs.

## Hermes automation API workflow
- Prefer a small authenticated local API bridge for Hermes-driven operation instead of browser-driving the app.
- Useful endpoints:
  - `POST /api/hermes/projects/create-and-run`
  - `GET /api/hermes/projects/:id/status`
  - `GET /api/hermes/projects/:id/export`
- The Hermes task is not complete when the pipeline finishes server-side; it is complete only after Hermes fetches the export and delivers the final video back to the user/chat.
- Good defaults that worked here:
  - `aspect_ratio`: `16:9` unless portrait requested
  - `reference_image_model`: `qwen`
  - `frame_image_model`: `qwen`
- `references/video-queue-refresh-and-qwen-pro-ref-cap-2026-05-18.md` — session detail for stale video `generating` rows after PM2 restart, `maxConcurrent` status reporting, and the live Fal Qwen Pro Edit 3-reference cap.
- Progress/state polling should drive ALL major tabs/components, not just analysis, or the UI goes stale during long runs.
- Save prompt fields needed for retries BEFORE generation starts; do not only persist them on success.
- Retry flow should switch providers intelligently for safety/moderation failures instead of hammering the same provider again.
- For frame retries, reuse the same ranked reference bundle as the primary path; do not degrade to weaker text-only retries if continuity refs exist.
- Lip-sync stalls may actually be post-provider finalization hangs (download/upload/final save), not the external model itself still running.
- Keep export non-blocking with `COALESCE(lipsync_url, video_url)` so one failed lipsync does not destroy the whole pipeline.
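
As a worked illustration of the "full slot" timing rule from the Audio-Video Sync section, the sketch below derives per-shot durations from segment timestamps. It rests on several assumptions that are not confirmed by the notes: one shot per segment, an `end_time_ms` field used only for the final segment, rounding to the nearest 0.5s, and clamping to the 0.5–10s range WAN 2.2 accepts. The function name `computeShotDurations` is hypothetical.

```javascript
// Sketch of "full slot" timing: each shot runs from its segment's
// start_time_ms to the next segment's start_time_ms, so the silence gap
// between segments is absorbed into the earlier shot.
// Assumptions (not from the source): one shot per segment, `end_time_ms`
// exists for the last segment, rounding is to the nearest 0.5s.
function computeShotDurations(segments, { stepMs = 500, minMs = 500, maxMs = 10000 } = {}) {
  return segments.map((seg, i) => {
    const next = segments[i + 1];
    // The last segment has no successor, so fall back to its own end time.
    const slotEndMs = next ? next.start_time_ms : seg.end_time_ms;
    const rawMs = slotEndMs - seg.start_time_ms;
    // Round to the nearest half second, then clamp to the provider's range.
    const roundedMs = Math.round(rawMs / stepMs) * stepMs;
    const durationMs = Math.min(maxMs, Math.max(minMs, roundedMs));
    return {
      segment_id: seg.id,
      duration_ms: durationMs,
      duration_s: durationMs / 1000,
    };
  });
}

// Usage: three segments separated by 150ms silence gaps.
const exampleSegments = [
  { id: 1, start_time_ms: 0, end_time_ms: 2350 },
  { id: 2, start_time_ms: 2500, end_time_ms: 5200 },
  { id: 3, start_time_ms: 5350, end_time_ms: 6100 },
];
console.log(computeShotDurations(exampleSegments).map((d) => d.duration_s));
// → [ 2.5, 3, 1 ]
```

Because rounding introduces small per-shot drift, the export-side trim (`-t` per clip, audio duration as the final merge target) remains what actually keeps audio and video aligned; this helper only chooses the duration requested from the video model.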