fal-replicate-model-inventory

fal.ai + Replicate + Venice AI Model Inventory

Maintain a reusable inventory of image and video generation models across fal.ai, Replicate, and Venice AI.

Use this skill when:
- the user asks what models are available on fal.ai, Replicate, or Venice AI
- the user wants current API pricing for image/video models
- the user is choosing between providers for a new app
- the user wants to update a model picker/config like fal-studio
- you need a grounded comparison of image vs video models, edit models, FLF/first-last-frame models, or provider-specific endpoints

Grounding policy

Always separate findings into two buckets:
1. Live-verified now
2. Historical/session-recalled notes

Do not present session-recalled prices as freshly verified. Label them clearly.

Registry coverage gate

Before treating Hermes Model Intelligence or any shared registry as “latest,” verify provider-scoped coverage and freshness—not only the global sync timestamp. Confirm that each advertised provider has a real ingestion adapter, nonzero provider inventory, a recent completed sync, and visible sync errors/staleness. A fresh fal sync does not make Venice fresh; provider: venice -> 0 may mean an ingestion gap rather than no Venice models.
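
A minimal sketch of the provider-scoped gate described above. The `ProviderSyncStatus` record, its field names, and the 7-day staleness threshold are all hypothetical; they are not the registry's real schema and should be adapted to whatever the registry actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ProviderSyncStatus:
    """Hypothetical per-provider status record (field names are illustrative)."""
    provider: str
    has_adapter: bool
    model_count: int
    last_completed_sync: Optional[datetime]
    sync_errors: int = 0


def coverage_problems(status: ProviderSyncStatus, now: datetime,
                      max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Return reasons this provider's data should not be treated as 'latest'."""
    problems = []
    if not status.has_adapter:
        problems.append("no ingestion adapter")
    if status.model_count == 0:
        # Zero rows may be an ingestion gap, not an empty catalog.
        problems.append("zero inventory (possible ingestion gap)")
    if status.last_completed_sync is None:
        problems.append("no completed sync")
    elif now - status.last_completed_sync > max_age:
        problems.append("stale sync")
    if status.sync_errors:
        problems.append("sync errors present")
    return problems


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
fal = ProviderSyncStatus("fal", True, 900, now - timedelta(hours=2))
venice = ProviderSyncStatus("venice", False, 0, None)
print(coverage_problems(fal, now))
print(coverage_problems(venice, now))
```

Evaluating each provider separately keeps a fresh fal sync from masking a Venice ingestion gap.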

For Venice, reconcile the authenticated image/video catalogs with the full documentation export, family guides, and live quote API. Venice /models can omit documented and callable deployments, so never make an absence claim from that endpoint alone. Preserve model-level provenance, evidence strength, contradictions, disappearance/reappearance, and payment/operational metadata.
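
One way to keep model-level provenance while reconciling sources is sketched below. The source names (`models_endpoint`, `docs_export`, `quote_api`) and the sample membership are illustrative assumptions, not captured data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def reconcile(sources: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each model ID to the sorted list of sources that mention it."""
    seen = defaultdict(set)
    for source, model_ids in sources.items():
        for model_id in model_ids:
            seen[model_id].add(source)
    return {model_id: sorted(names) for model_id, names in seen.items()}


def missing_from(provenance: Dict[str, List[str]], source: str) -> List[str]:
    """Models known from other sources but absent from `source`."""
    return sorted(m for m, names in provenance.items() if source not in names)


# Illustrative input only: which source lists which model is made up here.
provenance = reconcile({
    "models_endpoint": ["qwen-image"],
    "docs_export": ["qwen-image", "wan-2-7-text-to-video"],
    "quote_api": ["wan-2-7-text-to-video"],
})
print(missing_from(provenance, "models_endpoint"))
```

A model that appears in `missing_from(provenance, "models_endpoint")` is a reconciliation finding to record, not evidence that the model is unavailable.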

Use references/provider-coverage-and-venice-reconciliation.md for the provider-freshness gate, Venice multi-source pattern, creative-capability normalization, x402 metadata, and end-to-end verification checklist.

Release chronology and recency bias

When inventory data drives model selection, keep provider publication, first observation, and last verification as separate timestamps. Never treat a bulk import's discovered_at as a model release date.

Default chronological views to newest source-dated deployments, but keep recency subordinate to hard capability fit and execution evidence. State the release date kind and source in recommendations; provider_published means availability on that provider, not automatically the upstream lab announcement.
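
The three-timestamp rule can be sketched as a small record. The class and function names are hypothetical; only `provider_published` and `discovered_at` are terms used in this skill.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class ModelDates:
    provider_published: Optional[date]  # availability on this provider
    first_observed: date                # e.g. a bulk import's discovered_at
    last_verified: date


def release_date(d: ModelDates) -> Tuple[Optional[date], str]:
    """Return (date, kind); first_observed is never promoted to a release date."""
    if d.provider_published is not None:
        return d.provider_published, "provider_published"
    return None, "unknown"


def newest_first(models: Dict[str, ModelDates]) -> List[str]:
    """Sort source-dated models newest first; undated models go last."""
    dated = [(k, v.provider_published) for k, v in models.items()
             if v.provider_published is not None]
    undated = sorted(k for k, v in models.items() if v.provider_published is None)
    dated.sort(key=lambda kv: kv[1], reverse=True)
    return [k for k, _ in dated] + undated


# Placeholder model keys and dates, for illustration only.
models = {
    "model-a": ModelDates(date(2026, 6, 1), date(2026, 6, 3), date(2026, 7, 1)),
    "model-b": ModelDates(None, date(2026, 7, 1), date(2026, 7, 1)),
    "model-c": ModelDates(date(2026, 7, 10), date(2026, 7, 11), date(2026, 7, 12)),
}
print(newest_first(models))
```

This ordering only produces the default chronological view; capability fit and execution evidence still decide the recommendation.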

Use references/release-chronology.md for the embedded-provider-metadata extraction pattern, provenance contract, recency bands, agent-facing context requirements, change logging, and end-to-end verification checklist.

Preferred live-check workflow

1) If web_search/web_extract are available and funded

Use them first for quick retrieval.

2) If web_search/web_extract fail due to credits or scraping limits

Use browser tools instead.

Recommended live sources:
- https://fal.ai/pricing
- https://replicate.com/pricing
- https://docs.venice.ai/overview/pricing
- https://api.venice.ai/api/v1/models (currently text-only in the public unauthenticated response; use docs pricing for image/video tables)
- provider model pages linked from those pricing pages

For Venice AI API pricing:
1. Fetch https://docs.venice.ai/overview/pricing and extract HTML tables. The image tables include fixed prices; the video table often says Variable.
2. Use POST https://api.venice.ai/api/v1/video/quote for live video pricing. It can return quotes without auth for many public pricing inputs; if auth becomes required, use docs/table extraction and label it.
3. Quote body pattern (JSON):
   {"model":"seedance-2-0-text-to-video","duration":"5s","aspect_ratio":"16:9","resolution":"720p","audio":true}
4. Normalize video prices to $/second and also report the actual quoted duration, because Venice durations are model-specific enums.
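
A minimal sketch of the normalization step. It assumes durations always look like `"5s"`, and it takes the quoted price as a plain number because the quote response shape is not documented in this skill. The helper names are hypothetical.

```python
def duration_seconds(duration: str) -> float:
    """Parse a duration enum such as "5s" into seconds."""
    if not duration.endswith("s"):
        raise ValueError(f"unexpected duration format: {duration!r}")
    return float(duration[:-1])


def normalize_quote(model: str, quoted_usd: float, duration: str) -> dict:
    """Keep the source quote next to the derived $/second figure."""
    seconds = duration_seconds(duration)
    return {
        "model": model,
        "quoted_usd": quoted_usd,
        "quoted_duration": duration,
        "usd_per_second": round(quoted_usd / seconds, 4),
    }


row = normalize_quote("kling-2.5-turbo-pro-text-to-video", 0.39, "5s")
print(row["usd_per_second"])  # 0.078
```

Reporting `quoted_usd` and `quoted_duration` alongside the derived rate keeps comparisons honest when two models do not share a duration enum.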

Reference note: see references/venice-ai-api-pricing-2026.md for a live May 2026 comparison snapshot and reusable quote commands.

Reference note: see references/venice-reference-capabilities-2026.md for a live May 2026 snapshot comparing Venice, fal, and Replicate on image-edit references, video R2V references, private/uncensored model posture, and accepted clip-duration enums.

Reference note: see references/provider-parameter-surface-comparison.md for the July 2026 cross-provider parameter audit, normalized capability contract, and verified differences for GPT Image 2, Nano Banana Pro, Recraft V4, Ideogram V4, Grok, Flux, Kling O3, Veo 3.1, LTX 2.3, Wan 2.7, and Seedance 2.0.

Reference note: see references/higgsfield-model-surface-pricing-audit.md for the July 2026 Higgsfield Gemini Omni Flash snapshot and a reusable workflow for auditing credit-based creative platforms by reconciling the live form, frontend validation/cost modules, current plan table, marketing claims, and upstream model docs. Use this pattern when a provider's landing page advertises capabilities that its actual request contract does not expose.

Reference note: see references/gemini-omni-provider-surface-2026-07.md for the direct Google vs fal vs Venice vs Higgsfield Omni contract and price comparison, including the crucial distinction between native previous_interaction_id state, stateless output-chaining, and platform-advertised conversation.

Reference note: see references/grok-imagine-vs-seedance-audio-refs-2026.md for live June 2026 fal schema/pricing notes comparing Grok Imagine Video 1.5 and Seedance 2.0 audio-reference support, including a successful Grok 15s talking-portrait request pattern.

Reference note: see references/fal-transparent-video-background-removal-2026.md for fal video-background-removal endpoint schemas and a transparent WebM workflow: generate/review a high-quality MP4 first, then postprocess with Bria/Veed background removal rather than relying on I2V alpha output or local keying hacks.

Reference note: see references/audio-driven-talking-video.md when the supplied or named-voice recording must remain the actual speaking performance. It distinguishes audio conditioning from true avatar/lip-sync contracts, covers exact-audio remuxing and Cartesia named-voice lookup, and requires source-aware voice correction: rerun a still portrait directly through the original audio-driven avatar model when quality matters; reserve post-hoc video lipsync for pre-existing moving footage or cases where preserving source motion outweighs facial-synthesis quality.

For browser extraction:
1. browser_navigate(url)
2. browser_snapshot(full=true)
3. browser_console(expression='document.body.innerText.slice(0,12000)')
4. extract only the visible, grounded rows and links

Current live-verified pricing anchors (captured April 2026 unless a later date is noted)

fal.ai pricing page

Verified directly from fal.ai/pricing using browser tools:

Video:
- Wan 2.5: $0.05 / second
- Kling 2.5 Turbo Pro: $0.07 / second
- Veo 3: $0.4 / second
- Ovi: $0.2 / video

Image:
- Seedream V4: $0.03 / image
- Flux Kontext Pro: $0.04 / image
- Nanobanana: $0.0398 / image
- Qwen: $0.02 / megapixel

Compute:
- H100: $1.89/hr / $0.0005/s
- H200: $2.10/hr / $0.0006/s
- A100: $0.99/hr / $0.0003/s

Replicate pricing page

Verified directly from replicate.com/pricing using browser tools:

Public model examples:
- black-forest-labs/flux-1.1-pro: $0.04 / output image
- black-forest-labs/flux-dev: $0.025 / output image
- black-forest-labs/flux-schnell: $3.00 / thousand output images
- ideogram-ai/ideogram-v3-quality: $0.09 / output image
- recraft-ai/recraft-v3: $0.04 / output image
- wavespeedai/wan-2.1-i2v-480p: $0.09 / second of output video
- wavespeedai/wan-2.1-i2v-720p: $0.25 / second of output video

Hardware pricing:
- gpu-a100-large: $0.001400/sec / $5.04/hr
- gpu-h100: $0.001525/sec / $5.49/hr
- gpu-l40s: $0.000975/sec / $3.51/hr
- gpu-t4: $0.000225/sec / $0.81/hr

Venice AI pricing/docs

Verified May 2026 via direct HTTP fetches of docs.venice.ai/overview/pricing and live POST /api/v1/video/quote calls.

Image examples:
- qwen-image: $0.01 / image
- grok-imagine-image: $0.03 / image (private)
- flux-2-pro: $0.04 / image
- qwen-image-2: $0.05 / image
- seedream-v4 / seedream-v5-lite: $0.05 / image
- flux-2-max: $0.09 / image
- qwen-image-2-pro: $0.10 / image
- nano-banana-2: $0.10–$0.19 / image depending on resolution
- nano-banana-pro: $0.18–$0.35 / image depending on resolution

Video quote examples (normalize, but also report the source quote):
- kling-2.5-turbo-pro-text-to-video: $0.39 / 5s ≈ $0.078/s
- kling-v3-pro-text-to-video: $0.49 / 4s audio off ≈ $0.1225/s; $0.74 / 4s audio on ≈ $0.185/s
- seedance-2-0-text-to-video 720p: $0.72 / 4s ≈ $0.18/s
- seedance-2-0-fast-text-to-video 720p: $0.58 / 4s ≈ $0.145/s
- veo3.1-fast-text-to-video 720p/1080p: $0.44 / 4s audio off ≈ $0.11/s; $0.66 / 4s audio on ≈ $0.165/s
- veo3.1-full-text-to-video 720p/1080p: $0.88 / 4s audio off ≈ $0.22/s; $1.76 / 4s audio on ≈ $0.44/s
- wan-2-7-text-to-video: $0.55 / 5s at 720p ≈ $0.11/s; $0.70 / 5s at 1080p ≈ $0.14/s
- ltx-2-v2-3-fast-text-to-video 1080p: $0.40 / 6s ≈ $0.0667/s

Live-verified June 2026 model notes

fal video background removal / transparent WebM postprocess:
bria/video/background-removal/v3: required video_url; supports background_color: "Transparent", preserve_audio, and output_container_and_codec: "webm_vp9". Use this as the preferred postprocess when the production target is transparent WebM alpha from an opaque high-quality I2V MP4.
bria/video/background-removal: same basic shape as v3; verify current schema before use.
veed/video-background-removal and /fast: required video_url; supports output_codec: "vp9", subject_is_person, and refine_foreground_edges.
App-wiring pitfall: these are video-to-video endpoints and expect video_url, not image_url.
Workflow pitfall: most high-quality I2V models output normal MP4 without alpha; generate/review the MP4 first, then run a real video background-removal model for transparent WebM. Do not present local color-keying as equivalent unless labeled fallback.
fal Kling Video v3 Pro I2V:
fal-ai/kling-video/v3/pro/image-to-video requires the reference as start_image_url, not image_url; supports duration, negative_prompt, generate_audio, shot_type, and cfg_scale.
For preserving an existing transparent PNG reference, first composite the PNG onto a plain neutral/warm-white background if the model may hallucinate checkerboard/alpha; then remove that background after I2V.
fal Grok Imagine Video 1.5:
xai/grok-imagine-video/v1.5/image-to-video requires prompt + image_url; supports integer duration 1..15 and resolution 480p/720p.
Verified OpenAPI fields do not include audio_url, audio_urls, or generate_audio: Grok generates native audio, but does not accept user-provided audio file references on fal.
Pricing observed on fal page: $0.08/s at 480p, $0.14/s at 720p, plus $0.01 per input image; generated audio included.
Practical use: cheap single-image talking/moving portrait where exact audio is not required. For user-provided audio that only needs to guide cinematic motion, consider Seedance 2.0 reference-to-video. When the uploaded recording must remain the actual speech performance, use a dedicated image+audio avatar or video+audio lip-sync endpoint from references/audio-driven-talking-video.md; do not treat Seedance reference audio as exact-audio preservation.
fal Seedance 2.0 audio-reference distinction:
bytedance/seedance-2.0/text-to-video and /image-to-video expose native generate_audio, but no user audio-reference upload fields.
bytedance/seedance-2.0/reference-to-video and /fast/reference-to-video expose audio_urls alongside image_urls and video_urls: up to 3 MP3/WAV files, combined duration <= 15s, max 15 MB each; if audio is supplied, at least one image/video reference is required. References are addressed as @Audio1, @Image1, @Video1 in the prompt.
Pricing observed: fast 720p about $0.2419/s; standard 720p about $0.3034/s; standard 1080p about $0.682/s.
fal LTX 2.3 Quality endpoints (checked from fal recently-added + OpenAPI/llms.txt):
fal-ai/ltx-2.3-quality/text-to-video: prompt-only; num_frames 9..481, frames_per_second 1..60, resolution enum/custom, generate_audio default true; price $0.0024075/MP of generated video data. At 1280×720/24fps ≈ $0.053/s; at 1920×1080/24fps ≈ $0.120/s.
fal-ai/ltx-2.3-quality/image-to-video: requires prompt + image_url; same frame-count contract and price; supports image_strength, audio generation, prompt expansion, quality/write-mode controls. Important: verified OpenAPI fields do not include end_image_url, so this is not a direct first+last-frame swap for the 22B Distilled I2V profile in Video Story classic mode; it is a first-frame animation profile unless paired with another reference/video workflow.
fal-ai/ltx-2.3-quality/audio-to-video: requires prompt + audio_url, optional image_url; match_audio_length default true, otherwise num_frames; same $0.0024075/MP; strong candidate for dialogue/performance modes when local audio exists.
fal-ai/ltx-2.3-quality/reference-video-to-video: requires prompt + video_url; num_frames/frames_per_second, video_strength, optional generated audio; same $0.0024075/MP; more useful as repair/restyle/continuation than first-pass classic shots.
fal-ai/ltx-2.3-quality/hdr and /hdr/lora: video-to-HDR workflows from video_url; HDR LoRA price $0.0027075/MP (≈ $0.060/s at 720p/24fps, ≈ $0.135/s at 1080p/24fps); post/finishing candidate, not a primary story generator.
Quality /lora variants exist for text/image/audio/HDR; they require loras (max 3, up to 3GB each) and are useful for style/brand fine-tunes, not default app profiles until LoRA upload/selection UX exists.
fal LTX 2.3 Fast endpoints:
fal-ai/ltx-2.3/text-to-video/fast: 1080p $0.04/s, 1440p $0.08/s, 2160p $0.16/s; duration enum 6,8,10,12,14,16,18,20; fps enum 24,25,48,50; generate_audio boolean.
fal-ai/ltx-2.3/image-to-video/fast: 1080p $0.06/s, 1440p $0.12/s, 2160p $0.24/s; requires image_url + prompt, optional end_image_url; same duration/fps enums; generate_audio boolean.
fal LTX 2.3 22B Distilled:
fal-ai/ltx-2.3-22b/distilled/text-to-video: priced by generated video megapixels, $0.001205/MP; supports num_frames integer 9..481, fps, video_size, audio, prompt expansion, scheduler/acceleration controls. Strong Video Story candidate for frame-precise duration because duration can be derived from num_frames / fps rather than integer-second enums.
fal-ai/ltx-2.3-22b/reference-video-to-video: $0.001605/MP; requires video_url; optional audio_url, image_url, end_image_url; supports match_video_length, num_frames, match_input_fps, fps, strength/guidance controls.
fal Wan 2.7:
fal-ai/wan/v2.7/text-to-video: $0.10/s 720p, $0.15/s 1080p; duration enum integer 2..15; supports optional audio_url; aspect ratios 16:9,9:16,1:1,4:3,3:4.
fal-ai/wan/v2.7/image-to-video: same pricing and integer 2..15; optional image_url, end_image_url, video_url, audio_url; can do first-frame, first+last-frame, video continuation, and audio-driven mode.
Atlas Cloud Wan 2.7:
alibaba/wan-2.7/text-to-video and alibaba/wan-2.7/image-to-video use POST https://api.atlascloud.ai/api/v1/model/generateVideo; price shown as from $0.10/s; duration 2..15; 720P/1080P; image-to-video uses image, optional last_image, optional audio; polling result URL can appear under data.outputs[].
Replicate LTX 2.3:
lightricks/ltx-2.3-pro: tasks text_to_video, image_to_video, audio_to_video, retake, extend; optional last_frame_image; duration enum 6,8,10; fps enum 24,25,48,50; 1080p $0.08/s, 2K $0.16/s, likely 4K $0.32/s from search snippet; native audio default true.
lightricks/ltx-2.3-fast: pricing snippet shows 1080p $0.06/s, 2K $0.12/s.
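
The per-megapixel LTX 2.3 prices above convert to $/second as sketched here. The sketch assumes one megapixel is 1,000,000 pixels and that billing covers width × height × frames; that assumption reproduces the approximate figures quoted in the notes, but the exact billing rule should be confirmed on the fal model page. The function name is hypothetical.

```python
def usd_per_second(price_per_mp: float, width: int, height: int, fps: float) -> float:
    """Convert a per-megapixel video price into dollars per second of output."""
    megapixels_per_frame = width * height / 1_000_000
    return price_per_mp * megapixels_per_frame * fps


QUALITY_PRICE = 0.0024075   # $/MP, LTX 2.3 Quality endpoints
HDR_LORA_PRICE = 0.0027075  # $/MP, LTX 2.3 Quality HDR LoRA

print(round(usd_per_second(QUALITY_PRICE, 1280, 720, 24), 3))    # 0.053
print(round(usd_per_second(QUALITY_PRICE, 1920, 1080, 24), 3))   # 0.12
print(round(usd_per_second(HDR_LORA_PRICE, 1280, 720, 24), 3))   # 0.06
```

Because cost scales with both resolution and frame rate, a 48 or 50 fps request roughly doubles the per-second cost of the same clip at 24 or 25 fps.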

Historical/session-recalled model families to check

These came from prior sessions and are useful starting points, but should be re-verified before presenting as current pricing:

fal.ai families

Image:
- FLUX: flux-pro/v1.1, flux-pro/v1.1-ultra, flux-2-pro, flux-pro/kontext, flux-pro/kontext/max, flux/dev, flux/dev/image-to-image, flux/schnell, flux-realism, flux-lora, flux-pro/v1/fill
- Qwen: qwen-image-edit, qwen-image-2/edit, and sometimes higher tiers/pro variants
- Nano Banana: nano-banana, nano-banana-2, nano-banana-2/edit, nano-banana-pro, nano-banana-pro/edit
- Ideogram, Recraft, Seedream, Reve, Imagen (availability varies)

Video / FLF candidates:
- Wan family
- Kling family
- Veo family
- MiniMax / Hailuo family
- Vidu family
- Ovi

Venice AI families

Image:
- Qwen Image / Qwen Image 2 / Qwen Image 2 Pro
- FLUX.2 Pro / FLUX.2 Max
- Nano Banana 2 / Nano Banana Pro
- Seedream v4/v5, Recraft, Grok Imagine, GPT Image variants

Video:
- Kling 2.5 / Kling 3 / Kling O3
- Veo 3 / Veo 3.1 fast/full
- Seedance 2.0 / fast / reference-to-video
- Wan 2.5 / 2.6 / 2.7
- LTX 2.x, HappyHorse, PixVerse, Runway, Vidu, Ovi, Grok Imagine

Replicate families

Image:
- FLUX 1.1 Pro / Dev / Schnell
- Recraft
- Ideogram
- Stable Diffusion 3.5 (availability can change)

Video:
- Wan family
- Kling family
- Veo family
- Luma / Ray family
- Sora family
- Hailuo / MiniMax family
- CogVideoX / LTX / Grok depending on current catalog

Updating an app inventory (example: fal-studio)

When maintaining a local picker/config such as src/models.js:

Read the current registry file.
Group entries by family, type, price unit, and whether they support image inputs.
Compare live provider pages against the registry.
Mark entries as:
- confirmed current
- likely stale
- missing from app but available upstream
Only change code after confirming endpoint names and parameter expectations.
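
The compare-and-mark steps above reduce to set arithmetic over endpoint IDs. This is a sketch only: the function name and result keys are hypothetical, and the sample inputs are illustrative rather than a real registry snapshot.

```python
from typing import Dict, Iterable, List


def classify_registry(local_ids: Iterable[str],
                      live_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket endpoint IDs by comparing the app registry with live findings."""
    local, live = set(local_ids), set(live_ids)
    return {
        "confirmed_current": sorted(local & live),
        # "Likely", not "removed": pricing pages show only featured models.
        "likely_stale": sorted(local - live),
        "missing_from_app": sorted(live - local),
    }


# Illustrative inputs; not a real snapshot of src/models.js or fal's catalog.
report = classify_registry(
    local_ids=["fal-ai/wan-flf2v", "fal-ai/veo3/image-to-video"],
    live_ids=["fal-ai/wan-flf2v", "fal-ai/veo3.1/image-to-video"],
)
print(report)
```

An entry in `likely_stale` is a prompt to re-check the endpoint page and OpenAPI schema before any code change, not a deletion list.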

For fal-studio, the model registry lives in src/models.js.

Best source for exact fal.ai input options

For fal.ai model capabilities, prefer the endpoint OpenAPI spec and llms.txt over marketing copy.

Reliable pattern:
1. Open pricing/model page to discover likely endpoint IDs.
2. Pull OpenAPI JSON directly:
   - https://fal.ai/api/openapi/queue/openapi.json?endpoint_id=ENDPOINT_ID
3. Inspect:
   - paths[*].post.requestBody.content.application/json.schema
   - components.schemas.*Input
   - components.schemas.*Output
4. Optionally read:
   - https://fal.ai/models/ENDPOINT_ID/llms.txt

Why this matters:
- the OpenAPI file gives the exact request fields, required params, enums, defaults, and output shape
- llms.txt often adds practical notes like prompt syntax (@Image1, @Image2) and pricing examples
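
A sketch of the inspection step, operating on an already-downloaded OpenAPI document so it runs without network access. The sample spec is hand-written for illustration (the schema name `ExampleInput` is hypothetical), and the helper names are not part of any fal SDK.

```python
from typing import Any, Dict
from urllib.parse import quote

OPENAPI_BASE = "https://fal.ai/api/openapi/queue/openapi.json?endpoint_id="


def openapi_url(endpoint_id: str) -> str:
    """Build the schema URL for an endpoint ID, keeping path slashes intact."""
    return OPENAPI_BASE + quote(endpoint_id, safe="/")


def input_contracts(spec: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Summarize required fields, optional fields, and enums of *Input schemas."""
    out = {}
    schemas = spec.get("components", {}).get("schemas", {})
    for name, schema in schemas.items():
        if not name.endswith("Input"):
            continue
        props = schema.get("properties", {})
        required = set(schema.get("required", []))
        out[name] = {
            "required": sorted(required),
            "optional": sorted(set(props) - required),
            "enums": {k: v["enum"] for k, v in props.items() if "enum" in v},
        }
    return out


# Hand-written sample shaped like an image-to-video input schema.
sample_spec = {"components": {"schemas": {
    "ExampleInput": {
        "required": ["prompt", "image_url"],
        "properties": {
            "prompt": {"type": "string"},
            "image_url": {"type": "string"},
            "tail_image_url": {"type": "string"},
            "duration": {"type": "string", "enum": ["5", "10"]},
        },
    },
    "ExampleOutput": {"properties": {"video": {"type": "object"}}},
}}}
print(input_contracts(sample_spec))
```

The summary maps directly onto app wiring: `required` drives validation, `optional` drives which UI sections appear, and `enums` drives dropdown values.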

Reusable fal video-model patterns discovered

These endpoint families were confirmed via OpenAPI + llms.txt and are useful for building model-specific UI:

fal-ai/wan-25-preview/text-to-video
prompt-only video
supports: resolution, duration, aspect_ratio, audio_url, negative_prompt, seed, enable_prompt_expansion, enable_safety_checker
fal-ai/wan-25-preview/image-to-video
single first-frame image-to-video
required: prompt, image_url
also supports audio_url, resolution, duration, negative_prompt, seed, enable_prompt_expansion, enable_safety_checker
fal-ai/wan-flf2v
dedicated first/last frame video
required: prompt, start_image_url, end_image_url
supports num_frames, frames_per_second, resolution, aspect_ratio, guide_scale, num_inference_steps, acceleration, shift, negative_prompt, seed
fal-ai/kling-video/v2.5-turbo/pro/image-to-video
first-frame image-to-video with optional tail frame
required: prompt, image_url
optional end/tail frame field is tail_image_url
supports duration, cfg_scale, negative_prompt
fal-ai/kling-video/o3/standard/image-to-video
first-frame image-to-video with optional end frame and multi-shot support
required: image_url
supports either prompt or multi_prompt (not both), optional end_image_url, duration, generate_audio, shot_type
good candidate for a separate “multi-shot prompts” section in UI
fal-ai/kling-video/o1/standard/image-to-video
first-frame with optional last frame
required: prompt, start_image_url
optional end_image_url
llms.txt explicitly says prompt can reference frames with @Image1 and @Image2
fal-ai/veo3.1/image-to-video
first-frame image-to-video
required: prompt, image_url
supports resolution, duration, aspect_ratio, generate_audio, negative_prompt, auto_fix, safety_tolerance, seed
fal-ai/veo3.1/first-last-frame-to-video
dedicated first/last frame video
required: prompt, first_frame_url, last_frame_url
supports resolution, duration, aspect_ratio, generate_audio, negative_prompt, auto_fix, safety_tolerance, seed
fal-ai/veo3/image-to-video
similar to Veo 3.1 image-to-video
required: prompt, image_url
fal-ai/minimax/hailuo-02/standard/image-to-video
first-frame image-to-video with optional end frame
required: prompt, image_url
optional end_image_url
supports duration, resolution, prompt_optimizer
fal-ai/vidu/start-end-to-video
dedicated first/last frame video
required: prompt, start_image_url, end_image_url
supports movement_amplitude, seed
fal-ai/vidu/reference-to-video
multi-reference image-to-video
required: prompt, reference_image_urls
supports aspect_ratio, movement_amplitude, seed
this is the clearest confirmed example of a provider model that wants a general multi-image reference section rather than first/last frame fields
fal-ai/ovi/image-to-video
single reference image-to-video
required: prompt, image_url
supports num_inference_steps, negative_prompt, audio_negative_prompt, seed

UI design rule for fal-studio-like apps

Do not use one generic “reference image” uploader for every model. Use separate capability-driven sections:
- General reference images
- First frame
- Last frame
- Optional special sections such as multi-shot prompts

And drive them from model metadata, not hardcoded string checks.

Recommended capability flags per model:
- mediaKind: image or video
- supportsPrompt
- supportsGeneralImageRefs
- generalImageRefField
- minGeneralImageRefs
- maxGeneralImageRefs
- supportsFirstFrame
- firstFrameField
- supportsLastFrame
- lastFrameField
- supportsAudioReference
- audioField
- supportsMultiPrompt
- multiPromptField
- supportsTailFrame if provider uses a nonstandard end-frame field like tail_image_url

Important implementation note:
- first/last frame sections should be distinct from general image reference sections
- if a model supports only first frame, show only that section
- if it supports general refs (like reference_image_urls), show a multi-upload reference section
- if the provider uses nonstandard field names (tail_image_url, first_frame_url, start_image_url), map them explicitly in metadata instead of branching on substring matches in backend code
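
A sketch of metadata-driven payload building using the flags above. The request field names come from the endpoint notes in this skill; the `MODEL_CAPS` layout and `build_payload` helper are hypothetical and written in Python for illustration, whereas the real fal-studio registry is a JavaScript file.

```python
from typing import Any, Dict, List, Optional

# Capability metadata keyed by endpoint ID (illustrative layout).
MODEL_CAPS: Dict[str, Dict[str, Any]] = {
    "fal-ai/kling-video/v2.5-turbo/pro/image-to-video": {
        "mediaKind": "video", "supportsPrompt": True,
        "supportsFirstFrame": True, "firstFrameField": "image_url",
        "supportsLastFrame": True, "lastFrameField": "tail_image_url",
    },
    "fal-ai/veo3.1/first-last-frame-to-video": {
        "mediaKind": "video", "supportsPrompt": True,
        "supportsFirstFrame": True, "firstFrameField": "first_frame_url",
        "supportsLastFrame": True, "lastFrameField": "last_frame_url",
    },
    "fal-ai/vidu/reference-to-video": {
        "mediaKind": "video", "supportsPrompt": True,
        "supportsGeneralImageRefs": True,
        "generalImageRefField": "reference_image_urls",
    },
}


def build_payload(endpoint_id: str, prompt: str,
                  first_frame: Optional[str] = None,
                  last_frame: Optional[str] = None,
                  refs: Optional[List[str]] = None) -> Dict[str, Any]:
    """Map generic UI inputs onto provider field names via metadata only."""
    caps = MODEL_CAPS[endpoint_id]
    payload: Dict[str, Any] = {}
    if caps.get("supportsPrompt"):
        payload["prompt"] = prompt
    if first_frame is not None:
        if not caps.get("supportsFirstFrame"):
            raise ValueError(f"{endpoint_id} has no first-frame input")
        payload[caps["firstFrameField"]] = first_frame
    if last_frame is not None:
        if not caps.get("supportsLastFrame"):
            raise ValueError(f"{endpoint_id} has no last-frame input")
        payload[caps["lastFrameField"]] = last_frame
    if refs:
        if not caps.get("supportsGeneralImageRefs"):
            raise ValueError(f"{endpoint_id} has no general reference input")
        payload[caps["generalImageRefField"]] = list(refs)
    return payload
```

With this shape, adding a model that uses yet another field name is a metadata change only; `build_payload` contains no per-model branches.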

Output format recommendation

When reporting back to the user, use:

Live-verified now
Normalized cost comparison — same operation, duration, resolution, aspect ratio, and audio state
Parameter-surface differences — reference limits, frame/audio inputs, timing, quality/FPS/seed/safety controls, task modes, privacy, and delivery behavior
Historical notes to verify
Recommended models by use case — state whether the recommendation optimizes price, accepted-output cost, controllability, or privacy
App inventory gaps / stale entries

Do not call two deployments equivalent merely because they share a model-family name. A generic provider API may omit model-native controls available through another provider, while a private or structured-reference deployment may justify a higher price. Use references/provider-parameter-surface-comparison.md for the normalized comparison contract.

Pitfalls

Pricing pages often show only featured examples, not the full catalog.
fal.ai mixes per-image, per-megapixel, per-second, and per-video billing.
Replicate mixes output-based pricing with hardware-time pricing.
Venice AI image pricing is visible in docs tables, but video pricing is often listed as Variable; use the Venice Video Quote API and normalize to $/second instead of guessing from the table.
Venice https://api.venice.ai/api/v1/models may return only text models when fetched unauthenticated even though docs list image/video models; do not conclude image/video are unavailable from that endpoint alone.
A session-recalled endpoint may have been renamed or removed.
Do not assume that a "401 Authentication is required" response means the user must manually activate a model. In fal-studio testing, the same 401 message appeared across many unrelated fal endpoints when the saved key was invalid or a test placeholder. First verify the saved key against multiple models before blaming model gating.
Browser extraction is currently more reliable than web_extract here when Firecrawl credits are exhausted.
fal billing endpoint paths may vary. The initially attempted https://rest.alpha.fal.ai/account/billing?expand=credits returned 404 in live testing here, so implement credit lookup with endpoint fallbacks rather than assuming a single fixed path.
For fal.ai model capability discovery, the most reliable source is often the endpoint OpenAPI schema directly: https://fal.ai/api/openapi/queue/openapi.json?endpoint_id=.... This exposes the exact request fields, required uploads, enums, and output types even when the marketing page is incomplete.
llms.txt pages are useful to recover human-readable pricing notes and prompt conventions (for example Kling O1's @Image1 / @Image2 references), but the OpenAPI schema should be treated as the source of truth for app wiring.
Some video endpoints accept optional alternate prompt structures (for example Kling O3 multi_prompt) and may not require the normal text prompt when the alternate field is present. UI validation should account for that.
Some “special reference” workflows do not have a unique dedicated API field. Example: Vidu reference-to-video uses reference_image_urls; a 9-frame storyboard/grid image can still be surfaced as a separate UI section and merged into that array on submit.
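
The alternate-prompt pitfall can be handled with a small validation step, sketched below. It treats `multi_prompt` as an opaque non-empty value because its item shape is not recorded in this skill; the function name and error strings are hypothetical.

```python
from typing import Any, Dict, List


def validate_prompts(payload: Dict[str, Any], supports_multi_prompt: bool) -> List[str]:
    """Check the prompt / multi_prompt rule: exactly one of them must be set."""
    has_prompt = bool(payload.get("prompt"))
    has_multi = bool(payload.get("multi_prompt"))
    errors = []
    if has_multi and not supports_multi_prompt:
        errors.append("multi_prompt is not supported by this model")
    if has_prompt and has_multi:
        errors.append("send prompt or multi_prompt, not both")
    if not has_prompt and not has_multi:
        errors.append("a prompt is required")
    return errors


print(validate_prompts({"multi_prompt": ["shot 1", "shot 2"]}, True))   # []
print(validate_prompts({"prompt": "x", "multi_prompt": ["shot 1"]}, True))
```

Keying the check on a `supportsMultiPrompt`-style flag keeps ordinary models on the stricter "prompt required" rule.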

Quick recommendations by use case

Cheapest fast images: FLUX Schnell on fal or Replicate, where available
Premium image editing on fal: Flux Kontext / Qwen / Nano Banana edit tiers
Typography/design: Ideogram, Recraft, Qwen, Seedream
Budget video: Wan family
Premium video: Kling / Veo tiers
First-last-frame workflows: Wan, Kling, Veo, Hailuo, Vidu families (verify exact endpoint syntax)

Video Story-specific decision rules (live-verified April 2026)

Use these when evaluating models for Alex's video-story app.

1) Fractional-duration video is the gating constraint

Video Story derives clip lengths from audio timing and needs sub-second durations.

Live-verified fit:
- replicate.com/lucataco/wan-2.2-first-last-frame shows duration_seconds as a number with minimum 0.5 and maximum 10.
- This makes the current Replicate Wan 2.2 first/last-frame model a strong fit for Video Story's timing model.

Live-verified fal limitations:
- fal-ai/wan-25-preview/text-to-video and /image-to-video expose duration enums of only 5 or 10
- fal-ai/kling-video/v2.5-turbo/pro/image-to-video exposes only 5 or 10
- fal-ai/veo3.1/first-last-frame-to-video exposes 4s, 6s, 8s
- fal-ai/minimax/hailuo-02/standard/image-to-video exposes 6 or 10
- fal-ai/kling-video/o1/standard/image-to-video and o3/standard/image-to-video allow more durations, but still only integer-second enums

Practical rule:
- If the app must preserve audio-aligned durations like 0.5s, 1.5s, 2.5s, etc., most current fal video endpoints are a bad direct fit without padding, retiming, or export-time trimming hacks.
- For Video Story, treat Replicate Wan 2.2 FLF as the current benchmark unless a newer model is verified to accept numeric fractional durations.
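
To make the trimming overhead concrete, this sketch picks the shortest enum duration that covers a target clip and reports how much would be cut at export. The second helper shows the frame-count alternative for endpoints that take num_frames and fps; its default 9..481 clamp is the num_frames range noted for LTX 2.3 in this skill. Both function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def fit_enum_duration(target_seconds: float,
                      allowed: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Return (chosen_duration, seconds_to_trim), or None if nothing is long enough."""
    candidates = [d for d in sorted(allowed) if d >= target_seconds]
    if not candidates:
        return None
    chosen = candidates[0]
    return chosen, round(chosen - target_seconds, 3)


def frames_for(target_seconds: float, fps: int,
               min_frames: int = 9, max_frames: int = 481) -> int:
    """Nearest frame count for a target duration, clamped to the allowed range."""
    frames = round(target_seconds * fps)
    return max(min_frames, min(max_frames, frames))


print(fit_enum_duration(1.5, [5, 10]))   # (5, 3.5): 3.5s generated, then trimmed
print(frames_for(1.5, 24))               # 36 frames, i.e. 1.5s at 24 fps
```

A 1.5s clip on a 5-or-10 enum pays for and discards 3.5 seconds, which is why fractional-duration or frame-count endpoints fit audio-timed clips better.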

2) Trust OpenAPI over marketing copy for image-input requirements

Live-verified examples:
- fal-ai/flux-pro/kontext OpenAPI requires image_url
- fal-ai/qwen-image requires only prompt
- fal-ai/qwen-image-2/edit and fal-ai/qwen-image-2/pro/edit require prompt + image_urls
- fal-ai/nano-banana/edit, nano-banana-2/edit, and nano-banana-pro/edit require prompt + image_urls
- fal-ai/flux-2-pro/edit requires prompt + image_urls

Practical rule:
- Do not assume a model marketed as "can generate new images from text" is a drop-in text-only endpoint for production. Verify whether the schema actually requires an image field.
- For Video Story, flux-pro/kontext is best treated as an edit model, not a pure text-only reference generator.

3) Video Story reference-generation buckets

Separate decisions into two different jobs:

A. Text-only reference generation (characters / sets / props before refs exist)

Live-verified candidates:
- fal-ai/qwen-image: text-only, $0.02 / megapixel
- fal-ai/bytedance/seedream/v4/text-to-image: text-only, $0.03 / image
- fal-ai/nano-banana: text-only, about $0.039 / image
- replicate.com/black-forest-labs/flux-1.1-pro: text-only with optional composition guidance, $0.04 / output image
- replicate.com/black-forest-labs/flux-2-pro: text-only or multi-ref, pricing is mixed: $0.015 / run + $0.015 / input MP + $0.015 / output MP
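
The mixed flux-2-pro pricing is easier to compare once it is turned into a per-run figure. This sketch assumes the three components simply add up and scale linearly with megapixels; any rounding or minimum-charge rules on Replicate's side are unknown here. The function name is hypothetical.

```python
def flux_2_pro_cost(input_megapixels: float, output_megapixels: float,
                    per_run: float = 0.015,
                    per_input_mp: float = 0.015,
                    per_output_mp: float = 0.015) -> float:
    """Estimated cost of one run under the run + input MP + output MP pricing."""
    return (per_run
            + per_input_mp * input_megapixels
            + per_output_mp * output_megapixels)


# Text-only, 1 MP output: run fee plus one output megapixel.
print(round(flux_2_pro_cost(0, 1), 4))   # 0.03
# Two 1 MP reference images, 1 MP output.
print(round(flux_2_pro_cost(2, 1), 4))   # 0.06
```

Under these assumptions a text-only 1 MP image lands near the per-image prices of the other candidates, while each added reference megapixel raises the cost.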

B. Reference-guided frame generation / compositing

Live-verified candidates:
- replicate.com/black-forest-labs/flux-2-pro: up to 8 reference images on API
- fal-ai/flux-2-pro/edit: requires image_urls, priced from $0.03 for first output MP plus extra MP charges
- fal-ai/qwen-image-2/edit: requires image_urls with 1-3 images, $0.035 / image
- fal-ai/qwen-image-2/pro/edit: requires image_urls with 1-3 images, $0.075 / image
- fal-ai/nano-banana/edit: requires image_urls, about $0.039 / image
- fal-ai/nano-banana-2/edit: requires image_urls, $0.08 / image base with resolution/thinking surcharges
- fal-ai/nano-banana-pro/edit: requires image_urls, $0.15 / image base
- fal-ai/flux-pro/kontext / kontext/max: single-image edit flow, not multi-ref compositing

4) Video Story app-audit note: current "qwen" path is a hybrid

The current app implementation is not just "Qwen" in one uniform sense:
- text-only references go through fal-ai/qwen-image
- reference-guided shots go through fal-ai/qwen-image-2/pro/edit

This matters because:
- pricing is different between the text-only and edit legs
- ref-count limits are different from some local registry assumptions
- quality/capability discussions should distinguish qwen-image from qwen-image-2/pro/edit

5) Video Story model-selection guidance

If the user asks which models are best for Video Story specifically, use this default reasoning:
- Best current video fit: Replicate Wan 2.2 FLF, because fractional duration support beats today's fal video enums
- Best budget text-only references: fal Qwen Image
- Best likely upgrade candidate for pure text-only references: fal Seedream V4
- Best current multi-reference frame candidate on paper: Replicate FLUX.2 Pro (8 refs) or fal Nano Banana 2 / Pro when very high reference count matters
- Best single-reference edit model: fal Flux Kontext / Kontext Max
- Do not recommend a fal video model as the primary Video Story generator unless its duration API is verified to support numeric fractional timing

Maintenance rule

If you use this skill and discover renamed endpoints, broken pricing assumptions, or new provider pages, patch this skill immediately and update references/inventory.md.