Maintain a reusable inventory of image and video generation models across fal.ai, Replicate, and Venice AI.
Use this skill when:
- the user asks what models are available on fal.ai, Replicate, or Venice AI
- the user wants current API pricing for image/video models
- the user is choosing between providers for a new app
- the user wants to update a model picker/config like fal-studio
- you need a grounded comparison of image vs video models, edit models, FLF/first-last-frame models, or provider-specific endpoints
Always separate findings into two buckets:
1. Live-verified now
2. Historical/session-recalled notes
Do not present session-recalled prices as freshly verified. Label them clearly.
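One way to keep the two buckets from blurring is to tag every price note with its provenance at the moment it is recorded. The sketch below is illustrative only: `PriceNote`, `render`, and the `bucket` values are hypothetical names, not part of any provider API or of fal-studio.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class PriceNote:
    provider: str
    model: str
    price: str  # keep the provider's own unit string, e.g. "$0.05 / second"
    bucket: Literal["live", "recalled"]
    checked: str = ""  # free-form source or date note


def render(notes):
    """Group notes into the two buckets and label each group explicitly."""
    headings = {
        "live": "Live-verified now",
        "recalled": "Historical/session-recalled notes",
    }
    lines = []
    for bucket, heading in headings.items():
        rows = [n for n in notes if n.bucket == bucket]
        if not rows:
            continue
        lines.append(heading + ":")
        for n in rows:
            suffix = f" ({n.checked})" if n.checked else ""
            lines.append(f"- {n.provider} {n.model}: {n.price}{suffix}")
    return "\n".join(lines)


notes = [
    PriceNote("fal.ai", "Wan 2.5", "$0.05 / second", "live", "fal.ai/pricing"),
    PriceNote("fal.ai", "flux-pro/v1.1", "price not re-checked", "recalled"),
]
print(render(notes))
```

Because the bucket is a required field, a recalled price cannot be reported without its label.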
Use the reference files under references/ (such as references/inventory.md) first for quick retrieval.
If web_extract is unavailable (for example, when Firecrawl credits are exhausted), use browser tools instead.
Recommended live sources:
- https://fal.ai/pricing
- https://replicate.com/pricing
- https://docs.venice.ai/overview/pricing
- https://api.venice.ai/api/v1/models (currently text-only in the public unauthenticated response; use docs pricing for image/video tables)
- provider model pages linked from those pricing pages
For Venice AI API pricing:
1. Fetch https://docs.venice.ai/overview/pricing and extract HTML tables. The image tables include fixed prices; the video table often says Variable.
2. Use POST https://api.venice.ai/api/v1/video/quote for live video pricing. It can return quotes without auth for many public pricing inputs; if auth becomes required, use docs/table extraction and label it.
3. Quote body pattern:
```json
{"model":"seedance-2-0-text-to-video","duration":"5s","aspect_ratio":"16:9","resolution":"720p","audio":true}
```
4. Normalize video prices to $/second and also report the actual quoted duration, because Venice durations are model-specific enums.
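The quote and normalization steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the response shape of the quote endpoint is not documented in this skill (so `fetch_quote` returns the decoded JSON untouched), and durations are assumed to be enum strings of the form `"5s"`.

```python
import json
import re
import urllib.request

QUOTE_URL = "https://api.venice.ai/api/v1/video/quote"


def quote_body(model, duration, aspect_ratio="16:9", resolution="720p", audio=True):
    """Build a request body following the quote body pattern in this skill."""
    return {
        "model": model,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "audio": audio,
    }


def duration_seconds(duration):
    """Parse a duration enum such as '5s' into seconds (assumed format)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)s", duration)
    if not match:
        raise ValueError(f"unexpected duration format: {duration!r}")
    return float(match.group(1))


def per_second(quoted_price, duration):
    """Normalize a quoted clip price to $/second, rounded to 4 decimals."""
    return round(quoted_price / duration_seconds(duration), 4)


def fetch_quote(body, timeout=30):
    """POST the body and return the decoded JSON response as-is.

    The response field names are an unknown here; inspect the result
    before relying on any particular key.
    """
    request = urllib.request.Request(
        QUOTE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


body = quote_body("seedance-2-0-text-to-video", "5s")
print(json.dumps(body))
print(per_second(0.39, "5s"), "per second, quoted for", body["duration"])
```

Reporting the quoted duration next to the normalized rate keeps the source quote recoverable when a model only accepts a few duration values.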
Reference note: see references/venice-ai-api-pricing-2026.md for a live May 2026 comparison snapshot and reusable quote commands.
Reference note: see references/venice-reference-capabilities-2026.md for a live May 2026 snapshot comparing Venice, fal, and Replicate on image-edit references, video R2V references, private/uncensored model posture, and accepted clip-duration enums.
For browser extraction:
1. browser_navigate(url)
2. browser_snapshot(full=true)
3. browser_console(expression='document.body.innerText.slice(0,12000)')
4. extract only the visible, grounded rows and links
Verified directly from fal.ai/pricing using browser tools:
Video:
- Wan 2.5 — $0.05 / second
- Kling 2.5 Turbo Pro — $0.07 / second
- Veo 3 — $0.4 / second
- Ovi — $0.2 / video
Image:
- Seedream V4 — $0.03 / image
- Flux Kontext Pro — $0.04 / image
- Nanobanana — $0.0398 / image
- Qwen — $0.02 / megapixel
Compute:
- H100 — $1.89/hr / $0.0005/s
- H200 — $2.10/hr / $0.0006/s
- A100 — $0.99/hr / $0.0003/s
Verified directly from replicate.com/pricing using browser tools:
Public model examples:
- black-forest-labs/flux-1.1-pro — $0.04 / output image
- black-forest-labs/flux-dev — $0.025 / output image
- black-forest-labs/flux-schnell — $3.00 / thousand output images
- ideogram-ai/ideogram-v3-quality — $0.09 / output image
- recraft-ai/recraft-v3 — $0.04 / output image
- wavespeedai/wan-2.1-i2v-480p — $0.09 / second of output video
- wavespeedai/wan-2.1-i2v-720p — $0.25 / second of output video
Hardware pricing:
- gpu-a100-large — $0.001400/sec / $5.04/hr
- gpu-h100 — $0.001525/sec / $5.49/hr
- gpu-l40s — $0.000975/sec / $3.51/hr
- gpu-t4 — $0.000225/sec / $0.81/hr
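When copying hardware rows, a quick arithmetic check catches transcription errors: the hourly figure should equal the per-second figure times 3600. The sketch below applies that check to the Replicate rows above; the function names and the half-cent tolerance are illustrative choices.

```python
def hourly_from_per_second(per_second):
    """Derive an hourly price from a per-second price."""
    return round(per_second * 3600, 2)


def mismatches(table, tolerance=0.005):
    """Return the names whose listed hourly price disagrees with the derived one."""
    bad = []
    for name, (per_second, per_hour) in table.items():
        if abs(hourly_from_per_second(per_second) - per_hour) >= tolerance:
            bad.append(name)
    return bad


# (per-second, per-hour) pairs copied from the Replicate hardware list above.
replicate_hardware = {
    "gpu-a100-large": (0.001400, 5.04),
    "gpu-h100": (0.001525, 5.49),
    "gpu-l40s": (0.000975, 3.51),
    "gpu-t4": (0.000225, 0.81),
}

print(mismatches(replicate_hardware) or "all rows consistent")
```

The same check does not apply directly to the fal.ai compute rows, because their per-second figures are rounded to four decimals.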
Verified May 2026 via direct HTTP fetches of docs.venice.ai/overview/pricing and live POST /api/v1/video/quote calls.
Image examples:
- qwen-image — $0.01 / image
- grok-imagine-image — $0.03 / image (private)
- flux-2-pro — $0.04 / image
- qwen-image-2 — $0.05 / image
- seedream-v4 / seedream-v5-lite — $0.05 / image
- flux-2-max — $0.09 / image
- qwen-image-2-pro — $0.10 / image
- nano-banana-2 — $0.10–$0.19 / image depending on resolution
- nano-banana-pro — $0.18–$0.35 / image depending on resolution
Video quote examples (normalize but also report source quote):
- kling-2.5-turbo-pro-text-to-video — $0.39 / 5s ≈ $0.078/s
- kling-v3-pro-text-to-video — $0.49 / 4s audio off ≈ $0.1225/s; $0.74 / 4s audio on ≈ $0.185/s
- seedance-2-0-text-to-video 720p — $0.72 / 4s ≈ $0.18/s
- seedance-2-0-fast-text-to-video 720p — $0.58 / 4s ≈ $0.145/s
- veo3.1-fast-text-to-video 720p/1080p — $0.44 / 4s audio off ≈ $0.11/s; $0.66 / 4s audio on ≈ $0.165/s
- veo3.1-full-text-to-video 720p/1080p — $0.88 / 4s audio off ≈ $0.22/s; $1.76 / 4s audio on ≈ $0.44/s
- wan-2-7-text-to-video — $0.55 / 5s at 720p ≈ $0.11/s; $0.70 / 5s at 1080p ≈ $0.14/s
- ltx-2-v2-3-fast-text-to-video 1080p — $0.40 / 6s ≈ $0.0667/s
These came from prior sessions and are useful starting points, but should be re-verified before presenting as current pricing:
Image:
- FLUX: flux-pro/v1.1, flux-pro/v1.1-ultra, flux-2-pro, flux-pro/kontext, flux-pro/kontext/max, flux/dev, flux/dev/image-to-image, flux/schnell, flux-realism, flux-lora, flux-pro/v1/fill
- Qwen: qwen-image-edit, qwen-image-2/edit, and sometimes higher tiers/pro variants
- Nano Banana: nano-banana, nano-banana-2, nano-banana-2/edit, nano-banana-pro, nano-banana-pro/edit
- Ideogram, Recraft, Seedream, Reve, Imagen (availability varies)
Video / FLF candidates:
- Wan family
- Kling family
- Veo family
- MiniMax / Hailuo family
- Vidu family
- Ovi
Image:
- Qwen Image / Qwen Image 2 / Qwen Image 2 Pro
- FLUX.2 Pro / FLUX.2 Max
- Nano Banana 2 / Nano Banana Pro
- Seedream v4/v5, Recraft, Grok Imagine, GPT Image variants
Video:
- Kling 2.5 / Kling 3 / Kling O3
- Veo 3 / Veo 3.1 fast/full
- Seedance 2.0 / fast / reference-to-video
- Wan 2.5 / 2.6 / 2.7
- LTX 2.x, HappyHorse, PixVerse, Runway, Vidu, Ovi, Grok Imagine
Image:
- FLUX 1.1 Pro / Dev / Schnell
- Recraft
- Ideogram
- Stable Diffusion 3.5 (availability can change)
Video:
- Wan family
- Kling family
- Veo family
- Luma / Ray family
- Sora family
- Hailuo / MiniMax family
- CogVideoX / LTX / Grok depending on current catalog
When maintaining a local picker/config such as src/models.js:
For fal-studio, the model registry lives in:
- src/models.js
For fal.ai model capabilities, prefer the endpoint OpenAPI spec and llms.txt over marketing copy.
Reliable pattern:
1. Open pricing/model page to discover likely endpoint IDs.
2. Pull OpenAPI JSON directly:
- https://fal.ai/api/openapi/queue/openapi.json?endpoint_id=ENDPOINT_ID
3. Inspect:
- paths[*].post.requestBody.content.application/json.schema
- components.schemas.*Input
- components.schemas.*Output
4. Optionally read:
- https://fal.ai/models/ENDPOINT_ID/llms.txt
Why this matters:
- the OpenAPI file gives the exact request fields, required params, enums, defaults, and output shape
- llms.txt often adds practical notes like prompt syntax (@Image1, @Image2) and pricing examples
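The inspection pattern above can be sketched as a small Python helper. The assumptions are stated plainly: `fetch_spec`, `resolve`, `input_schema`, and `summarize` are hypothetical names; `SAMPLE_SPEC` is a hand-written stand-in shaped like an OpenAPI document, not a real fal.ai response; and only local `#/components/schemas/...` references are followed.

```python
import json
import urllib.parse
import urllib.request

OPENAPI_URL = "https://fal.ai/api/openapi/queue/openapi.json"


def fetch_spec(endpoint_id, timeout=30):
    """Download the OpenAPI JSON for one endpoint (not called in this sketch)."""
    query = urllib.parse.urlencode({"endpoint_id": endpoint_id})
    with urllib.request.urlopen(f"{OPENAPI_URL}?{query}", timeout=timeout) as resp:
        return json.load(resp)


def resolve(spec, node):
    """Follow a local '#/...' $ref if the node has one."""
    ref = node.get("$ref")
    if not ref:
        return node
    target = spec
    for part in ref.split("/")[1:]:  # drop the leading '#'
        target = target[part]
    return target


def input_schema(spec):
    """Return the first POST request-body schema found under paths."""
    for path_item in spec.get("paths", {}).values():
        try:
            schema = path_item["post"]["requestBody"]["content"]["application/json"]["schema"]
        except KeyError:
            continue
        return resolve(spec, schema)
    return None


def summarize(spec):
    """Collect required fields, enums, and defaults from the input schema."""
    schema = input_schema(spec) or {}
    props = schema.get("properties", {})
    return {
        "required": sorted(schema.get("required", [])),
        "enums": {k: v["enum"] for k, v in props.items() if "enum" in v},
        "defaults": {k: v["default"] for k, v in props.items() if "default" in v},
    }


# Hand-written sample for illustration only; not a real endpoint schema.
SAMPLE_SPEC = {
    "paths": {"/fal-ai/example-endpoint": {"post": {"requestBody": {"content": {
        "application/json": {"schema": {"$ref": "#/components/schemas/ExampleInput"}}}}}}},
    "components": {"schemas": {"ExampleInput": {
        "type": "object",
        "required": ["prompt", "image_url"],
        "properties": {
            "prompt": {"type": "string"},
            "image_url": {"type": "string"},
            "duration": {"type": "string", "enum": ["5", "10"], "default": "5"},
        },
    }}},
}

print(summarize(SAMPLE_SPEC))
```

A summary like this is usually enough to decide which upload sections and dropdowns a model needs in the UI.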
These endpoint families were confirmed via OpenAPI + llms.txt and are useful for building model-specific UI:
fal-ai/wan-25-preview/text-to-video
- supports: resolution, duration, aspect_ratio, audio_url, negative_prompt, seed, enable_prompt_expansion, enable_safety_checker
fal-ai/wan-25-preview/image-to-video
- required: prompt, image_url
- also supports: audio_url, resolution, duration, negative_prompt, seed, enable_prompt_expansion, enable_safety_checker
fal-ai/wan-flf2v
- required: prompt, start_image_url, end_image_url
- supports: num_frames, frames_per_second, resolution, aspect_ratio, guide_scale, num_inference_steps, acceleration, shift, negative_prompt, seed
fal-ai/kling-video/v2.5-turbo/pro/image-to-video
- required: prompt, image_url
- optional: tail_image_url
- supports: duration, cfg_scale, negative_prompt
fal-ai/kling-video/o3/standard/image-to-video
- required: image_url
- prompt or multi_prompt (not both), optional end_image_url, duration, generate_audio, shot_type
- good candidate for a separate “multi-shot prompts” section in UI
fal-ai/kling-video/o1/standard/image-to-video
- required: prompt, start_image_url
- optional: end_image_url
- llms.txt explicitly says prompt can reference frames with @Image1 and @Image2
fal-ai/veo3.1/image-to-video
- required: prompt, image_url
- supports: resolution, duration, aspect_ratio, generate_audio, negative_prompt, auto_fix, safety_tolerance, seed
fal-ai/veo3.1/first-last-frame-to-video
- required: prompt, first_frame_url, last_frame_url
- supports: resolution, duration, aspect_ratio, generate_audio, negative_prompt, auto_fix, safety_tolerance, seed
fal-ai/veo3/image-to-video
- required: prompt, image_url
fal-ai/minimax/hailuo-02/standard/image-to-video
- required: prompt, image_url
- optional: end_image_url
- supports: duration, resolution, prompt_optimizer
fal-ai/vidu/start-end-to-video
- required: prompt, start_image_url, end_image_url
- supports: movement_amplitude, seed
fal-ai/vidu/reference-to-video
- required: prompt, reference_image_urls
- other fields: aspect_ratio, movement_amplitude, seed
- this is the clearest confirmed example of a provider model that wants a general multi-image reference section rather than first/last frame fields
fal-ai/ovi/image-to-video
- required: prompt, image_url
- other fields: num_inference_steps, negative_prompt, audio_negative_prompt, seed
Do not use one generic “reference image” uploader for every model. Use separate capability-driven sections:
- General reference images
- First frame
- Last frame
- Optional special sections such as multi-shot prompts
And drive them from model metadata, not hardcoded string checks.
Recommended capability flags per model:
- mediaKind: image or video
- supportsPrompt
- supportsGeneralImageRefs
- generalImageRefField
- minGeneralImageRefs
- maxGeneralImageRefs
- supportsFirstFrame
- firstFrameField
- supportsLastFrame
- lastFrameField
- supportsAudioReference
- audioField
- supportsMultiPrompt
- multiPromptField
- supportsTailFrame if provider uses a nonstandard end-frame field like tail_image_url
Important implementation note:
- first/last frame sections should be distinct from general image reference sections
- if a model supports only first frame, show only that section
- if it supports general refs (like reference_image_urls), show a multi-upload reference section
- if the provider uses nonstandard field names (tail_image_url, first_frame_url, start_image_url), map them explicitly in metadata instead of branching on substring matches in backend code
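A minimal sketch of metadata-driven wiring, using the capability flags listed above. The flag names come from that list and the provider field names from the endpoint notes in this skill. Everything else is an assumption for illustration: the `MODELS` registry shape, the helper names, and treating `image_url` as the first-frame field for an image-to-video endpoint.

```python
MODELS = {
    "fal-ai/veo3.1/first-last-frame-to-video": {
        "mediaKind": "video",
        "supportsPrompt": True,
        "supportsFirstFrame": True,
        "firstFrameField": "first_frame_url",
        "supportsLastFrame": True,
        "lastFrameField": "last_frame_url",
    },
    "fal-ai/kling-video/v2.5-turbo/pro/image-to-video": {
        "mediaKind": "video",
        "supportsPrompt": True,
        "supportsFirstFrame": True,
        "firstFrameField": "image_url",  # assumption: start image acts as first frame
        "supportsLastFrame": True,
        "lastFrameField": "tail_image_url",  # nonstandard name mapped explicitly
    },
    "fal-ai/vidu/reference-to-video": {
        "mediaKind": "video",
        "supportsPrompt": True,
        "supportsGeneralImageRefs": True,
        "generalImageRefField": "reference_image_urls",
    },
}


def ui_sections(meta):
    """List the upload sections to show, driven only by capability flags."""
    sections = []
    if meta.get("supportsGeneralImageRefs"):
        sections.append("general_refs")
    if meta.get("supportsFirstFrame"):
        sections.append("first_frame")
    if meta.get("supportsLastFrame"):
        sections.append("last_frame")
    if meta.get("supportsMultiPrompt"):
        sections.append("multi_prompt")
    return sections


def build_payload(meta, prompt=None, first_frame=None, last_frame=None, refs=()):
    """Map generic UI inputs onto the provider's own field names."""
    payload = {}
    if meta.get("supportsPrompt") and prompt:
        payload["prompt"] = prompt
    if meta.get("supportsFirstFrame") and first_frame:
        payload[meta["firstFrameField"]] = first_frame
    if meta.get("supportsLastFrame") and last_frame:
        payload[meta["lastFrameField"]] = last_frame
    if meta.get("supportsGeneralImageRefs") and refs:
        payload[meta["generalImageRefField"]] = list(refs)
    return payload


kling = MODELS["fal-ai/kling-video/v2.5-turbo/pro/image-to-video"]
print(ui_sections(kling))
print(build_payload(kling, prompt="a slow pan", first_frame="a.png", last_frame="b.png"))
```

Neither helper inspects the endpoint ID string, so adding a model with a new field name only requires a new metadata entry.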
When reporting back to the user, use:
- Live-verified now
- Historical notes to verify
- Recommended models by use case
- App inventory gaps / stale entries
Pitfalls:
- The Venice docs video table often says Variable; use the Venice Video Quote API and normalize to $/second instead of guessing from the table.
- https://api.venice.ai/api/v1/models may return only text models when fetched unauthenticated even though docs list image/video models; do not conclude image/video are unavailable from that endpoint alone.
- Do not assume that 401 Authentication is required means the user must manually activate a model. In fal-studio testing, the same 401 message appeared across many unrelated fal endpoints when the saved key was invalid/test-placeholder. First verify the saved key against multiple models before blaming model gating.
- Prefer browser tools over web_extract here when Firecrawl credits are exhausted.
- https://rest.alpha.fal.ai/account/billing?expand=credits returned 404 in live testing here, so implement credit lookup with endpoint fallbacks rather than assuming a single fixed path.
- For fal endpoint schemas, fetch https://fal.ai/api/openapi/queue/openapi.json?endpoint_id=.... This exposes the exact request fields, required uploads, enums, and output types even when the marketing page is incomplete.
- llms.txt pages are useful to recover human-readable pricing notes and prompt conventions (for example Kling O1's @Image1 / @Image2 references), but the OpenAPI schema should be treated as the source of truth for app wiring.
- Some endpoints accept an alternate prompt field (for example multi_prompt) and may not require the normal text prompt when the alternate field is present. UI validation should account for that.
- Vidu reference-to-video uses reference_image_urls; a 9-frame storyboard/grid image can still be surfaced as a separate UI section and merged into that array on submit.
Use these when evaluating models for Alex's video-story app.
Video Story derives clip lengths from audio timing and needs sub-second durations.
Live-verified fit:
- replicate.com/lucataco/wan-2.2-first-last-frame shows duration_seconds as a number with minimum 0.5 and maximum 10.
- This makes the current Replicate Wan 2.2 first/last-frame model a strong fit for Video Story's timing model.
Live-verified fal limitations:
- fal-ai/wan-25-preview/text-to-video and /image-to-video expose duration enums of only 5 or 10
- fal-ai/kling-video/v2.5-turbo/pro/image-to-video exposes only 5 or 10
- fal-ai/veo3.1/first-last-frame-to-video exposes 4s, 6s, 8s
- fal-ai/minimax/hailuo-02/standard/image-to-video exposes 6 or 10
- fal-ai/kling-video/o1/standard/image-to-video and o3/standard/image-to-video allow more durations, but still only integer-second enums
Practical rule:
- If the app must preserve audio-aligned durations like 0.5s, 1.5s, 2.5s, etc., most current fal video endpoints are a bad direct fit without padding, retiming, or export-time trimming hacks.
- For Video Story, treat Replicate Wan 2.2 FLF as the current benchmark unless a newer model is verified to accept numeric fractional durations.
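The duration-fit rule can be made mechanical. The sketch below encodes the duration constraints noted above as either an enum or a numeric range and checks whether an audio-derived clip length is accepted directly or would need padding. The `DURATION_SPECS` structure and helper names are hypothetical, and the `replicate:` key prefix is only a local labeling convention.

```python
DURATION_SPECS = {
    "replicate:lucataco/wan-2.2-first-last-frame": {"min": 0.5, "max": 10},
    "fal-ai/wan-25-preview/image-to-video": {"enum": [5, 10]},
    "fal-ai/veo3.1/first-last-frame-to-video": {"enum": [4, 6, 8]},
    "fal-ai/minimax/hailuo-02/standard/image-to-video": {"enum": [6, 10]},
}


def accepts(spec, seconds):
    """True if the model can be asked for exactly this duration."""
    if "enum" in spec:
        return seconds in spec["enum"]
    return spec["min"] <= seconds <= spec["max"]


def padded_duration(spec, seconds):
    """Smallest allowed duration that is at least `seconds`, or None if too long.

    A result larger than `seconds` means the clip must be trimmed afterwards.
    """
    if "enum" in spec:
        longer = [d for d in spec["enum"] if d >= seconds]
        return min(longer) if longer else None
    if seconds > spec["max"]:
        return None
    return max(seconds, spec["min"])


target = 1.5  # an audio-aligned clip length in seconds
for name, spec in DURATION_SPECS.items():
    print(name, accepts(spec, target), padded_duration(spec, target))
```

For a 1.5 s target, only the numeric-range model accepts the duration as-is; the enum-based endpoints would generate several seconds of footage to be trimmed.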
Live-verified examples:
- fal-ai/flux-pro/kontext OpenAPI requires image_url
- fal-ai/qwen-image requires only prompt
- fal-ai/qwen-image-2/edit and fal-ai/qwen-image-2/pro/edit require prompt + image_urls
- fal-ai/nano-banana/edit, nano-banana-2/edit, and nano-banana-pro/edit require prompt + image_urls
- fal-ai/flux-2-pro/edit requires prompt + image_urls
Practical rule:
- Do not assume a model marketed as "can generate new images from text" is a drop-in text-only endpoint for production. Verify whether the schema actually requires an image field.
- For Video Story, flux-pro/kontext is best treated as an edit model, not a pure text-only reference generator.
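The check in the practical rule can be sketched as a classifier over a schema's required fields. The `REQUIRED` table below lists only the requirements stated in the live-verified examples above; `classify` and `IMAGE_FIELDS` are hypothetical names, and the set of image field names is an assumption that may need extending for other providers.

```python
IMAGE_FIELDS = {"image_url", "image_urls"}


def classify(required_fields):
    """Label an endpoint by what its schema requires, not by its marketing copy."""
    required = set(required_fields)
    if required & IMAGE_FIELDS:
        return "edit"
    if "prompt" in required:
        return "text-only"
    return "unknown"


# Required fields as stated in the live-verified examples above.
REQUIRED = {
    "fal-ai/flux-pro/kontext": ["image_url"],
    "fal-ai/qwen-image": ["prompt"],
    "fal-ai/qwen-image-2/edit": ["prompt", "image_urls"],
    "fal-ai/flux-2-pro/edit": ["prompt", "image_urls"],
}

for endpoint, fields in REQUIRED.items():
    print(endpoint, "->", classify(fields))
```

Running this against the `required` list pulled from each endpoint's OpenAPI schema avoids wiring an edit model into a text-only slot.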
Separate decisions into two different jobs:
Live-verified candidates for text-only reference generation:
- fal-ai/qwen-image — text-only, $0.02 / megapixel
- fal-ai/bytedance/seedream/v4/text-to-image — text-only, $0.03 / image
- fal-ai/nano-banana — text-only, about $0.039 / image
- replicate.com/black-forest-labs/flux-1.1-pro — text-only with optional composition guidance, $0.04 / output image
- replicate.com/black-forest-labs/flux-2-pro — text-only or multi-ref, pricing is mixed: $0.015 / run + $0.015 / input MP + $0.015 / output MP
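Per-megapixel and mixed pricing are harder to compare than flat per-image prices, so a small estimator helps. The sketch below uses the rates listed above and assumes billing is linear in megapixels, with a megapixel taken as width × height / 1,000,000 and no provider-side rounding. Those billing details are assumptions to verify on the provider pages, and the function names are hypothetical.

```python
def megapixels(width, height):
    """Image size in megapixels (assumed definition: pixels / 1,000,000)."""
    return width * height / 1_000_000


def qwen_image_cost(width, height, rate_per_mp=0.02):
    """fal-ai/qwen-image estimate at $0.02 / megapixel."""
    return rate_per_mp * megapixels(width, height)


def flux2_pro_replicate_cost(output_mp, input_mp=0.0,
                             per_run=0.015, per_input_mp=0.015, per_output_mp=0.015):
    """Replicate flux-2-pro estimate: run fee plus input and output megapixel fees."""
    return per_run + per_input_mp * input_mp + per_output_mp * output_mp


# Text-only 1 MP render on each, then a FLUX.2 Pro render with 2 MP of reference input.
print(round(qwen_image_cost(1000, 1000), 4))
print(round(flux2_pro_replicate_cost(1.0), 4))
print(round(flux2_pro_replicate_cost(1.0, input_mp=2.0), 4))
```

Under these assumptions, a 1 MP text-only FLUX.2 Pro image costs about $0.03, and each megapixel of reference input adds $0.015.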
Live-verified candidates for reference-guided (image-conditioned) generation:
- replicate.com/black-forest-labs/flux-2-pro — up to 8 reference images on API
- fal-ai/flux-2-pro/edit — requires image_urls, priced from $0.03 for first output MP plus extra MP charges
- fal-ai/qwen-image-2/edit — requires image_urls with 1-3 images, $0.035 / image
- fal-ai/qwen-image-2/pro/edit — requires image_urls with 1-3 images, $0.075 / image
- fal-ai/nano-banana/edit — requires image_urls, about $0.039 / image
- fal-ai/nano-banana-2/edit — requires image_urls, $0.08 / image base with resolution/thinking surcharges
- fal-ai/nano-banana-pro/edit — requires image_urls, $0.15 / image base
- fal-ai/flux-pro/kontext / kontext/max — single-image edit flow, not multi-ref compositing
The current app implementation is not just "Qwen" in one uniform sense:
- text-only references go through fal-ai/qwen-image
- reference-guided shots go through fal-ai/qwen-image-2/pro/edit
This matters because:
- pricing is different between the text-only and edit legs
- ref-count limits are different from some local registry assumptions
- quality/capability discussions should distinguish qwen-image from qwen-image-2/pro/edit
If the user asks which models are best for Video Story specifically, use this default reasoning:
- Best current video fit: Replicate Wan 2.2 FLF, because fractional duration support beats today's fal video enums
- Best budget text-only references: fal Qwen Image
- Best likely upgrade candidate for pure text-only references: fal Seedream V4
- Best current multi-reference frame candidate on paper: Replicate FLUX.2 Pro (8 refs) or fal Nano Banana 2 / Pro when very high reference count matters
- Best single-reference edit model: fal Flux Kontext / Kontext Max
- Do not recommend a fal video model as the primary Video Story generator unless its duration API is verified to support numeric fractional timing
If you use this skill and discover renamed endpoints, broken pricing assumptions, or new provider pages, patch this skill immediately and update references/inventory.md.