`POST /api/v1/audio/speech` converts text to an audio stream or file. It is OpenAI-compatible: the OpenAI SDK's `audio.speech.create()` works as a drop-in.
For music generation (lyrics + instrumental), see venice-audio-music. For transcription (audio → text), see venice-audio-transcription.
```bash
curl https://api.venice.ai/api/v1/audio/speech \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-xai-v1",
    "voice": "eve",
    "input": "Hello, welcome to Venice Voice.",
    "response_format": "mp3",
    "speed": 1.0,
    "streaming": false
  }' --output hello.mp3
```
Response is the raw audio (Content-Type matches response_format).
| Field | Type | Default | Notes |
|---|---|---|---|
| `input` | string | (none) | Required. Up to 4096 characters. |
| `model` | enum | `tts-kokoro` (OpenAPI schema default) | See the model list below. `tts-xai-v1` is the recommended frontier default; pick the model that fits your voice and language needs. |
| `voice` | enum | model-specific (e.g. `eve` for `tts-xai-v1`) | Voices are model-specific; a mismatched voice/model pair returns 400. See voice families. |
| `response_format` | `mp3` / `opus` / `aac` / `flac` / `wav` / `pcm` | `mp3` | `pcm` returns 24 kHz signed 16-bit little-endian samples for pipelines. |
| `speed` | number | `1.0` | Range 0.25–4.0. |
| `streaming` | bool | `false` | `true` streams audio sentence by sentence while generation continues. |
| `language` | string | (none) | Optional hint. The accepted form depends on the model (Qwen 3 = full names like `English`; xAI / ElevenLabs = ISO 639-1 codes like `en`; MiniMax = full names). Unsupported values are silently ignored. |
| `prompt` | string, ≤ 500 | (none) | Emotion / style cue. Only for models with `supportsPromptParam` (Qwen 3 currently). Examples: `"Very happy."`, `"Sad and slow."`. |
| `temperature` | 0–2 | (none) | Sampling temperature. Only for models with `supportsTemperatureParam` (Qwen 3, Orpheus, Chatterbox HD). |
| `top_p` | 0–1 | (none) | Only Qwen 3 currently. |
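The limits in the field table can be checked client-side before a request is sent. The following is a minimal sketch, not part of the API: the `SpeechRequest` interface and `validateSpeechRequest` function are hypothetical names, and it assumes the character limits are counted the way JavaScript counts string length.

```typescript
// Hypothetical request type mirroring the field table above.
interface SpeechRequest {
  input: string;
  model?: string;
  voice?: string;
  response_format?: "mp3" | "opus" | "aac" | "flac" | "wav" | "pcm";
  speed?: number;
  streaming?: boolean;
  language?: string;
  prompt?: string;
  temperature?: number;
  top_p?: number;
}

// Returns a list of problems; an empty list means the request is within the
// documented limits. It does not check whether the voice matches the model.
function validateSpeechRequest(req: SpeechRequest): string[] {
  const problems: string[] = [];
  const inRange = (v: number, lo: number, hi: number): boolean => v >= lo && v <= hi;

  if (req.input.length === 0) problems.push("input is required");
  // Assumption: the 4096 and 500 limits are measured like String.length.
  if (req.input.length > 4096) problems.push("input exceeds 4096 characters");
  if (req.prompt !== undefined && req.prompt.length > 500) {
    problems.push("prompt exceeds 500 characters");
  }
  if (req.speed !== undefined && !inRange(req.speed, 0.25, 4.0)) {
    problems.push("speed must be within 0.25-4.0");
  }
  if (req.temperature !== undefined && !inRange(req.temperature, 0, 2)) {
    problems.push("temperature must be within 0-2");
  }
  if (req.top_p !== undefined && !inRange(req.top_p, 0, 1)) {
    problems.push("top_p must be within 0-1");
  }
  return problems;
}
```

Running a check like this first can catch an out-of-range value before it costs a round trip, though the server remains the authority on what it accepts.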
| Model ID | Family | Highlights |
|---|---|---|
| `tts-xai-v1` | xAI | Recommended default. Conversational style, ISO 639-1 language hints. |
| `tts-kokoro` | Kokoro | OpenAPI schema default. Multilingual, many voices across languages. |
| `tts-qwen3-0-6b` / `tts-qwen3-1-7b` | Qwen 3 | Emotion control via `prompt`, `temperature`, `top_p`. |
| `tts-inworld-1-5-max` | Inworld | Character-driven voices (Craig, Ashley, …). |
| `tts-chatterbox-hd` | Chatterbox | HD voices (Aurora, Blade, …), `temperature`. |
| `tts-orpheus` | Orpheus | Conversational (tara, leah, jess, leo, …), `temperature`. |
| `tts-elevenlabs-turbo-v2-5` | ElevenLabs Turbo | Rachel, Aria, Charlotte, Roger, … |
| `tts-minimax-speech-02-hd` | MiniMax | WiseWoman, DeepVoiceMan, … |
| `tts-gemini-3-1-flash` | Gemini Flash | Star-named voices (Achernar, Achird, Zephyr, …). |
Always inspect the entry for your model in `GET /models?type=tts`; `model_spec.voices` is the authoritative voice list. Per-model toggles such as `supportsPromptParam`, `supportsTemperatureParam`, and `supportsTopPParam` live on the internal model definitions but are not currently exposed on `/models`, so treat the request fields above (`prompt`, `temperature`, `top_p`) as the support matrix.
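A voice can be checked against the authoritative list before a request is sent. The sketch below works on an already-parsed `GET /models?type=tts` response. The envelope shape (`data` array, `id` field) is an assumption; only `model_spec.voices` is named in this document. `voicesFor` and `isVoiceAllowed` are hypothetical helper names.

```typescript
// Assumed response shape: { data: [{ id, model_spec: { voices } }] }.
// Only model_spec.voices is confirmed by the text above.
interface TtsModelEntry {
  id: string;
  model_spec?: { voices?: string[] };
}

interface ModelsResponse {
  data: TtsModelEntry[];
}

// Returns the voice list for a model, or an empty list if the model is unknown.
function voicesFor(models: ModelsResponse, modelId: string): string[] {
  const matches = models.data.filter((m) => m.id === modelId);
  if (matches.length === 0) return [];
  return matches[0].model_spec?.voices ?? [];
}

// Exact string comparison, so "EVE" does not match "eve".
function isVoiceAllowed(models: ModelsResponse, modelId: string, voice: string): boolean {
  return voicesFor(models, modelId).indexOf(voice) !== -1;
}
```

Fetching the list once and caching it is usually enough, since a mismatched voice otherwise only surfaces as a 400 at request time.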
| Family | Voices |
|---|---|
| Kokoro | Prefix-coded: `af_*`, `am_*` (American female / male); `bf_*`, `bm_*` (British female / male); `zf_*`, `zm_*` (Chinese); `ff_*`, `hf_*`, `hm_*`, `if_*`, `im_*`, `jf_*`, `jm_*`, `pf_*`, `pm_*`, `ef_*`, `em_*` (French, Hindi, Italian, Japanese, Portuguese, Spanish). Examples: `af_sky`, `af_bella`, `am_adam`, `bm_george`, `zf_xiaoxiao`. |
| Qwen 3 | `Vivian`, `Serena`, `Ono_Anna`, `Sohee`, `Uncle_Fu`, `Dylan`, `Eric`, `Ryan`, `Aiden` |
| xAI | `eve`, `ara`, `rex`, `sal`, `leo` |
| Orpheus | `tara`, `leah`, `jess`, `mia`, `zoe`, `dan`, `zac` |
| Inworld | `Craig`, `Ashley`, `Olivia`, `Sarah`, `Elizabeth`, `Priya`, `Alex`, `Edward`, `Theodore`, `Ronald`, `Mark`, `Hades`, `Luna`, `Pixie` |
| Chatterbox | `Aurora`, `Britney`, `Siobhan`, `Vicky`, `Blade`, `Carl`, `Cliff`, `Richard`, `Rico` |
| ElevenLabs Turbo | `Rachel`, `Aria`, `Laura`, `Charlotte`, `Alice`, `Matilda`, `Jessica`, `Lily`, `Roger`, `Charlie`, `George`, `Callum`, `River`, `Liam`, `Will`, `Chris`, `Brian`, `Daniel`, `Bill` |
| MiniMax | `WiseWoman`, `FriendlyPerson`, `InspirationalGirl`, `CalmWoman`, `LivelyGirl`, `LovelyGirl`, `SweetGirl`, `ExuberantGirl`, `DeepVoiceMan`, `CasualGuy`, `PatientMan`, `YoungKnight`, `DeterminedMan`, `ImposingManner`, `ElegantMan` |
| Gemini Flash | `Achernar`, `Achird`, `Algenib`, `Algieba`, `Alnilam`, `Aoede`, `Autonoe`, `Callirrhoe`, `Charon`, `Despina`, `Enceladus`, `Erinome`, `Fenrir`, `Gacrux`, `Iapetus`, `Kore`, `Laomedeia`, `Leda`, `Orus`, `Pulcherrima`, `Puck`, `Rasalgethi`, `Sadachbia`, `Sadaltager`, `Schedar`, `Sulafat`, `Umbriel`, `Vindemiatrix`, `Zephyr`, `Zubenelgenubi` |

Passing a voice that isn't in the chosen model's list returns 400.
```json
{
  "model": "tts-xai-v1",
  "voice": "eve",
  "input": "Hello, this is a long document to narrate. ...",
  "streaming": true,
  "response_format": "mp3"
}
```
With `streaming: true`, the HTTP body is a chunked audio stream. Decode it as it arrives, which helps latency-sensitive UIs. `response_format: pcm` pairs well with the browser Web Audio API for raw playback.
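For raw playback, streamed `pcm` chunks have to be turned into floating-point samples before they can fill a Web Audio buffer. The sketch below assumes the 24 kHz signed 16-bit little-endian format from the field table and a single channel, which this document does not state. `Pcm16Decoder` is a hypothetical helper that also handles a sample split across two network chunks.

```typescript
// Converts streamed PCM chunks (signed 16-bit little-endian) into Float32
// samples in [-1, 1), the sample format Web Audio buffers use.
class Pcm16Decoder {
  // Dangling first byte of a sample that was split across two chunks.
  private leftover: number | null = null;

  decode(chunk: Uint8Array): Float32Array {
    let bytes = chunk;
    if (this.leftover !== null) {
      // Prepend the dangling byte so the split sample is reassembled.
      bytes = new Uint8Array(chunk.length + 1);
      bytes[0] = this.leftover;
      bytes.set(chunk, 1);
      this.leftover = null;
    }

    const sampleCount = Math.floor(bytes.length / 2);
    if (bytes.length % 2 === 1) {
      // Odd byte count: keep the last byte for the next call.
      this.leftover = bytes[bytes.length - 1];
    }

    const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
    const out = new Float32Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
      out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
    }
    return out;
  }
}
```

In a browser, each decoded array could be copied into an `AudioBuffer` created with a 24000 Hz sample rate and scheduled on an `AudioContext`; the scheduling logic is left out here because it depends on the player design.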
```typescript
import OpenAI from 'openai'
import fs from 'node:fs/promises'

// Point the OpenAI SDK at the Venice base URL.
const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const mp3 = await client.audio.speech.create({
  model: 'tts-xai-v1',
  voice: 'eve',
  input: 'Hello from Venice.',
  response_format: 'mp3',
})

// The SDK returns a fetch-style Response; write its bytes to disk.
await fs.writeFile('hello.mp3', Buffer.from(await mp3.arrayBuffer()))
```
```json
{
  "model": "tts-qwen3-1-7b",
  "voice": "Vivian",
  "input": "We did it!",
  "prompt": "Excited and energetic.",
  "temperature": 0.9,
  "top_p": 0.95
}
```
For other families, emotion comes from the voice choice itself (e.g. Inworld `Hades` vs `Pixie`). Models that do not support `prompt`, `temperature`, or `top_p` silently ignore them.
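Because unsupported tuning fields are ignored without an error, it can help to drop them client-side so a request body reflects what the model will actually use. The sketch below transcribes the support matrix from the field table. Matching on model-ID prefix is an assumption, and `TunedRequest`, `supportedTunables`, and `stripUnsupported` are hypothetical names.

```typescript
type Tunable = "prompt" | "temperature" | "top_p";

interface TunedRequest {
  model: string;
  voice: string;
  input: string;
  prompt?: string;
  temperature?: number;
  top_p?: number;
}

// Support matrix from the field table: prompt and top_p are Qwen 3 only;
// temperature is Qwen 3, Orpheus, and Chatterbox HD.
// Assumption: a model's family can be recognised by its ID prefix.
const SUPPORT: Array<{ prefix: string; fields: Tunable[] }> = [
  { prefix: "tts-qwen3-", fields: ["prompt", "temperature", "top_p"] },
  { prefix: "tts-orpheus", fields: ["temperature"] },
  { prefix: "tts-chatterbox-hd", fields: ["temperature"] },
];

function supportedTunables(model: string): Tunable[] {
  for (const row of SUPPORT) {
    if (model.indexOf(row.prefix) === 0) return row.fields;
  }
  return [];
}

// Returns a copy of the body without the tunables the model would ignore.
function stripUnsupported(body: TunedRequest): TunedRequest {
  const allowed = supportedTunables(body.model);
  const copy: TunedRequest = { ...body };
  const tunables: Tunable[] = ["prompt", "temperature", "top_p"];
  for (const key of tunables) {
    if (allowed.indexOf(key) === -1) delete copy[key];
  }
  return copy;
}
```

A hard-coded matrix like this goes stale as models change, so it is best treated as a convenience rather than a source of truth.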
| Code | Meaning |
|---|---|
| 400 | Invalid voice for the chosen model, input too long (> 4096 characters), or language hint rejected by a strict model. |
| 401 | Authentication failure, or the model is Pro-only. |
| 402 | Insufficient balance. |
| 429 | Rate limited. |
| 500 / 503 | Inference or capacity issue; retry with jitter. |
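"Retry with jitter" can be sketched as exponential backoff with a randomized delay. This is one common scheme (full jitter), not a policy prescribed by the API. The base and cap values are illustrative, and treating 429 as retryable alongside 500 / 503 is an assumption. `isRetryable` and `backoffDelayMs` are hypothetical names.

```typescript
// 500 and 503 are marked retryable in the table above.
// Assumption: 429 is also worth retrying after a delay.
const RETRYABLE_STATUSES = [429, 500, 503];

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.indexOf(status) !== -1;
}

// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the delay can be computed deterministically in tests.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

A caller would loop over attempts, stop on the first non-retryable status, and wait `backoffDelayMs(attempt)` milliseconds between tries; a small attempt limit keeps a persistent outage from stalling the caller.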
- `input` has a hard cap of 4096 characters. For books and other long content, split on sentence boundaries and concatenate the audio client-side.
- `streaming: true` and SDKs: some OpenAI SDK versions don't expose streaming for `audio.speech.create`; call the REST endpoint directly and consume the HTTP body.
- `speed` compounds with the model's internal speech rate. Extreme values (0.25, 4.0) often sound unnatural; keep within 0.8–1.3 for narration.
- Voice IDs are case-sensitive (`eve` ≠ `EVE`, `af_sky` ≠ `AF_SKY`).
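Splitting long text on sentence boundaries can be sketched as follows. `splitForTts` is a hypothetical helper. It treats `.`, `!`, and `?` as sentence terminators, which is a simplification that will mis-split abbreviations and decimals, and it assumes the 4096 limit matches JavaScript string length.

```typescript
// Splits text into chunks of at most maxChars, preferring sentence boundaries.
// Joining the chunks reproduces the original text exactly.
function splitForTts(text: string, maxChars: number = 4096): string[] {
  // A sentence runs through its terminators (. ! ?) and trailing whitespace;
  // the second alternative catches a final fragment with no terminator.
  const sentences: string[] = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    let piece = sentence;
    // Last resort: a single sentence longer than the cap is cut mid-sentence.
    while (piece.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(piece.slice(0, maxChars));
      piece = piece.slice(maxChars);
    }
    // Start a new chunk when the next sentence would overflow the current one.
    if (current.length + piece.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current += piece;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk then becomes one request. When joining the results, headerless output such as `pcm` can simply be appended, whereas container formats such as `wav` carry per-file headers and may need proper re-muxing.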