venice-audio-speech

Venice TTS (/audio/speech)

POST /api/v1/audio/speech converts text to an audio stream or file. The endpoint is OpenAI-compatible: the OpenAI SDK's audio.speech.create() works as a drop-in once baseURL points at Venice.

Use when

For music generation (lyrics + instrumental), see venice-audio-music. For transcription (audio → text), see venice-audio-transcription.

Minimal request

curl https://api.venice.ai/api/v1/audio/speech \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-xai-v1",
    "voice": "eve",
    "input": "Hello, welcome to Venice Voice.",
    "response_format": "mp3",
    "speed": 1.0,
    "streaming": false
  }' --output hello.mp3

Response is the raw audio (Content-Type matches response_format).

Request schema

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| input | string | | Required. Up to 4096 characters. |
| model | enum | tts-kokoro (OpenAPI schema default) | See model list below. tts-xai-v1 is the recommended frontier default; pick the model that fits your voice + language needs. |
| voice | enum | model-specific (e.g. eve for tts-xai-v1) | Voice is model-specific; wrong combo = 400. See voice families. |
| response_format | mp3 / opus / aac / flac / wav / pcm | mp3 | pcm returns 24 kHz signed-16 LE for pipelines. |
| speed | number | 1.0 | Range 0.25–4.0. |
| streaming | bool | false | true → streamed sentence-by-sentence as audio continues to generate. |
| language | string | | Optional hint. Accepted form depends on model (Qwen 3 = full names like English; xAI / ElevenLabs = ISO 639-1 like en; MiniMax = full names). Unsupported values silently ignored. |
| prompt | string, ≤ 500 | | Emotion / style cue. Only for models with supportsPromptParam (Qwen 3 currently). Examples: "Very happy.", "Sad and slow.". |
| temperature | 0–2 | | Sampling temperature. Only for models with supportsTemperatureParam (Qwen 3, Orpheus, Chatterbox HD). |
| top_p | 0–1 | | Only Qwen 3 currently. |
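
The limits in the request schema can be checked client-side before sending, which avoids a round trip that would end in a 400. The following is a minimal sketch, not part of the Venice API: validateSpeechRequest is a hypothetical helper name, and it only encodes the ranges listed in the schema above. Voice/model compatibility is not checked here.

```javascript
// Allowed values and ranges transcribed from the request schema above.
const RESPONSE_FORMATS = ['mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'];

// Hypothetical helper: returns a list of problems (empty when the body looks valid).
// Note: String.length counts UTF-16 code units, which only approximates "characters".
function validateSpeechRequest(body) {
  const problems = [];
  const inRange = (value, min, max) =>
    typeof value === 'number' && value >= min && value <= max;

  if (typeof body.input !== 'string' || body.input.length === 0) {
    problems.push('input is required');
  } else if (body.input.length > 4096) {
    problems.push('input exceeds 4096 characters');
  }
  if (body.response_format !== undefined &&
      !RESPONSE_FORMATS.includes(body.response_format)) {
    problems.push('unknown response_format');
  }
  if (body.speed !== undefined && !inRange(body.speed, 0.25, 4.0)) {
    problems.push('speed must be within 0.25 to 4.0');
  }
  if (typeof body.prompt === 'string' && body.prompt.length > 500) {
    problems.push('prompt exceeds 500 characters');
  }
  if (body.temperature !== undefined && !inRange(body.temperature, 0, 2)) {
    problems.push('temperature must be within 0 to 2');
  }
  if (body.top_p !== undefined && !inRange(body.top_p, 0, 1)) {
    problems.push('top_p must be within 0 to 1');
  }
  return problems;
}
```

Returning a list rather than throwing lets a caller report every problem at once.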

Models

| Model ID | Family | Highlights |
| --- | --- | --- |
| tts-xai-v1 | xAI | Recommended default. Conversational style, ISO 639-1 language hints. |
| tts-kokoro | Kokoro | OpenAPI schema default. Multilingual, many voices across languages. |
| tts-qwen3-0-6b / tts-qwen3-1-7b | Qwen 3 | Emotion control via prompt, temperature, top_p. |
| tts-inworld-1-5-max | Inworld | Character-driven voices (Craig, Ashley, …). |
| tts-chatterbox-hd | Chatterbox | HD voices (Aurora, Blade, …), temperature. |
| tts-orpheus | Orpheus | Conversational (tara, leah, jess, leo, …), temperature. |
| tts-elevenlabs-turbo-v2-5 | ElevenLabs Turbo | Rachel, Aria, Charlotte, Roger, … |
| tts-minimax-speech-02-hd | MiniMax | WiseWoman, DeepVoiceMan, … |
| tts-gemini-3-1-flash | Gemini Flash | Star-named voices (Achernar, Achird, Zephyr, …). |

Always inspect the entry for your model in GET /models?type=tts: model_spec.voices is the authoritative voice list. Per-model toggles like supportsPromptParam, supportsTemperatureParam, supportsTopPParam live on the internal model definitions but are not currently exposed on /models, so treat the request schema above (prompt, temperature, top_p) as the support matrix.
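
Checking a voice against model_spec.voices before sending a request might look like the sketch below. The response envelope is an assumption: this document only names model_spec.voices, so the data array and the id field are guessed from OpenAI-style list responses, and both helper names are hypothetical.

```javascript
// Assumed response shape (not confirmed by this document):
// { data: [ { id: 'tts-xai-v1', model_spec: { voices: ['eve'] } } ] }
function voicesForModel(modelsResponse, modelId) {
  const entry = (modelsResponse.data ?? []).find((m) => m.id === modelId);
  // An unknown model or a missing voices field yields an empty list.
  return entry?.model_spec?.voices ?? [];
}

// True only when the voice appears in the chosen model's list.
function isVoiceSupported(modelsResponse, modelId, voice) {
  return voicesForModel(modelsResponse, modelId).includes(voice);
}
```

Pass in the parsed JSON body of GET /models?type=tts; a false result predicts the 400 described for mismatched voice/model pairs.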

Voice families (by prefix)

Pass a voice that isn't in the chosen model's list and you get 400.

Streaming

{
  "model": "tts-xai-v1",
  "voice": "eve",
  "input": "Hello, this is a long document to narrate. ...",
  "streaming": true,
  "response_format": "mp3"
}

With streaming: true, the HTTP body is a chunked audio stream. Decode it as it arrives, which helps latency-sensitive UIs. response_format: pcm pairs well with the browser Web Audio API for raw playback.
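
For pcm output, each chunk of the stream has to be turned into samples before playback. The sketch below converts signed 16-bit little-endian bytes (the format stated in the request schema) into floats in the range the Web Audio API expects. createPcm16Decoder is a hypothetical name, and the carry-over logic assumes that a network chunk may end in the middle of a two-byte sample.

```javascript
// Returns a stateful decode(chunk) function. chunk is a Uint8Array of raw PCM bytes.
function createPcm16Decoder() {
  let leftover = null; // dangling first byte of a sample split across chunks

  return function decode(chunk) {
    let bytes = chunk;
    if (leftover !== null) {
      // Prepend the byte carried over from the previous chunk.
      bytes = new Uint8Array(chunk.length + 1);
      bytes[0] = leftover;
      bytes.set(chunk, 1);
      leftover = null;
    }
    const sampleCount = Math.floor(bytes.length / 2);
    if (bytes.length % 2 === 1) {
      leftover = bytes[bytes.length - 1];
    }
    const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
    const samples = new Float32Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
      // true selects little-endian; dividing by 32768 maps to [-1, 1).
      samples[i] = view.getInt16(i * 2, true) / 32768;
    }
    return samples;
  };
}
```

In a browser, the resulting Float32Array could be copied into an AudioBuffer created with a 24000 Hz sample rate. The channel count is not stated in this document, so mono is an assumption to verify against real output.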

OpenAI SDK

import OpenAI from 'openai'
import fs from 'node:fs/promises'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const mp3 = await client.audio.speech.create({
  model: 'tts-xai-v1',
  voice: 'eve',
  input: 'Hello from Venice.',
  response_format: 'mp3',
})

await fs.writeFile('hello.mp3', Buffer.from(await mp3.arrayBuffer()))

Emotion / style (Qwen 3 only)

{
  "model": "tts-qwen3-1-7b",
  "voice": "Vivian",
  "input": "We did it!",
  "prompt": "Excited and energetic.",
  "temperature": 0.9,
  "top_p": 0.95
}

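For other families, emotion comes from the voice choice itself (e.g. Inworld Hades vs Pixie). Unsupported parameters are silently ignored: prompt and top_p everywhere outside Qwen 3, and temperature on every model except Qwen 3, Orpheus, and Chatterbox HD.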
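
Because unsupported parameters are ignored rather than rejected, it can help to drop them client-side so the request reflects what the model will actually use. This is a sketch under assumptions: the support matrix is transcribed from the request schema, matching by model-ID prefix is a convenience chosen here, and dropUnsupportedParams is a hypothetical helper name.

```javascript
// Which model-ID prefixes accept each optional parameter, per the request schema.
// Prefix matching is an assumption of this sketch, not documented API behaviour.
const PARAM_SUPPORT = {
  prompt: ['tts-qwen3-'],
  temperature: ['tts-qwen3-', 'tts-orpheus', 'tts-chatterbox-hd'],
  top_p: ['tts-qwen3-'],
};

// Returns a copy of the body without parameters the chosen model would ignore.
function dropUnsupportedParams(body) {
  const cleaned = { ...body };
  const model = String(body.model);
  for (const [param, prefixes] of Object.entries(PARAM_SUPPORT)) {
    const supported = prefixes.some((prefix) => model.startsWith(prefix));
    if (!supported) {
      delete cleaned[param];
    }
  }
  return cleaned;
}
```

The original object is left untouched, so the same base settings can be reused across models.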

Errors

| Code | Meaning |
| --- | --- |
| 400 | Voice not valid for the chosen model, input too long (>4096 characters), or language hint rejected by a strict model. |
| 401 | Auth / Pro-only model. |
| 402 | Insufficient balance. |
| 429 | Rate limited. |
| 500 / 503 | Inference / capacity issue; retry with jitter. |
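
One way to implement "retry with jitter" is exponential backoff with full jitter, sketched below. The base delay, cap, and attempt count are illustrative defaults rather than values recommended by Venice, and all three function names are hypothetical. Only 500 and 503 are treated as retryable because those are the codes this document says to retry. Retrying 429 as well would be a separate decision.

```javascript
// Status codes this document says to retry.
const RETRYABLE_STATUSES = new Set([500, 503]);

function isRetryable(status) {
  return RETRYABLE_STATUSES.has(status);
}

// Full jitter: a uniform random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(attempt, { baseMs = 500, capMs = 8000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// send is any function returning a promise of an object with a numeric status.
async function withRetry(send, { maxAttempts = 4, sleep, ...backoff } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    const response = await send();
    const lastAttempt = attempt + 1 >= maxAttempts;
    if (!isRetryable(response.status) || lastAttempt) {
      return response; // success, non-retryable error, or out of attempts
    }
    await wait(backoffDelayMs(attempt, backoff));
  }
}
```

With a runtime that provides fetch, a call could be wrapped as withRetry(() => fetch(url, init)). Randomizing the delay keeps many clients from retrying in lockstep after a capacity blip.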

Gotchas