venice-audio-speech

Venice TTS (`/audio/speech`)

POST /api/v1/audio/speech converts text to an audio stream or file. The endpoint is OpenAI-compatible, so the OpenAI SDK's audio.speech.create() works as a drop-in once the client's base URL points at Venice.

Use when

You want narration, voice replies, or UI audio from text.
You need a specific voice family (ElevenLabs, Kokoro, xAI, Qwen 3, Orpheus, Chatterbox, MiniMax, Inworld, Gemini Flash).
You want streaming audio returned sentence-by-sentence.
You need style/emotion control on supported models.

For music generation (lyrics + instrumental), see venice-audio-music. For transcription (audio → text), see venice-audio-transcription.

Minimal request

curl https://api.venice.ai/api/v1/audio/speech \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-xai-v1",
    "voice": "eve",
    "input": "Hello, welcome to Venice Voice.",
    "response_format": "mp3",
    "speed": 1.0,
    "streaming": false
  }' --output hello.mp3

Response is the raw audio (Content-Type matches response_format).

Request schema

Field	Type	Default	Notes
`input`	string	—	Required. Up to 4096 characters.
`model`	enum	`tts-kokoro` (OpenAPI schema default)	See model list below. `tts-xai-v1` is the recommended frontier default; pick the model that fits your voice + language needs.
`voice`	enum	model-specific (e.g. `eve` for `tts-xai-v1`)	Voice is model-specific — wrong combo = `400`. See voice families.
`response_format`	`mp3` / `opus` / `aac` / `flac` / `wav` / `pcm`	`mp3`	`pcm` returns 24 kHz signed-16 LE for pipelines.
`speed`	number	`1.0`	Range `0.25–4.0`.
`streaming`	bool	`false`	`true` → streamed sentence-by-sentence as audio continues to generate.
`language`	string	—	Optional hint. Accepted form depends on model (Qwen 3 = full names like `English`; xAI / ElevenLabs = ISO 639-1 like `en`; MiniMax = full names). Unsupported values silently ignored.
`prompt`	string, ≤ 500	—	Emotion / style cue. Only for models with `supportsPromptParam` (Qwen 3 currently). Examples: "Very happy.", "Sad and slow.".
`temperature`	0–2	—	Sampling temperature. Only for models with `supportsTemperatureParam` (Qwen 3, Orpheus, Chatterbox HD).
`top_p`	0–1	—	Only Qwen 3 currently.
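
The limits in the table can be checked client-side before a request is sent. The following is a minimal sketch, not part of the Venice API: `buildSpeechRequest` and `inRange` are hypothetical helper names, and counting `input` length with JavaScript's `.length` (UTF-16 code units) is an assumption about how the server counts characters.

```javascript
// Allowed values and ranges copied from the request schema table.
const RESPONSE_FORMATS = ['mp3', 'opus', 'aac', 'flac', 'wav', 'pcm'];

// True only for real numbers inside [min, max]; rejects NaN and strings.
function inRange(value, min, max) {
  return typeof value === 'number' && value >= min && value <= max;
}

// Hypothetical helper: validates the fields and returns a JSON-ready body.
function buildSpeechRequest(fields) {
  const { input, model, voice, language, prompt, temperature, top_p } = fields;
  const { response_format = 'mp3', speed = 1.0, streaming = false } = fields;

  if (typeof input !== 'string' || input.length === 0) {
    throw new RangeError('input is required');
  }
  // Assumption: .length approximates the server's character count.
  if (input.length > 4096) {
    throw new RangeError('input exceeds 4096 characters');
  }
  if (!RESPONSE_FORMATS.includes(response_format)) {
    throw new RangeError(`unknown response_format: ${response_format}`);
  }
  if (!inRange(speed, 0.25, 4.0)) {
    throw new RangeError('speed must be within 0.25 and 4.0');
  }

  const body = { input, response_format, speed, streaming };
  if (model !== undefined) body.model = model;
  if (voice !== undefined) body.voice = voice;
  if (language !== undefined) body.language = language;

  if (prompt !== undefined) {
    if (prompt.length > 500) throw new RangeError('prompt exceeds 500');
    body.prompt = prompt;
  }
  if (temperature !== undefined) {
    if (!inRange(temperature, 0, 2)) throw new RangeError('temperature must be 0 to 2');
    body.temperature = temperature;
  }
  if (top_p !== undefined) {
    if (!inRange(top_p, 0, 1)) throw new RangeError('top_p must be 0 to 1');
    body.top_p = top_p;
  }
  return body;
}
```

The returned object can be passed to `JSON.stringify` as the POST body. Voice/model compatibility is not checked here, since the voice list is model-specific.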

Models

Model ID	Family	Highlights
`tts-xai-v1`	xAI	Recommended default. Conversational style, ISO 639-1 language hints.
`tts-kokoro`	Kokoro	OpenAPI schema default. Multilingual, many voices across languages.
`tts-qwen3-0-6b` / `tts-qwen3-1-7b`	Qwen 3	Emotion control via `prompt`, temperature, top_p.
`tts-inworld-1-5-max`	Inworld	Character-driven voices (Craig, Ashley, …).
`tts-chatterbox-hd`	Chatterbox	HD voices (Aurora, Blade, …), temperature.
`tts-orpheus`	Orpheus	Conversational (tara, leah, jess, leo, …), temperature.
`tts-elevenlabs-turbo-v2-5`	ElevenLabs Turbo	Rachel, Aria, Charlotte, Roger, …
`tts-minimax-speech-02-hd`	MiniMax	WiseWoman, DeepVoiceMan, …
`tts-gemini-3-1-flash`	Gemini Flash	Star-named voices (Achernar, Achird, Zephyr, …).

Always inspect the entry for your model in GET /models?type=tts: model_spec.voices is the authoritative voice list. Per-model toggles such as supportsPromptParam, supportsTemperatureParam, and supportsTopPParam live on the internal model definitions but are not currently exposed on /models, so treat the request schema above (prompt, temperature, top_p) as the support matrix.
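
A voice lookup against that endpoint might look like the sketch below. The helper names are hypothetical, and the payload shape `{ data: [{ id, model_spec: { voices } }] }` is an assumption based on the OpenAI-style list format; only `model_spec.voices` and the `GET /models?type=tts` path come from this document.

```javascript
const VENICE_BASE_URL = 'https://api.venice.ai/api/v1';

// Pure helper: extract one model's voice list from a /models payload.
// Assumed payload shape: { data: [{ id, model_spec: { voices: [...] } }] }.
function voicesForModel(modelsPayload, modelId) {
  const entry = (modelsPayload.data ?? []).find((m) => m.id === modelId);
  return entry?.model_spec?.voices ?? [];
}

// Exact-match check, since voice names are case-sensitive.
function isVoiceAvailable(modelsPayload, modelId, voice) {
  return voicesForModel(modelsPayload, modelId).includes(voice);
}

// Fetches the TTS model list; needs a runtime with global fetch.
async function fetchTtsModels(apiKey) {
  const res = await fetch(`${VENICE_BASE_URL}/models?type=tts`, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`GET /models failed with ${res.status}`);
  return res.json();
}
```

Calling `isVoiceAvailable(await fetchTtsModels(key), 'tts-xai-v1', 'eve')` before sending a speech request is one way to avoid a `400` from a mismatched voice.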

Voice families (by prefix)

Kokoro — lowercase + language/gender prefix:
af_*, am_* — American female / male
bf_*, bm_* — British female / male
zf_*, zm_* — Chinese
ff_*, hf_*, hm_*, if_*, im_*, jf_*, jm_*, pf_*, pm_*, ef_*, em_* — French, Hindi, Italian, Japanese, Portuguese, Spanish
Examples: af_sky, af_bella, am_adam, bm_george, zf_xiaoxiao
Qwen 3 — Vivian, Serena, Ono_Anna, Sohee, Uncle_Fu, Dylan, Eric, Ryan, Aiden
xAI — eve, ara, rex, sal, leo
Orpheus — tara, leah, jess, mia, zoe, dan, zac
Inworld — Craig, Ashley, Olivia, Sarah, Elizabeth, Priya, Alex, Edward, Theodore, Ronald, Mark, Hades, Luna, Pixie
Chatterbox — Aurora, Britney, Siobhan, Vicky, Blade, Carl, Cliff, Richard, Rico
ElevenLabs Turbo — Rachel, Aria, Laura, Charlotte, Alice, Matilda, Jessica, Lily, Roger, Charlie, George, Callum, River, Liam, Will, Chris, Brian, Daniel, Bill
MiniMax — WiseWoman, FriendlyPerson, InspirationalGirl, CalmWoman, LivelyGirl, LovelyGirl, SweetGirl, ExuberantGirl, DeepVoiceMan, CasualGuy, PatientMan, YoungKnight, DeterminedMan, ImposingManner, ElegantMan
Gemini 3 Flash — star names: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi

Pass a voice that isn't in the chosen model's list and you get 400.
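
The Kokoro prefix convention listed above can be decoded mechanically. This is a small sketch with a hypothetical helper name, `describeKokoroVoice`; it only interprets the prefix and does not confirm that a given voice actually exists on the model.

```javascript
// First letter: language or accent, as listed for the Kokoro family.
const KOKORO_LANGUAGES = {
  a: 'American',
  b: 'British',
  z: 'Chinese',
  f: 'French',
  h: 'Hindi',
  i: 'Italian',
  j: 'Japanese',
  p: 'Portuguese',
  e: 'Spanish',
};

// Second letter: gender.
const KOKORO_GENDERS = { f: 'female', m: 'male' };

// Returns { language, gender, name } or null when the prefix is not recognized.
function describeKokoroVoice(voice) {
  const match = /^([a-z])([fm])_(.+)$/.exec(voice);
  if (!match) return null;
  const [, languageCode, genderCode, name] = match;
  const language = KOKORO_LANGUAGES[languageCode];
  if (!language) return null;
  return { language, gender: KOKORO_GENDERS[genderCode], name };
}
```

For instance, `describeKokoroVoice('bm_george')` yields a British male voice named `george`, while an uppercase `AF_SKY` returns `null` because the pattern is case-sensitive.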

Streaming

{
  "model": "tts-xai-v1",
  "voice": "eve",
  "input": "Hello, this is a long document to narrate. ...",
  "streaming": true,
  "response_format": "mp3"
}

With streaming: true, the HTTP body is a chunked audio stream. Decode it as it arrives, which is useful for latency-sensitive UIs. response_format: pcm pairs well with the browser Web Audio API for raw playback.
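
For the `pcm` case, incoming bytes have to be turned into floating-point samples before the Web Audio API can play them. The sketch below assumes the stream is 16-bit signed little-endian at 24 kHz, as the schema states, and treats it as a single sample sequence; the channel count is not specified in this document, so that part is an assumption. `createPcmDecoder` and `streamSpeechPcm` are hypothetical names.

```javascript
// Returns a stateful decoder: Uint8Array chunk in, Float32Array samples out.
// Network chunks can end mid-sample, so an odd trailing byte is carried over.
function createPcmDecoder() {
  let carry = null;
  return function decode(chunk) {
    let bytes = chunk;
    if (carry !== null) {
      bytes = new Uint8Array(chunk.length + 1);
      bytes[0] = carry;
      bytes.set(chunk, 1);
      carry = null;
    }
    const sampleCount = bytes.length >> 1;
    if (bytes.length % 2 === 1) carry = bytes[bytes.length - 1];
    const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
    const samples = new Float32Array(sampleCount);
    for (let i = 0; i < sampleCount; i++) {
      // true = little-endian; dividing by 32768 maps to [-1, 1).
      samples[i] = view.getInt16(i * 2, true) / 32768;
    }
    return samples;
  };
}

// Streams speech and hands decoded samples to onSamples as they arrive.
async function streamSpeechPcm(apiKey, input, onSamples) {
  const res = await fetch('https://api.venice.ai/api/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'tts-xai-v1',
      voice: 'eve',
      input,
      response_format: 'pcm',
      streaming: true,
    }),
  });
  if (!res.ok) throw new Error(`speech request failed with ${res.status}`);
  const decode = createPcmDecoder();
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onSamples(decode(value));
  }
}
```

In a browser, each `Float32Array` could be copied into an `AudioBuffer` created with a 24000 Hz sample rate and scheduled for playback.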

OpenAI SDK

import OpenAI from 'openai'
import fs from 'node:fs/promises'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const mp3 = await client.audio.speech.create({
  model: 'tts-xai-v1',
  voice: 'eve',
  input: 'Hello from Venice.',
  response_format: 'mp3',
})

await fs.writeFile('hello.mp3', Buffer.from(await mp3.arrayBuffer()))

Emotion / style (Qwen 3 only)

{
  "model": "tts-qwen3-1-7b",
  "voice": "Vivian",
  "input": "We did it!",
  "prompt": "Excited and energetic.",
  "temperature": 0.9,
  "top_p": 0.95
}

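For other families, emotion comes from the voice choice itself (e.g. Inworld Hades vs Pixie). Models that don't support prompt, temperature, or top_p silently ignore them; per the request schema, temperature also applies to Orpheus and Chatterbox HD.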
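
Because unsupported tuning fields are ignored without an error, it can help to strip them client-side and log what was dropped. The sketch below encodes the support notes from the request schema (`prompt` and `top_p` for Qwen 3; `temperature` for Qwen 3, Orpheus, and Chatterbox HD). Matching models by ID prefix and the helper name `stripUnsupportedParams` are assumptions for illustration.

```javascript
// Support matrix taken from the request schema notes; matched by model ID prefix.
const PARAM_SUPPORT = [
  { prefix: 'tts-qwen3-', params: ['prompt', 'temperature', 'top_p'] },
  { prefix: 'tts-orpheus', params: ['temperature'] },
  { prefix: 'tts-chatterbox-hd', params: ['temperature'] },
];

const TUNABLE_PARAMS = ['prompt', 'temperature', 'top_p'];

// Returns a copy of the body without fields the model would ignore,
// plus the list of dropped field names so callers can warn about them.
function stripUnsupportedParams(body) {
  const rule = PARAM_SUPPORT.find((r) => String(body.model).startsWith(r.prefix));
  const allowed = rule ? rule.params : [];
  const cleaned = { ...body };
  const dropped = [];
  for (const key of TUNABLE_PARAMS) {
    if (key in cleaned && !allowed.includes(key)) {
      delete cleaned[key];
      dropped.push(key);
    }
  }
  return { body: cleaned, dropped };
}
```

A non-empty `dropped` array is a hint that the chosen model cannot honor the requested style, and that a different voice or a Qwen 3 model may be a better fit.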

Errors

Code	Meaning
`400`	Voice not valid for the chosen model, input too long (>4096 characters), or language hint rejected by a strict model.
`401`	Authentication failed, or the model is Pro-only.
`402`	Insufficient balance.
`429`	Rate limited.
`500` / `503`	Inference / capacity issue — retry with jitter.
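
"Retry with jitter" for `500` / `503` can be sketched as exponential backoff with a randomized delay. The base delay, cap, and attempt count below are illustrative assumptions, not values from the Venice documentation, and the helper names are hypothetical. `429` is left out because the table gives no retry guidance for it.

```javascript
// Only the statuses the error table marks as retryable.
const RETRYABLE_STATUSES = new Set([500, 503]);

function isRetryable(status) {
  return RETRYABLE_STATUSES.has(status);
}

// "Full jitter": a random delay in [0, min(capMs, baseMs * 2^attempt)).
// The random source is injectable so the function can be tested deterministically.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 8000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls send() until it succeeds, fails with a non-retryable status,
// or maxAttempts is reached. send must return a fetch-style Response.
async function withRetry(send, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    const res = await send();
    const lastAttempt = attempt + 1 >= maxAttempts;
    if (res.ok || !isRetryable(res.status) || lastAttempt) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}
```

Usage would wrap the request in a closure, for example `withRetry(() => fetch(url, options))`, so that each attempt issues a fresh request.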

Gotchas

input hard cap is 4096 chars. For books / long content, split on sentence boundaries and concatenate audio client-side.
streaming: true + SDKs: some OpenAI SDK versions don't expose streaming for audio.speech.create; call the REST endpoint directly and consume the HTTP body.
speed compounds with the model's internal speech rate, so extreme values (0.25, 4.0) often sound unnatural; keep within 0.8–1.3 for narration.
Voice names are case-sensitive (eve ≠ EVE, af_sky ≠ AF_SKY).
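
Splitting long content under the 4096-character cap can be sketched as follows. `splitForTts` is a hypothetical helper; it treats `.`, `!`, or `?` followed by whitespace as a sentence boundary, which is a simplification, and it measures length with `.length` (UTF-16 code units), which is an assumption about how the server counts.

```javascript
// Packs whole sentences into chunks of at most `limit` characters.
function splitForTts(text, limit = 4096) {
  // Split after ., !, or ? when followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split.
    // Note: slicing by code units could cut a surrogate pair in rare cases.
    while (rest.length > limit) {
      if (current) {
        chunks.push(current);
        current = '';
      }
      chunks.push(rest.slice(0, limit));
      rest = rest.slice(limit);
    }
    if (rest.length === 0) continue;

    const candidate = current ? `${current} ${rest}` : rest;
    if (candidate.length <= limit) {
      current = candidate;
    } else {
      // Adding this sentence would overflow, so close the current chunk.
      chunks.push(current);
      current = rest;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be sent as its own request with the same model and voice, and the resulting audio joined in order on the client.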