--- name: venice-audio-transcription description: Transcribe audio files to text via POST /audio/transcriptions. Covers supported models (Parakeet, Whisper, Wizper, Scribe, xAI STT), supported formats (wav/flac/m4a/aac/mp4/mp3/ogg/webm), response formats (json/text), timestamps, and language hints. OpenAI-compatible multipart. --- # Venice Transcription (`/audio/transcriptions`) `POST /api/v1/audio/transcriptions` takes an audio file and returns text. It's OpenAI-compatible with `multipart/form-data` — the OpenAI SDK's `audio.transcriptions.create()` works unchanged. ## Use when - You need STT (speech-to-text) for voice notes, meetings, podcasts, short audio. - You need timestamps for subtitles / chapters. - You want to pick between fast local-style models (Parakeet) and large multilingual ones (Whisper, Wizper, Scribe). For long video / YouTube transcription, see [`venice-video`](../venice-video/SKILL.md)'s `/video/transcriptions` (takes a public video URL directly). ## Minimal request ```bash curl https://api.venice.ai/api/v1/audio/transcriptions \ -H "Authorization: Bearer $VENICE_API_KEY" \ -F "file=@./meeting.m4a" \ -F "model=nvidia/parakeet-tdt-0.6b-v3" \ -F "response_format=json" \ -F "timestamps=false" ``` ```json { "text": "Alright everyone, let's kick off the meeting..." } ``` With `timestamps=true`, `json` format also returns segment/word timings (schema is model-specific). ## Request (`multipart/form-data`) | Field | Type | Default | Notes | |---|---|---|---| | `file` | binary | — | **Required.** Audio file. Supported: `wav`, `wave`, `flac`, `m4a`, `aac`, `mp4`, `mp3`, `ogg`, `webm`. Base64 is **not** accepted — upload as a real file. | | `model` | enum | `nvidia/parakeet-tdt-0.6b-v3` | See models below. | | `response_format` | `json` / `text` | `json` | `text` returns `text/plain` body. | | `timestamps` | bool | `false` | Include segment/word timestamps (JSON only). | | `language` | string | — | ISO 639-1 hint (e.g. `en`, `ja`). Only Whisper-family models honor it; others auto-detect. | ## Models | Model ID | Notes | |---|---| | `nvidia/parakeet-tdt-0.6b-v3` | Default. Fast, English-first, great for real-time-ish flows. | | `openai/whisper-large-v3` | Large multilingual, honors `language` hint. | | `fal-ai/wizper` | Whisper variant, competitive on quality/latency tradeoff. | | `elevenlabs/scribe-v2` | ElevenLabs Scribe, strong on noisy audio. | | `stt-xai-v1` | xAI Speech-to-Text. | `GET /models?type=asr` returns the current catalog. ASR pricing is `pricing.per_audio_second.usd` — cost scales with audio duration. ## OpenAI SDK ```ts import OpenAI from 'openai' import fs from 'node:fs' const client = new OpenAI({ apiKey: process.env.VENICE_API_KEY, baseURL: 'https://api.venice.ai/api/v1', }) const out = await client.audio.transcriptions.create({ file: fs.createReadStream('meeting.m4a'), model: 'openai/whisper-large-v3', response_format: 'json', language: 'en', // @ts-expect-error — Venice-specific extra, passes through multipart timestamps: true, }) console.log(out.text) ``` ## Batch / long files Venice doesn't expose native chunking. For files > ~30 min, split client-side on silence with `ffmpeg` or `pydub`, transcribe each chunk, then concatenate with offset timestamps. ```bash ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3 ``` ## Errors | Code | Meaning | |---|---| | `400` | Bad params, unsupported audio format, empty file, or **file larger than 25 MB** (this endpoint returns `400` with `"Maximum size is 25MB"`, not `413`). | | `401` | Auth / Pro-only. | | `402` | Insufficient balance. | | `415` | Wrong `Content-Type` — must be `multipart/form-data`. | | `422` | Validation / upstream ASR error (e.g. zero-length audio, upstream provider 422). Not a "content policy" code on this path. | | `429` | Rate limited. | | `500` / `503` | Transient; retry with jitter. | ## Gotchas - `file` must be uploaded as a real multipart file part. JSON + base64 is **not** supported here. - Timestamps are only surfaced in the JSON response shapes (`json`, `verbose_json`, `srt`, `vtt`). With `response_format: text` the handler returns a plain `text/plain` body containing just the transcript — you'll lose any timestamp data, so pick `verbose_json` / `srt` / `vtt` when you need timings. - `language` is Whisper-specific. Parakeet / Scribe ignore it and auto-detect. - Peak concurrency limits apply — on `429`, back off; big batches should throttle to ~5 parallel requests. - Content-policy rejection on the transcript is returned as `422` with an error string; it does not surface `suggested_prompt` on this path. - **Python `requests` + env vars**: In `execute_code` contexts, `os.environ.get("VENICE_API_KEY")` may return `None` even when the shell has the variable. Pass the key explicitly or read it from a known file (e.g., `.env`). A `401` from `requests.post(..., files=...)` with multipart data often means the key wasn't injected, not that the key is invalid. - **Long-running transcription**: Venice can take several minutes for 20–30 min audio files. The `requests`/`curl` call may exceed a 60 s tool timeout. Use `terminal(background=true)` with `curl` writing to a file, then poll for completion, or use `requests` inside `execute_code` with a long `timeout` (e.g., 300–600 s). - **25 MB file size limit**: This is a hard limit. A ~1h 47m MP3 at 32 kbps was 26 MB and had to be split. Use `ffmpeg -i input.mp3 -t -c copy part_a.mp3` and `-ss -c copy part_b.mp3` to split without re-encoding. ## VPS fallback transcription When local Whisper cannot be installed (no GPU, no disk space, PEP 668 restriction), see [`references/vps-fallback-transcription.md`](references/vps-fallback-transcription.md) for a session-tested curl-based workflow using Venice's whisper-large-v3.