venice-audio-transcription

/home/avalon/.hermes/skills/venice/venice-audio-transcription/SKILL.md · raw

Venice Transcription (/audio/transcriptions)

POST /api/v1/audio/transcriptions takes an audio file and returns text. It's OpenAI-compatible with multipart/form-data — the OpenAI SDK's audio.transcriptions.create() works unchanged.

Use when

For long video / YouTube transcription, see venice-video's /video/transcriptions (takes a public video URL directly).

Minimal request

curl https://api.venice.ai/api/v1/audio/transcriptions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -F "file=@./meeting.m4a" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "response_format=json" \
  -F "timestamps=false"
{ "text": "Alright everyone, let's kick off the meeting..." }

With timestamps=true, json format also returns segment/word timings (schema is model-specific).

Request (multipart/form-data)

Field Type Default Notes
file binary Required. Audio file. Supported: wav, wave, flac, m4a, aac, mp4, mp3, ogg, webm. Base64 is not accepted — upload as a real file.
model enum nvidia/parakeet-tdt-0.6b-v3 See models below.
response_format json / text json text returns text/plain body.
timestamps bool false Include segment/word timestamps (JSON only).
language string ISO 639-1 hint (e.g. en, ja). Only Whisper-family models honor it; others auto-detect.

Models

Model ID Notes
nvidia/parakeet-tdt-0.6b-v3 Default. Fast, English-first, great for real-time-ish flows.
openai/whisper-large-v3 Large multilingual, honors language hint.
fal-ai/wizper Whisper variant, competitive on quality/latency tradeoff.
elevenlabs/scribe-v2 ElevenLabs Scribe, strong on noisy audio.
stt-xai-v1 xAI Speech-to-Text.

GET /models?type=asr returns the current catalog. ASR pricing is pricing.per_audio_second.usd — cost scales with audio duration.

OpenAI SDK

import OpenAI from 'openai'
import fs from 'node:fs'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const out = await client.audio.transcriptions.create({
  file: fs.createReadStream('meeting.m4a'),
  model: 'openai/whisper-large-v3',
  response_format: 'json',
  language: 'en',
  // @ts-expect-error — Venice-specific extra, passes through multipart
  timestamps: true,
})

console.log(out.text)

Batch / long files

Venice doesn't expose native chunking. For files > ~30 min, split client-side on silence with ffmpeg or pydub, transcribe each chunk, then concatenate with offset timestamps.

ffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3

Errors

Code Meaning
400 Bad params, unsupported audio format, empty file, or file larger than 25 MB (this endpoint returns 400 with "Maximum size is 25MB", not 413).
401 Auth / Pro-only.
402 Insufficient balance.
415 Wrong Content-Type — must be multipart/form-data.
422 Validation / upstream ASR error (e.g. zero-length audio, upstream provider 422). Not a "content policy" code on this path.
429 Rate limited.
500 / 503 Transient; retry with jitter.

Gotchas

VPS fallback transcription

When local Whisper cannot be installed (no GPU, no disk space, PEP 668 restriction), see references/vps-fallback-transcription.md for a session-tested curl-based workflow using Venice's whisper-large-v3.