venice-embeddings

/home/avalon/.hermes/skills/venice/venice-embeddings/SKILL.md · raw

Venice Embeddings

POST /api/v1/embeddings returns vector embeddings for strings. It's OpenAI-compatible: the request and response match https://api.openai.com/v1/embeddings closely enough that the OpenAI SDK works out of the box with baseURL: "https://api.venice.ai/api/v1".

Use when

You're building retrieval / RAG / similarity search.
You need text clustering, classification, deduplication, or reranking.
You want Venice's "no-training, no-retention" stance on inference inputs — embeddings are generated and returned; the API does not publish E2EE semantics on /embeddings the way it does on selected chat models.

Text-only. For image/multimodal signals, either run images through a vision chat model and embed the description, or pick a multimodal-capable embedding model from GET /models?type=embedding (the catalog changes; inspect model_spec on each row).

Minimal request

curl https://api.venice.ai/api/v1/embeddings \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept-Encoding: gzip, br" \
  -d '{
    "model": "text-embedding-bge-m3",
    "input": "Why is the sky blue?"
  }'

{
  "object": "list",
  "model": "text-embedding-bge-m3",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.0023, -0.0093, 0.0158, ...] }
  ],
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}

Request schema

Field	Type	Notes
`model`	string	Required. Model ID from `GET /models?type=embedding`.
`input`	string \| string[] \| number[] \| number[][]	Required. Single string, array of strings (≤ 2048 entries), or pre-tokenized arrays.
`encoding_format`	`"float"` \| `"base64"`	Default `"float"`. Use `"base64"` for ~4× payload shrinkage; decode client-side.
`dimensions`	integer	Optional. Truncate output dimensions. Only meaningful when the model's `model_spec.supportsCustomDimensions === true` — behavior on non-supporting models is model-dependent; test a small call before relying on it.
`user`	string	Accepted for OpenAI compat. Discarded by Venice.

input max tokens per string is capped at the model's model_spec.maxInputTokens (typically 8192). Batch arrays are capped at 2048 items. Venice returns one embedding per element, in order, with matching index.

Response headers & compression

Request Accept-Encoding: gzip, br. The response will include Content-Encoding accordingly. For long batches this matters — vectors are large.

For x402 auth, X-Balance-Remaining reports your remaining USDC credits.

Using the OpenAI SDK

import OpenAI from 'openai'

const client = new OpenAI({
  apiKey: process.env.VENICE_API_KEY,
  baseURL: 'https://api.venice.ai/api/v1',
})

const res = await client.embeddings.create({
  model: 'text-embedding-bge-m3',
  input: ['first doc', 'second doc'],
})

const vec0 = res.data[0].embedding

Batch-embedding pattern

async function embedBatch(texts: string[], batchSize = 64) {
  const out: number[][] = []
  for (let i = 0; i < texts.length; i += batchSize) {
    const slice = texts.slice(i, i + batchSize)
    const res = await client.embeddings.create({
      model: 'text-embedding-bge-m3',
      input: slice,
      encoding_format: 'float',
    })
    for (const row of res.data) out[i + row.index] = row.embedding
  }
  return out
}

Keep batches ≤ model context limit total tokens.
On 429, back off exponentially and halve the batch — see venice-errors.

Choosing a model

Query GET /models?type=embedding for the current catalog. Each entry exposes:

model_spec.embeddingDimensions — native output dimension (e.g. 1024 for BGE-M3).
model_spec.maxInputTokens — max tokens per input string.
model_spec.supportsCustomDimensions — whether dimensions can truncate the output.
model_spec.pricing.input.usd / .diem — cost per million input tokens.

Built-in options include text-embedding-bge-m3, text-embedding-bge-en-icl, text-embedding-qwen3-8b, text-embedding-qwen3-0-6b, text-embedding-multilingual-e5-large-instruct, text-embedding-3-small, text-embedding-3-large, gemini-embedding-2-preview, text-embedding-nemotron-embed-vl-1b-v2.

Always pin the model ID — cosine distances are not comparable across different embedding models.

Error handling

Code	Meaning
`400`	Validation error. Check `details` in the response for the exact field.
`401`	Auth / Pro-only model.
`402`	Insufficient balance. Bearer → `INSUFFICIENT_BALANCE`. x402 → structured `PAYMENT_REQUIRED`.
`415`	Wrong `Content-Type` — must be `application/json`.
`429`	Rate limited.
`500`	Inference failed; retry with jitter.
`503`	Model at capacity; retry later.

Gotchas

dimensions is only meaningful when model_spec.supportsCustomDimensions === true. Behavior on other models is model-dependent — test with a small request before relying on it.
input must not be empty; Venice rejects empty strings with 400.
Whether the returned vectors are L2-normalized depends on the model — verify with Math.hypot(...v) ≈ 1 before assuming.
For RAG, store model alongside the vector so you can re-embed on upgrade.