venice-chat

/home/avalon/.hermes/skills/venice/venice-chat/SKILL.md · raw

Venice Chat Completions

POST /api/v1/chat/completions is Venice's main text endpoint. It's OpenAI-compatible, plus a venice_parameters object for Venice-only features.

Use when

You need LLM text generation, with or without tools, with or without streaming.
You want multimodal inputs (images, audio, video) to a vision/audio-capable model.
You want Venice-specific features: web search, E2EE, characters, xAI X/Twitter search, strip-thinking, web scraping.
You need prompt caching for large system prompts or long documents.
You need structured (json_schema) output.

For the newer Alpha Responses API, see venice-responses.

Minimal request

curl https://api.venice.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org-glm-5-1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}]
  }'

Response shape is the standard OpenAI chat.completion object (id, object: "chat.completion", choices[].message, usage). With stream: true, responses come as SSE data: lines in chat.completion.chunk format.

The request body

Core fields (OpenAI-compatible)

Field	Notes
`model`	string — model ID, trait name, or compatibility mapping. Suffixes allowed (see below). Required.
`messages`	array of `system` / `developer` / `user` / `assistant` / `tool` messages. Required, min 1.
`temperature`, `top_p`, `top_k`, `min_p`, `min_temp`, `max_temp`	sampling controls
`repetition_penalty`, `frequency_penalty`, `presence_penalty`	repetition controls
`max_tokens` (deprecated) / `max_completion_tokens`	upper bound on output tokens
`n`	number of choices (keep `1` to minimize cost)
`seed`	integer for reproducibility
`stop` / `stop_token_ids`	up to 4 strings, or raw token IDs
`stream`, `stream_options.include_usage`	SSE streaming + include usage in the final chunk
`response_format`	`{type:"json_schema", json_schema:{...}}` (preferred), `{type:"json_object"}`, or `{type:"text"}`
`tools`, `tool_choice`, `parallel_tool_calls`	function calling / built-in tools
`logprobs`, `top_logprobs`	return token log-probabilities
`reasoning.effort` / `reasoning_effort`	`none` \| `minimal` \| `low` \| `medium` \| `high` \| `xhigh` \| `max`
`reasoning.summary`	`auto` \| `concise` \| `detailed`
`prompt_cache_key`, `prompt_cache_retention` (`default`/`extended`/`24h`)	prompt caching hints
`text.verbosity`	`low`/`medium`/`high`/`auto`
`metadata`	key/value strings for tracking
`user`, `store`	accepted but ignored (OpenAI compat)

`venice_parameters` (Venice-only)

All optional. Combined with model feature suffixes, these are how you enable Venice features.

Field	Type	Default	Effect
`character_slug`	string	—	Apply a published Venice character. Slug is the "Public ID" on the character page. See `venice-characters`.
`strip_thinking_response`	bool	`false`	Strip `<think>...</think>` from the assistant output on reasoning models.
`disable_thinking`	bool	`false`	Disable thinking entirely on supported reasoning models and strip tags.
`enable_e2ee`	bool	`true`	End-to-end encryption on E2EE-capable models when E2EE headers are present. Set to `false` to force TEE-only.
`enable_web_search`	`"off"`/`"auto"`/`"on"`	`"off"`	Venice server-side web search. Citations arrive in the first streamed chunk or the response.
`enable_web_scraping`	bool	`false`	Scrape any URLs found in the last user message (Firecrawl).
`enable_web_citations`	bool	`false`	Ask the LLM to cite sources with `^1^` / `^1,3^` superscripts.
`include_search_results_in_stream`	bool	`false`	Experimental — emit search results as the first stream chunk.
`return_search_results_as_documents`	bool	—	Also surface search results as a synthetic tool call `venice_web_search_documents` (LangChain-friendly).
`include_venice_system_prompt`	bool	`true`	Prepend Venice's curated system prompt. Turn off for full control.
`enable_x_search`	bool	`false`	xAI native web + X/Twitter search (Grok models with `supportsXSearch`). Adds ~$0.01/search.

Model feature suffixes

Some venice_parameters can also be expressed as model feature suffixes on the model string — useful when the caller/library (OpenAI SDK, LangChain) can't set venice_parameters. Syntax:

<model-id>:<key>=<value>[&<key>=<value>…]

Values are URL-decoded. Supported keys (exact match):

Key	Type	Maps to
`enable_web_search`	`on` / `off` / `auto`	`venice_parameters.enable_web_search`
`enable_web_citations`	`"true"` / `"false"`	`venice_parameters.enable_web_citations`
`enable_web_scraping`	`"true"` / `"false"`	`venice_parameters.enable_web_scraping`
`include_venice_system_prompt`	`"true"` / `"false"`	`venice_parameters.include_venice_system_prompt`
`include_search_results_in_stream`	`"true"` / `"false"`	`venice_parameters.include_search_results_in_stream`
`return_search_results_as_documents`	`"true"` / `"false"`	`venice_parameters.return_search_results_as_documents`
`character_slug`	string	`venice_parameters.character_slug`
`strip_thinking_response`	`"true"` / `"false"`	`venice_parameters.strip_thinking_response`
`disable_thinking`	`"true"` / `"false"`	`venice_parameters.disable_thinking`

Unknown keys are silently ignored. Examples:

zai-org-glm-5-1:enable_web_search=on
kimi-k2-6:strip_thinking_response=true&enable_web_search=auto
zai-org-glm-5-1:character_slug=alan-watts

Note: enable_e2ee and enable_x_search can only be set via venice_parameters, not as suffixes.

Messages and modalities

messages[].content is either a string or an array of typed parts. Roles: user, assistant, tool, system, developer (reasoning models like o-series / codex).

Text + image (`image_url`)

{
  "model": "zai-org-glm-5-1",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
    ]
  }]
}

url accepts a public URL or data:image/png;base64,....
Models with model_spec.capabilities.supportsMultipleImages: true preserve images across the whole conversation; single-image vision models only keep images from the last user message. Check model_spec.capabilities.maxImages for the per-request cap.

Audio input (`input_audio`)

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Transcribe this clip."},
    {"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"}}
  ]
}

Formats: wav, mp3, aiff, aac, ogg, flac, m4a, pcm16, pcm24. Audio URLs are not supported — always inline base64.

Video input (`video_url`)

{
  "role": "user",
  "content": [
    {"type": "text", "text": "Summarize this."},
    {"type": "video_url", "video_url": {"url": "https://www.youtube.com/watch?v=..."}}
  ]
}

Accepts public URLs (including YouTube for some providers) or data:video/mp4;base64,.... Supported formats: mp4, mpeg, mov, webm.

Prompt caching (`cache_control`)

Any text / image_url / input_audio / video_url part can carry:

{"cache_control": {"type": "ephemeral", "ttl": "1h"}}

Combine with prompt_cache_key and prompt_cache_retention: "24h" on the root request for predictable cache routing. Cache read / write pricing is model-specific — check model_spec.pricing on /models.

Tools & function calling

Function tools

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      },
      "strict": true
    }
  }],
  "tool_choice": "auto"
}

tool_choice can also be "required", "none", or {"type":"function","function":{"name":"get_weather"}}.
parallel_tool_calls: true (default) lets the model emit multiple calls at once.
Respond by appending {"role":"tool","tool_call_id":"...","content":"..."} before the next call.

Built-in tools

"tools": [{"type": "web_search"}, {"type": "x_search"}]

Equivalent to toggling venice_parameters.enable_web_search / enable_x_search. x_search requires a model with supportsXSearch.

Reasoning models

On thinking models (GLM 5.1, Kimi K2.6, Claude Opus 4.7, GPT-5.4 Pro, …):

{
  "model": "zai-org-glm-5-1",
  "reasoning": {"effort": "medium", "summary": "auto"},
  "venice_parameters": {"strip_thinking_response": false},
  "messages": [...]
}

reasoning_effort is the OpenAI-compatible flat variant (takes precedence over reasoning.effort).
Reasoning models may return reasoning_content or structured reasoning_details[] on the assistant message. Pass reasoning_details back verbatim in the next turn — it encodes thought signatures for providers like Claude Opus 4.7 and GPT-5.4 Pro.
Use venice_parameters.disable_thinking: true to skip thinking entirely on supported models.

Structured output (`response_format`)

{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "type": "object",
      "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
      "required": ["name", "age"]
    }
  }
}

Prefer json_schema over the legacy json_object. Plain text is the default ({"type": "text"}).

E2EE (end-to-end encryption)

For models advertising supportsE2EE:

Perform an HPKE / Noise handshake with Venice (see docs.venice.ai/e2ee).
Send encrypted payload with the required E2EE request headers.
Leave venice_parameters.enable_e2ee at default true, or set false to fall back to TEE-only.

E2EE is not supported on /responses — use /chat/completions for encrypted inference.

Streaming

{"stream": true, "stream_options": {"include_usage": true}}

Response is text/event-stream. Each event is data: {...chunk...}\n\n, terminated by data: [DONE].
include_usage: true adds a final chunk with token counts.
With venice_parameters.include_search_results_in_stream: true, the first chunk carries venice_search_results.

Web-search answers

When enable_web_search is "auto" or "on", the response includes venice_parameters.web_search_citations[] where each entry has url, title, content (snippet), and date. Turn on enable_web_citations to have the model insert ^1^ superscripts inline.

Error handling specifics

402 — insufficient balance. Bearer: INSUFFICIENT_BALANCE. x402: PAYMENT_REQUIRED with structured topUpInstructions and siwxChallenge.
422 — prompt violates Venice or provider content policy. May include suggested_prompt.
413 — payload too large (mostly vision/audio).
429 — rate limit. See /api_keys/rate_limits and venice-errors.

Common gotchas

max_tokens is deprecated — prefer max_completion_tokens.
Reasoning-capable models can exhaust max_completion_tokens before emitting the visible structured answer—even with venice_parameters.disable_thinking: true. A response may return HTTP 200 with finish_reason: "length" and only a fragment such as {"reply":"Do. For small strict JSON schemas, budget at least ~1,200 completion tokens, check finish_reason, parse fenced/surrounded JSON defensively, and retry malformed or truncated output.
Image URLs must be publicly reachable from Venice's network. Localhost / signed S3 URLs without public access fail.
Audio inputs cannot be URLs — always base64.
Single-image vision models drop older images on each turn; chain them into the last user message.
For multi-turn with tools on Claude Opus 4.7, GPT-5.4 Pro, and similar, always round-trip reasoning_details unchanged.
parallel_tool_calls: true means you MUST be prepared to execute several tools in parallel before sending a single tool-role reply chain.
character_slug replaces the default Venice system prompt. Combine with include_venice_system_prompt: false for total control.