Deploy GPU inference workloads that scale to zero. Worker spins up on job submit, processes, scales back down.
Web App (VPS) ──POST /run──> RunPod Endpoint ──> Worker (GPU)
     │                              │                 │
     │ <──poll /status/{id}──       │    downloads    │
     │                              │    from URLs    │
     │ <──result (video URL)──      │                 │
The web app on the VPS handles file uploads, stores them locally (served via Express static), and sends their URLs to RunPod. The worker downloads the files, processes them, and either uploads the result to S3 or returns it as base64.
Base: https://api.runpod.io/graphql (management) and https://api.runpod.ai/v2/{ENDPOINT_ID} (jobs)
Auth header: Authorization: Bearer {API_KEY}
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
{"input": { ... }}
Response: {"id": "job-uuid", "status": "IN_QUEUE"}
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}
Response: {"id": "...", "status": "COMPLETED", "output": { ... }}
Statuses: IN_QUEUE → IN_PROGRESS → COMPLETED | FAILED | CANCELLED
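The status flow above maps onto a small polling loop. The following is a minimal sketch, not an official client: `poll_until_done` and its parameters are hypothetical names, and `fetch_status` stands in for whatever function performs the authenticated `GET /status/{JOB_ID}` call and returns the parsed JSON body.

```python
import time
from typing import Callable, Dict

# Terminal statuses as listed in this document.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the job reaches a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {body.get('status')} after {timeout_s}s")
        sleep(interval_s)
```

Passing the fetcher and the sleep function in as arguments keeps the loop testable without network access. A generous `timeout_s` is advisable because cold starts with large model downloads can take many minutes.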
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
# Write query to file to avoid shell escaping hell
cat > /tmp/rp_query.json << 'EOF'
{"query": "mutation { createNetworkVolume(input: { name: \"my-models\", size: 50, dataCenterId: \"US-TX-3\" }) { id name size } }"}
EOF
curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
https://api.runpod.io/graphql \
-H "Content-Type: application/json" \
-d @/tmp/rp_query.json
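If a scripting language is at hand, the escaping problem can also be sidestepped by letting a JSON serializer do the quoting. This is a minimal sketch under the assumption that string arguments are plain text that `json.dumps` can quote as a GraphQL string literal; `gql_string` and `create_volume_payload` are hypothetical helper names, not part of any RunPod SDK.

```python
import json


def gql_string(value: str) -> str:
    """Quote a Python string as a GraphQL string literal via JSON escaping."""
    return json.dumps(value)


def create_volume_payload(name: str, size_gb: int, data_center_id: str) -> str:
    """Build the JSON request body for the createNetworkVolume mutation."""
    query = (
        "mutation { createNetworkVolume(input: { "
        f"name: {gql_string(name)}, "
        f"size: {int(size_gb)}, "
        f"dataCenterId: {gql_string(data_center_id)}"
        " }) { id name size } }"
    )
    # The outer json.dumps escapes the inner quotes, so no hand-written \" is needed.
    return json.dumps({"query": query})
```

Writing the output of `create_volume_payload("my-models", 50, "US-TX-3")` to `/tmp/rp_query.json` should produce the same body as the heredoc above.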
IMPORTANT: dockerArgs is a REQUIRED field (not optional).
cat > /tmp/rp_template.json << 'EOF'
{"query": "mutation { saveTemplate(input: { name: \"My Worker\", imageName: \"runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04\", dockerArgs: \"bash -c 'if [ ! -d /runpod-volume/handler-repo ]; then git clone https://github.com/USER/REPO.git /runpod-volume/handler-repo; fi && bash /runpod-volume/handler-repo/start.sh'\", containerDiskInGb: 20, volumeInGb: 0, isServerless: true, startJupyter: false, startSsh: false, volumeMountPath: \"/runpod-volume\", env: [{key: \"MODEL_DIR\", value: \"/runpod-volume/models\"}, {key: \"HF_HOME\", value: \"/runpod-volume/huggingface\"}] }) { id name } }"}
EOF
curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
https://api.runpod.io/graphql \
-H "Content-Type: application/json" \
-d @/tmp/rp_template.json
cat > /tmp/rp_endpoint.json << 'EOF'
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48\", workersMin: 0, workersMax: 1, idleTimeout: 5, scalerType: \"QUEUE_DELAY\", scalerValue: 4 }) { id name gpuIds workersMin workersMax } }"}
EOF
curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
https://api.runpod.io/graphql \
-H "Content-Type: application/json" \
-d @/tmp/rp_endpoint.json
| Tier | VRAM | Example GPUs | Cost/sec (flex) |
|---|---|---|---|
| ADA_24 | 24GB | RTX 4090 | ~$0.00031 |
| ADA_48_PRO | 48GB | L40S | ~$0.00053 |
| AMPERE_80 | 80GB | A100 | ~$0.00076 |
import runpod
import os, requests, subprocess

def handler(job):
    input_data = job["input"]
    # Download input files from URLs
    video_path = download(input_data["video_url"], "/tmp/input.mp4")
    # Run inference
    runpod.serverless.progress_update(job, "Processing...")
    result = subprocess.run(["python", "inference.py", ...], ...)
    # Return result (URL or base64)
    return {"status": "completed", "output_url": "..."}

if __name__ == "__main__":
    # Download models on cold start (cached on network volume)
    ensure_models()
    runpod.serverless.start({"handler": handler})
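The handler above calls a `download` helper that is not shown. One possible shape for it is sketched below. It is an assumption, not the author's implementation, and it uses the standard library's `urllib` instead of `requests` so it runs without extra dependencies. The `.part` temp-file convention is likewise illustrative.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str, timeout_s: float = 60.0) -> str:
    """Stream url to dest and return dest; writes to a temporary name first."""
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout_s) as resp, open(tmp_path, "wb") as out:
        # copyfileobj copies in chunks, so large videos never sit fully in memory
        shutil.copyfileobj(resp, out)
    # Rename only after a complete download, so a crash never leaves a truncated input file
    tmp_path.replace(dest_path)
    return str(dest_path)
```

For HTTP URLs, `urlopen` raises on 4xx/5xx responses, so a missing upload surfaces as an exception inside the handler rather than as an empty file passed to inference.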
Handles first-run setup on network volume:
#!/bin/bash
set -e
# System deps
if ! command -v ffmpeg &>/dev/null; then
    apt-get update && apt-get install -y ffmpeg libgl1 libglib2.0-0
fi
# Clone ML repo (cached on volume)
if [ ! -d /runpod-volume/MyProject ]; then
    git clone https://github.com/ORG/PROJECT.git /runpod-volume/MyProject
fi
# Install deps (cached via flag file)
if [ ! -f /runpod-volume/.deps_installed_v1 ]; then
    pip install -r /runpod-volume/MyProject/requirements.txt
    pip install runpod boto3 requests huggingface_hub
    touch /runpod-volume/.deps_installed_v1
fi
# Update handler (always fresh)
cd /runpod-volume/handler-repo && git pull 2>/dev/null || true
# Run
exec python -u /runpod-volume/handler-repo/handler.py
RunPod does NOT support multipart uploads. Two approaches:
1. URL-based (recommended): Web app stores files locally, serves via Express static, sends URLs to worker. Worker downloads via HTTP.
2. Base64: Encode files in request body. Bad for large files (video).
For results: upload to S3, or return base64 for small files.
- Send GraphQL payloads from a file with curl -d @file. Shell escaping of nested quotes + bash inside dockerArgs inside JSON is impossible to get right inline.
- Some plausible GraphQL field names are invalid: flashBoot → use nothing (not a valid field), creditBalance → clientBalance. Test with introspection or trial/error.
- ConnectionResetError is common. Retry after sleep 3.
- Set client_max_body_size 200M; in nginx for video upload apps.
- With a private repo, git clone silently fails (especially with || true). The worker starts but can't find handler.py → exit code 2 + "No such file or directory". NEVER hardcode GitHub tokens in dockerArgs: they're visible in plaintext in the template config. Best solutions (in order of preference):
  - Put the token in an environment variable (GH_TOKEN) and reference it: git clone https://$GH_TOKEN@github.com/USER/REPO.git ...

Common error: worker exited with exit code 2. This usually means the Python handler script wasn't found.
Diagnosis checklist:
1. Check RunPod endpoint logs for the actual error message
2. Verify the dockerArgs or Dockerfile copies/clones the handler to the right path
3. If using git clone in dockerArgs, test the clone URL manually (private repo? auth needed?)
4. The || true pattern in bash masks clone failures — remove it temporarily to see errors
5. Compare the path in the CMD/exec line with where the handler actually ends up
Querying template config via API (to inspect dockerArgs remotely):
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
Prefer building a Docker image with the handler baked in over cloning repos at runtime in dockerArgs. The runtime approach is fragile:
- Private repos fail silently with || true
- Runtime pip install and git clone add minutes to every cold start
- dockerArgs bash scripts are hard to debug and maintain
Better approach:
1. Write a proper Dockerfile that COPYs handler.py and installs deps at build time
2. Build and push to Docker Hub: docker build -t user/worker:latest . && docker push user/worker:latest
3. Update the RunPod template to use that image with empty dockerArgs
4. The Dockerfile CMD is the single source of truth for the entrypoint
This gives faster cold starts, no auth issues, and reproducible builds. At runtime the worker should only touch the public MoCha/model repos: your handler code is baked into the image, so no private repo cloning is needed on the worker.
Network volumes pin workers to a specific datacenter. If the GPU tier you selected isn't available in that datacenter, workers will simply never start — the health endpoint shows 0 workers across all states (not even initializing).
Symptoms of GPU unavailability:
- Jobs sit IN_QUEUE indefinitely with massive delayTime (hours)
- Health shows: workers: {idle: 0, initializing: 0, ready: 0, running: 0, unhealthy: 0} despite queued jobs
- No error messages — just silent non-scheduling
- Workers may briefly show throttled: 1 — this means RunPod tried to provision but no capacity exists
Fixes:
1. Add ALL compatible GPU tiers: gpuIds: "AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO" (comma-separated, dramatically increases availability)
2. Try removing the network volume temporarily to allow any datacenter: networkVolumeId: "" in saveEndpoint
3. Create a new network volume in a busier datacenter — use trial-and-error with createNetworkVolume mutation (not all datacenters have storage clusters)
4. Check GPU availability: { gpuTypes { id displayName memoryInGb secureCloud communityCloud } }
5. Available GPU tiers for serverless: AMPERE_16, AMPERE_24, ADA_24, AMPERE_48, ADA_48_PRO, AMPERE_80, ADA_80_PRO
Worker health states explained:
- initializing → container starting, pulling image
- running → either processing a job OR still in startup code (model downloads happen BEFORE runpod.serverless.start())
- ready → idle, waiting for jobs (handler started, model download complete)
- unhealthy → worker crashed. Common causes: OOM from too-small GPU, disk space exhaustion (exit code 139 = SIGSEGV, usually disk full)
- throttled → RunPod tried to provision a GPU but none available in the pinned datacenter
CRITICAL: "running" does NOT mean "processing jobs"
When a worker shows running but jobs stay IN_QUEUE, the worker is still in its startup/initialization phase (downloading models, installing deps). The job won't move to IN_PROGRESS until runpod.serverless.start() is called. For large model downloads (~28GB), this can take 10-15+ minutes.
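The state descriptions above can be folded into a rough triage function. This is a heuristic sketch only: `diagnose` is a hypothetical helper, the `workers` dict is assumed to have the shape shown in the health output earlier, and the queue depth is passed in separately.

```python
from typing import Dict


def diagnose(workers: Dict[str, int], queued_jobs: int) -> str:
    """Map worker counts from /health plus the queue depth to a likely cause."""
    if workers.get("unhealthy", 0) > 0:
        return "worker crashed: check logs for OOM or disk exhaustion"
    if workers.get("throttled", 0) > 0:
        return "no GPU capacity in the pinned datacenter: add GPU tiers or move the volume"
    if queued_jobs > 0 and sum(workers.values()) == 0:
        return "nothing scheduled: GPU tier likely unavailable in this datacenter"
    if queued_jobs > 0 and workers.get("running", 0) > 0:
        # 'running' with a growing queue may just be startup code, not job processing
        return "worker may still be in startup code before runpod.serverless.start()"
    if workers.get("initializing", 0) > 0:
        return "container starting or pulling image"
    return "no obvious problem"
```

A `running` worker with a backlog can also be a healthy worker that is simply busy, so the fourth branch is a hint to check the logs, not a verdict.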
On RunPod Serverless, network volumes mount at /runpod-volume/, NOT /workspace/. This is a critical difference from RunPod Pods where volumes mount at /workspace/. If your env vars point model storage to /workspace/, models will download to the container disk instead of the persistent volume, causing:
- Disk space exhaustion (container disk is typically 20-50GB, models can be 30GB+)
- Worker crashes with exit code 139 (SIGSEGV from disk full)
- Models re-downloading on every cold start (defeating the purpose of the volume)
Always set storage paths to /runpod-volume/ for serverless:
MODEL_DIR=/runpod-volume/mocha-models
HF_HOME=/runpod-volume/huggingface
The container disk must fit the unpacked Docker image layers (2-3x the compressed size) PLUS any runtime files. Docker's overlay2 filesystem needs space for all layers unpacked.
Critical: compressed image size ≠ disk space needed. A 16GB compressed image can need 30-40GB unpacked. The error no space left on device during container startup (before any code runs) means the container disk is too small for just the image layers.
Sizing guide:
- With network volume: container disk = (compressed image × 2.5) + 20GB buffer. Models go on the volume.
- Without network volume: container disk = (compressed image × 2.5) + models + 20GB buffer
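Example with 16GB compressed image + 28GB models:
- With volume: 100GB container disk (image unpacked ~40GB + buffer)
- Without volume: 200GB container disk (40GB image + 28GB models + buffer)
The sizing formulas give minimums of 60GB and 88GB for this case; the figures above round up well past them for headroom.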
When in doubt, use 200GB container disk. The cost difference is negligible for serverless (you only pay when running), and debugging disk space failures is painful because RunPod gives minimal error info.
Always prefer network volume for large models. Container disk alone is fragile for 20GB+ model downloads.
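The sizing guide reduces to a one-line calculation. The sketch below encodes the rule of thumb from this section (2.5× unpack factor, 20GB buffer); the function and parameter names are illustrative. It returns a minimum, and the recommendation above is to round up generously.

```python
def min_container_disk_gb(
    compressed_image_gb: float,
    models_gb: float = 0.0,
    has_network_volume: bool = True,
    unpack_factor: float = 2.5,
    buffer_gb: float = 20.0,
) -> float:
    """Minimum container disk: unpacked image + buffer (+ models if no volume)."""
    unpacked_image_gb = compressed_image_gb * unpack_factor
    # Models only count against the container disk when there is no network volume
    models_on_disk_gb = 0.0 if has_network_volume else models_gb
    return unpacked_image_gb + models_on_disk_gb + buffer_gb
```

For the 16GB image with 28GB of models, this gives 60GB with a network volume and 88GB without one.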
Updating a template via saveTemplate does NOT automatically update the active endpoint release. You must also call saveEndpoint (with the endpoint id) to trigger a new release that picks up the template changes. The RunPod UI shows a release history — check it to verify your changes propagated.
# Step 1: Update template
saveTemplate(input: { id: "TEMPLATE_ID", imageName: "new-image:latest", ... })
# Step 2: Trigger new release on endpoint
saveEndpoint(input: { id: "ENDPOINT_ID", name: "...", templateId: "TEMPLATE_ID", ... })
# Purge all queued jobs
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
Response: {"removed": N, "status": "completed"}
# Cancel a specific job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/cancel/{JOB_ID}
Always purge stale jobs after fixing a worker issue — old jobs submitted under a broken config will likely fail anyway.
App-side job tracking pitfall: If your web app tracks jobs in-memory and you purge RunPod's queue via API, the app still shows jobs as IN_QUEUE. The status poll (GET /status/{JOB_ID}) will return {"error": "request does not exist"} for purged jobs. Handle this in your polling logic by marking such jobs as CANCELLED.
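One way to handle purged jobs in the polling path is to normalize the response before updating app state. This is a minimal sketch: `effective_status` is a hypothetical helper, and the error string it matches is the one quoted above.

```python
from typing import Dict


def effective_status(body: Dict) -> str:
    """Return the job status, treating a purged or unknown job as CANCELLED."""
    if "status" in body:
        return body["status"]
    if "does not exist" in str(body.get("error", "")):
        # RunPod no longer knows this job (e.g. after purge-queue)
        return "CANCELLED"
    return "UNKNOWN"
```

The app's in-memory job table can then store `effective_status(response)` instead of reading `response["status"]` directly, which would raise a `KeyError` for purged jobs.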
When the RunPod account balance goes negative, serverless endpoints become unreachable. The REST API (api.runpod.ai/v2/{ENDPOINT_ID}/*) returns {"message": "Not Found"} for ALL operations (health, run, status). The GraphQL API still shows the endpoint exists. This is NOT a misconfiguration — it's RunPod disabling the endpoint due to unpaid balance.
Even after adding funds, the old endpoint stays broken. The fix is to delete and recreate:
# Delete broken endpoint
{"query": "mutation { deleteEndpoint(id: \"OLD_ENDPOINT_ID\") }"}
# Create fresh one with same config
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO\", workersMin: 0, workersMax: 1, idleTimeout: 5 }) { id name } }"}
After recreating, update the endpoint ID everywhere it's referenced (web app config, environment variables, etc.) and restart the app.
Ongoing storage costs: Network volumes cost ~$0.047/hr per 50GB ($34/month) even when no workers are running. Monitor currentSpendPerHr — if it's non-zero with no active pods, that's the volume storage cost slowly draining the balance.
NEVER trust model size estimates in comments or documentation. Always verify actual download size before choosing network volume or container disk sizes:
# Check actual model size via HuggingFace API
curl -s "https://huggingface.co/api/models/ORG/MODEL/tree/main" | python3 -c "
import sys, json
files = json.load(sys.stdin)
total = 0
for f in files:
    sz = f.get('size', 0)
    if sz > 100_000_000:
        print(f' {f[\"path\"]:60s} {sz/1e9:.2f} GB')
    total += sz
print(f'Total: {total/1e9:.2f} GB ({len(files)} files)')
"
Real-world example: Wan2.1-T2V-14B was documented as "~26GB" but is actually 69GB. This caused a 50GB network volume to be insufficient, leading to silent crash loops.
HuggingFace download doubles disk usage: huggingface_hub downloads to HF_HOME cache first, then copies/symlinks to local_dir. If HF_HOME and local_dir are on the same volume AND the FS doesn't support reflinks, you need 2x the model size in free space. Set HF_HOME to the same volume as local_dir and use snapshot_download(local_dir=..., local_dir_use_symlinks=True) to avoid doubling.
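A cheap guard against the doubling problem is to check free space before starting a download. The sketch below uses only the standard library. `has_room_for` is a hypothetical helper, and `copies=2` encodes the worst case described above, where the cache and `local_dir` each hold a full copy.

```python
import shutil


def has_room_for(path: str, model_bytes: int, copies: int = 2) -> bool:
    """True if the filesystem holding path has space for `copies` full copies of the model."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= model_bytes * copies
```

Calling this in startup code, with the verified model size and the volume mount path, and exiting with a clear log message when it returns `False`, turns a silent crash loop into an explicit error.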
Symptom: Worker shows running=1 in health, but jobs never move from IN_QUEUE to IN_PROGRESS. The worker periodically flips between running and idle/ready.
What's happening: The handler's __main__ block runs ensure_models() (or similar startup code) BEFORE calling runpod.serverless.start(). If the startup code fails (disk full, OOM, download error), the process crashes. RunPod auto-restarts the worker, which tries again and fails again — infinite crash loop.
Key indicators:
- delayTime on jobs grows to hundreds of thousands of ms (minutes to hours)
- currentSpendPerHr stays at just the storage rate (no GPU billing = workers aren't running long enough to bill)
- Health periodically shows ready=1, idle=1 (worker restarted, hasn't crashed yet) then goes back to running=1 (trying startup code again)
- Zero completed AND zero failed jobs (handler never registered)
Diagnosis steps:
1. Check RunPod UI logs (the only way to see worker stdout/stderr)
2. Calculate total model size vs volume/disk size (most common cause)
3. If the models are larger than the volume, the download fills the volume and crashes, the volume is cleaned up on restart, and the cycle repeats
4. Cancel all queued jobs, since they'll never complete under a broken worker
| Code | Signal | Common Cause |
|---|---|---|
| 1 | - | Python exception, missing module |
| 2 | - | File not found (handler.py missing) |
| 137 | SIGKILL | OOM killed (GPU VRAM or system RAM) |
| 139 | SIGSEGV | Segfault — usually disk full during large writes |
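The signal column follows the usual shell convention that an exit code above 128 means "killed by signal (code − 128)". The sketch below decodes that with the standard library; `describe_exit` is an illustrative name, and the signal names assume a Linux worker.

```python
import signal


def describe_exit(code: int) -> str:
    """Translate a process exit code into a short human-readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"
```

This gives `killed by SIGKILL` for 137 and `killed by SIGSEGV` for 139. The likely causes in the table still have to be confirmed from the worker logs.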
- throttled = no GPU capacity, unhealthy = crashed
- Check containerDiskInGb: if the worker crashes during startup, it's almost always disk space
- If the worker shows running but jobs stay IN_QUEUE for 15+ min, the startup code (model download) may be hanging or failing silently

To check everything at once, query these in parallel:
# 1. List all endpoints with config
{"query": "{ myself { endpoints { id name templateId workersMin workersMax gpuIds networkVolumeId idleTimeout } } }"}
# 2. List all pods (persistent)
{"query": "{ myself { pods { id name desiredStatus imageName volumeInGb volumeMountPath } } }"}
# 3. Check balance
{"query": "{ myself { clientBalance } }"}
These fields DO NOT EXIST (despite seeming logical):
- myself.serverlessWorkers — no such field on User type
- endpoint.jobs — no such field on Endpoint type (use REST API for job status)
- endpoint.workers — no such field (use REST health endpoint for worker counts)
- myself.serverlessTemplates — no such field (templates are queried via endpoint.template)
To get endpoint template details:
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
IMPORTANT: Job/health REST API uses api.runpod.ai, NOT api.runpod.io (GraphQL).
# Health check (worker counts)
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
Headers: Authorization: Bearer {API_KEY}
# Submit job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
# Check job status
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}
# Purge queue
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
Note: The health endpoint may return {"message": "Not Found"} if the endpoint has never had a successful worker start. This doesn't mean the endpoint is misconfigured — it means no worker has ever registered.
You cannot browse network volume contents remotely when all pods are stopped. The only ways to verify what's on a volume:
1. Start a pod with the volume attached and SSH in
2. Trigger a serverless cold start (submit a test job) and check the worker logs
3. The download_models.py pattern (check file existence → skip if present) serves as implicit verification on each cold start
Check if your worker image was pushed correctly:
curl -s "https://hub.docker.com/v2/repositories/{USER}/{REPO}/tags/?page_size=5" | python3 -m json.tool
Key fields: full_size (compressed), last_updated. Compare compressed size to expected — e.g., an 8.7GB compressed image with just code+deps (no baked models) vs ~30GB+ with models baked in.
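To compare sizes across tags without reading raw JSON, the response can be reduced to a few columns. This sketch assumes the tags response wraps entries in a `results` list and that each entry has a `name` field. Both are assumptions beyond the `full_size` and `last_updated` fields named above, and `summarize_tags` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def summarize_tags(payload: Dict) -> List[Tuple[str, float, str]]:
    """Return (tag name, compressed size in GB, last_updated) for each tag entry."""
    return [
        (
            tag.get("name", "?"),
            round(tag.get("full_size", 0) / 1e9, 2),  # bytes -> GB, compressed size
            tag.get("last_updated", "?"),
        )
        for tag in payload.get("results", [])
    ]
```

Feeding it the parsed output of the curl command above gives one row per tag. A much smaller size than expected usually means the push did not include the layers you intended.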
# Check balance
{"query": "{ myself { clientBalance } }"}
# List endpoints
{"query": "{ myself { endpoints { id name gpuIds workersMin workersMax } } }"}
# Update template (include id field)
{"query": "mutation { saveTemplate(input: { id: \"TEMPLATE_ID\", ... }) { id } }"}
- The template mutation is saveTemplate, not updateTemplate.
- Include containerDiskInGb, volumeInGb, and env in saveTemplate calls.
- For images with the handler baked in, set dockerArgs: "" so the image entrypoint is not overridden.
- Call saveEndpoint after template changes.
- If a worker shows running but jobs stay IN_QUEUE, suspect startup code before runpod.serverless.start(): often model downloads, disk exhaustion, or failed repo/bootstrap logic.
- Serverless network volumes mount at /runpod-volume/, not /workspace/.
- With a negative balance, the REST API returns Not Found even though GraphQL still shows the endpoint.
- With git clone in dockerArgs, private-repo auth failures can masquerade as missing-handler exit code 2 problems. Prefer baking the handler into the image.