runpod-serverless

RunPod Serverless Deployment

Deploy GPU inference workloads that scale to zero. Worker spins up on job submit, processes, scales back down.

When to Use

User needs GPU inference (image gen, video processing, LLM, etc.)
On-demand GPU without always-on costs
Wrapping a GitHub ML project as an API service

Architecture Pattern

Web App (VPS) ──POST /run──> RunPod Endpoint ──> Worker (GPU)
     │                            │                    │
     │ <──poll /status/{id}──     │     downloads      │
     │                            │     from URLs      │
     │ <──result (video URL)──    │                    │

The web app on the VPS handles file uploads, stores them locally (served via Express static), and sends their URLs to RunPod. The worker downloads the files, processes them, and either uploads the result to S3 or returns it as base64.

API Reference

Base: https://api.runpod.io/graphql (management) and https://api.runpod.ai/v2/{ENDPOINT_ID} (jobs)

Auth header: Authorization: Bearer {API_KEY}

Job Submission

POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
{"input": { ... }}

Response: {"id": "job-uuid", "status": "IN_QUEUE"}

Job Status

GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

Response: {"id": "...", "status": "COMPLETED", "output": { ... }}

Statuses: IN_QUEUE → IN_PROGRESS → COMPLETED | FAILED | CANCELLED | TIMED_OUT
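
The submit-then-poll flow above can be sketched as follows. This is a minimal standard-library client, not RunPod's SDK: `call_api` and `run_and_wait` are hypothetical helper names, the interval and timeout defaults are arbitrary, and `TIMED_OUT` is treated as a terminal status on the assumption that the endpoint enforces an execution timeout.

```python
import json
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"
# Statuses after which polling should stop (TIMED_OUT is an assumption, see above).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def call_api(method, url, api_key, payload=None):
    """Send one JSON request to the REST API and return the parsed response body."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def run_and_wait(endpoint_id, api_key, job_input, interval=5.0, timeout=1800.0,
                 call=call_api, sleep=time.sleep):
    """Submit a job via /run, then poll /status/{id} until it reaches a terminal status."""
    job = call("POST", f"{API_BASE}/{endpoint_id}/run", api_key, {"input": job_input})
    status_url = f"{API_BASE}/{endpoint_id}/status/{job['id']}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = call("GET", status_url, api_key)
        if status.get("status") in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError(f"job {job['id']} did not finish within {timeout} seconds")
```

The `call` and `sleep` parameters exist only so the loop can be exercised without network access; in normal use, `run_and_wait(ENDPOINT_ID, API_KEY, {"video_url": "..."})` is enough.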

Endpoint Health

GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health

Setup Steps

1. Create Network Volume (for model caching)

# Write query to file to avoid shell escaping hell
cat > /tmp/rp_query.json << 'EOF'
{"query": "mutation { createNetworkVolume(input: { name: \"my-models\", size: 50, dataCenterId: \"US-TX-3\" }) { id name size } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_query.json

2. Create Template

IMPORTANT: dockerArgs is a REQUIRED field (not optional).

cat > /tmp/rp_template.json << 'EOF'
{"query": "mutation { saveTemplate(input: { name: \"My Worker\", imageName: \"runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04\", dockerArgs: \"bash -c 'if [ ! -d /runpod-volume/handler-repo ]; then git clone https://github.com/USER/REPO.git /runpod-volume/handler-repo; fi && bash /runpod-volume/handler-repo/start.sh'\", containerDiskInGb: 20, volumeInGb: 0, isServerless: true, startJupyter: false, startSsh: false, volumeMountPath: \"/runpod-volume\", env: [{key: \"MODEL_DIR\", value: \"/runpod-volume/models\"}, {key: \"HF_HOME\", value: \"/runpod-volume/huggingface\"}] }) { id name } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_template.json

3. Create Endpoint

cat > /tmp/rp_endpoint.json << 'EOF'
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48\", workersMin: 0, workersMax: 1, idleTimeout: 5, scalerType: \"QUEUE_DELAY\", scalerValue: 4 }) { id name gpuIds workersMin workersMax } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_endpoint.json
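
The three request bodies above can also be generated programmatically, which sidesteps hand-escaping nested quotes. The sketch below is one possible approach, not a RunPod-provided tool: `gql_literal` and `build_mutation` are hypothetical helper names, and values are inlined as GraphQL literals (rather than passed as GraphQL variables) because the input type names are not given in this document. String escaping relies on `json.dumps`, whose output for ordinary text is also a valid GraphQL string literal.

```python
import json


def gql_literal(value):
    """Render a Python value as a GraphQL literal (strings get JSON-style escaping)."""
    if isinstance(value, bool):  # must come before the int check: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        return json.dumps(value)
    if isinstance(value, list):
        return "[" + ", ".join(gql_literal(item) for item in value) + "]"
    if isinstance(value, dict):
        fields = ", ".join(f"{key}: {gql_literal(item)}" for key, item in value.items())
        return "{ " + fields + " }"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def build_mutation(name, fields, selection):
    """Return the JSON request body for `mutation { name(input: {...}) { selection } }`."""
    query = f"mutation {{ {name}(input: {gql_literal(fields)}) {{ {selection} }} }}"
    return json.dumps({"query": query})


# Example: the network-volume mutation from step 1.
volume_body = build_mutation(
    "createNetworkVolume",
    {"name": "my-models", "size": 50, "dataCenterId": "US-TX-3"},
    "id name size",
)
```

Writing `volume_body` to a file and passing it to `curl -d @file` keeps the rest of the workflow unchanged.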

GPU Tiers

Tier	VRAM	Example GPUs	Cost/sec (flex)
ADA_24	24GB	RTX 4090	~$0.00031
ADA_48_PRO	48GB	L40S	~$0.00053
AMPERE_80	80GB	A100	~$0.00076

Handler Pattern (handler.py)

import os
import subprocess

import requests
import runpod

# download() and ensure_models() are project-specific helpers (not shown here).

def handler(job):
    input_data = job["input"]

    # Download input files from URLs
    video_path = download(input_data["video_url"], "/tmp/input.mp4")

    # Run inference
    runpod.serverless.progress_update(job, "Processing...")
    result = subprocess.run(["python", "inference.py", ...], ...)

    # Return result (URL or base64)
    return {"status": "completed", "output_url": "..."}

if __name__ == "__main__":
    # Download models on cold start (cached on network volume)
    ensure_models()
    runpod.serverless.start({"handler": handler})
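
The handler above calls a `download` helper that it does not define. One possible shape for it is sketched below, using only the standard library; the `.part` suffix, the timeout, and the chunk size are illustrative choices rather than anything RunPod requires.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest, timeout=60):
    """Stream `url` to `dest` in chunks and return the destination path as a string."""
    dest_path = Path(dest)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a crashed download never looks complete.
    tmp_path = dest_path.with_name(dest_path.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    tmp_path.replace(dest_path)  # rename only after the full body was written
    return str(dest_path)
```

Streaming in chunks keeps memory use flat for large videos, which matters because the input is whatever file the web app chose to serve.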

Bootstrap Script Pattern (start.sh)

Handles first-run setup on network volume:

#!/bin/bash
set -e
# System deps
if ! command -v ffmpeg &>/dev/null; then
    apt-get update && apt-get install -y ffmpeg libgl1 libglib2.0-0
fi
# Clone ML repo (cached on volume)
if [ ! -d /runpod-volume/MyProject ]; then
    git clone https://github.com/ORG/PROJECT.git /runpod-volume/MyProject
fi
# Install deps (checked inside the container: pip packages live on the container
# disk, so a flag file on the volume goes stale when a fresh worker starts)
if ! python -c "import runpod" &>/dev/null; then
    pip install -r /runpod-volume/MyProject/requirements.txt
    pip install runpod boto3 requests huggingface_hub
fi
# Update handler (always fresh)
cd /runpod-volume/handler-repo && git pull 2>/dev/null || true
# Run
exec python -u /runpod-volume/handler-repo/handler.py

File Handling

RunPod does NOT support multipart uploads. Two approaches:
1. URL-based (recommended): Web app stores files locally, serves via Express static, sends URLs to worker. Worker downloads via HTTP.
2. Base64: Encode files in request body. Bad for large files (video).

For results: upload to S3, or return base64 for small files.

Pitfalls

ALWAYS write GraphQL queries to a JSON file and use curl -d @file. Shell escaping of nested quotes + bash inside dockerArgs inside JSON is impossible to get right inline.
dockerArgs is REQUIRED in saveTemplate — omitting it causes a validation error.
Field names differ from the docs: flashBoot is not a valid input field (omit it), and the balance field is clientBalance, not creditBalance. Confirm field names with an introspection query or by trial and error.
Network volume + model caching: Don't bake 30GB+ models into Docker images. Use a network volume with first-run download + flag file caching.
Bootstrap chicken-and-egg: dockerArgs must clone the handler repo if it doesn't exist on the volume yet. The start.sh lives IN the repo that gets cloned.
Cold start: First ever start downloads models (~15-20 min for large models). Subsequent cold starts: ~2-5 min (deps + repo already cached, just container startup).
Can't build large Docker images on small VPS: If disk < model size, use the network volume approach instead of baking models into the image.
Certbot on VPS may fail first try: ConnectionResetError is common. Retry after sleep 3.
Nginx for large uploads: Add client_max_body_size 200M; for video upload apps.
Private repo cloning in dockerArgs: If the handler repo is private, git clone fails, and any || true in the bootstrap hides the failure. The worker starts but can't find handler.py → exit code 2 + "No such file or directory". NEVER hardcode GitHub tokens in dockerArgs: they're visible in plaintext in the template config, and a leaked broad-scope token compromises the entire GitHub account. Best solutions (in order of preference):
- Build a Docker image with the handler baked in (most robust: no cloning, no auth, faster cold starts)
- Make the repo public if the handler code isn't sensitive
- Add a fine-grained GitHub PAT (read-only, scoped to single repo) as a RunPod environment variable (e.g., GH_TOKEN) and reference it: git clone https://$GH_TOKEN@github.com/USER/REPO.git ...
Serverless vs Pods: For sporadic/bursty workloads (video processing, image gen), serverless is almost always the right choice (pay-per-job, auto-scale, no idle costs). Only use persistent pods for sustained inference or when you need SSH debugging access. Don't switch architectures just because serverless isn't working — diagnose the actual issue first.

Debugging Worker Failures

Common error: worker exited with exit code 2 — usually means the Python handler script wasn't found.

Diagnosis checklist:
1. Check RunPod endpoint logs for the actual error message
2. Verify the dockerArgs or Dockerfile copies/clones the handler to the right path
3. If using git clone in dockerArgs, test the clone URL manually (private repo? auth needed?)
4. The || true pattern in bash masks clone failures; remove it temporarily to see errors
5. Compare the path in the CMD/exec line with where the handler actually ends up

Querying template config via API (to inspect dockerArgs remotely):

{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}

Pre-built Docker Image vs Runtime dockerArgs

Prefer building a Docker image with the handler baked in over cloning repos at runtime in dockerArgs. The runtime approach is fragile:
- Private repos fail silently with || true
- Runtime pip install and git clone add minutes to every cold start
- dockerArgs bash scripts are hard to debug and maintain

Better approach:
1. Write a proper Dockerfile that COPYs handler.py and installs deps at build time
2. Build and push to Docker Hub: docker build -t user/worker:latest . && docker push user/worker:latest
3. Update the RunPod template to use that image with empty dockerArgs
4. The Dockerfile CMD is the single source of truth for the entrypoint

This gives faster cold starts, no auth issues, and reproducible builds. Anything still cloned at runtime should come from public repos only (such as the MoCha model repos); your handler code is baked into the image, so the worker never needs to clone a private repo.

GPU Availability and Network Volumes

Network volumes pin workers to a specific datacenter. If the GPU tier you selected isn't available in that datacenter, workers will simply never start — the health endpoint shows 0 workers across all states (not even initializing).

Symptoms of GPU unavailability:
- Jobs sit IN_QUEUE indefinitely with massive delayTime (hours)
- Health shows: workers: {idle: 0, initializing: 0, ready: 0, running: 0, unhealthy: 0} despite queued jobs
- No error messages, just silent non-scheduling
- Workers may briefly show throttled: 1, which means RunPod tried to provision but no capacity exists

Fixes:
1. Add ALL compatible GPU tiers: gpuIds: "AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO" (comma-separated, dramatically increases availability)
2. Try removing the network volume temporarily to allow any datacenter: networkVolumeId: "" in saveEndpoint
3. Create a new network volume in a busier datacenter: use trial-and-error with the createNetworkVolume mutation (not all datacenters have storage clusters)
4. Check GPU availability: { gpuTypes { id displayName memoryInGb secureCloud communityCloud } }

Available GPU tiers for serverless: AMPERE_16, AMPERE_24, ADA_24, AMPERE_48, ADA_48_PRO, AMPERE_80, ADA_80_PRO

Worker health states explained:
- initializing → container starting, pulling image
- running → either processing a job OR still in startup code (model downloads happen BEFORE runpod.serverless.start())
- ready → idle, waiting for jobs (handler started, model download complete)
- unhealthy → worker crashed. Common causes: OOM from too-small GPU, disk space exhaustion (exit code 139 = SIGSEGV, usually disk full)
- throttled → RunPod tried to provision a GPU but none available in the pinned datacenter

CRITICAL: "running" does NOT mean "processing jobs". When a worker shows running but jobs stay IN_QUEUE, the worker is still in its startup/initialization phase (downloading models, installing deps). The job won't move to IN_PROGRESS until runpod.serverless.start() is called. For large model downloads (~28GB), this can take 10-15+ minutes.
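
The state interpretations above can be turned into a small triage function over a parsed /health response. This is a sketch under assumptions: the `workers` counters are the ones named in this document, while the `jobs.inQueue` and `jobs.inProgress` keys are assumed names for the queue counters, and `diagnose_health` and its labels are hypothetical.

```python
def diagnose_health(health):
    """Map a parsed /health response to a coarse diagnosis label."""
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    queued = jobs.get("inQueue", 0)          # assumed key name
    in_progress = jobs.get("inProgress", 0)  # assumed key name

    if workers.get("unhealthy", 0) > 0:
        return "crashed"            # OOM or disk exhaustion: read the worker logs
    if workers.get("throttled", 0) > 0:
        return "no_gpu_capacity"    # nothing available in the pinned datacenter
    active = sum(workers.get(state, 0)
                 for state in ("idle", "initializing", "ready", "running"))
    if queued > 0 and active == 0:
        return "not_scheduled"      # all zeros despite queued jobs
    if queued > 0 and workers.get("running", 0) > 0 and in_progress == 0:
        return "still_starting"     # "running" but still in startup code
    return "ok"
```

The checks run from most to least specific, so a throttled worker is reported as a capacity problem even when the queue is also non-empty.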

Network Volume Mount Path

On RunPod Serverless, network volumes mount at /runpod-volume/, NOT /workspace/. This is a critical difference from RunPod Pods where volumes mount at /workspace/. If your env vars point model storage to /workspace/, models will download to the container disk instead of the persistent volume, causing:
- Disk space exhaustion (container disk is typically 20-50GB, models can be 30GB+)
- Worker crashes with exit code 139 (SIGSEGV from disk full)
- Models re-downloading on every cold start (defeating the purpose of the volume)

Always set storage paths to /runpod-volume/ for serverless:

MODEL_DIR=/runpod-volume/mocha-models
HF_HOME=/runpod-volume/huggingface
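
A startup check can catch a misdirected path before any download begins. The sketch below assumes the two variable names used in this document; `misplaced_storage_vars` is a hypothetical helper, and it inspects path strings only (it does not verify that the volume is actually mounted).

```python
from pathlib import PurePosixPath

VOLUME_ROOT = PurePosixPath("/runpod-volume")


def misplaced_storage_vars(env, names=("MODEL_DIR", "HF_HOME")):
    """Return the storage variables in `env` that point outside the network volume."""
    misplaced = {}
    for name in names:
        value = env.get(name)
        if value is None:
            continue  # unset variables are not flagged here
        path = PurePosixPath(value)
        # Accept the volume root itself or any path below it.
        if path != VOLUME_ROOT and VOLUME_ROOT not in path.parents:
            misplaced[name] = value
    return misplaced
```

Calling `misplaced_storage_vars(os.environ)` at the top of the handler script and exiting with a clear message when the result is non-empty puts the cause in the logs rather than leaving it to surface as a disk-full crash.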

Container Disk Sizing

The container disk must fit the unpacked Docker image layers (2-3x the compressed size) PLUS any runtime files. Docker's overlay2 filesystem needs space for all layers unpacked.

Critical: compressed image size ≠ disk space needed. A 16GB compressed image can need 30-40GB unpacked. The error no space left on device during container startup (before any code runs) means the container disk is too small for just the image layers.

Sizing guide (minimums):
- With network volume: container disk = (compressed image × 2.5) + 20GB buffer. Models go on the volume.
- Without network volume: container disk = (compressed image × 2.5) + models + 20GB buffer

Example with 16GB compressed image + 28GB models:
- With volume: 40GB unpacked image + 20GB buffer = 60GB minimum; provision 100GB for headroom
- Without volume: 40GB unpacked image + 28GB models + 20GB buffer = 88GB minimum; provision 200GB, since model downloads can temporarily need close to 2x their size
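
The sizing formula can be written as a small function. This sketch uses the 2.5× unpack factor and 20GB buffer from the guide as defaults; `container_disk_gb` is a hypothetical name. It returns the formula's minimum, so for the 16GB image it gives 60GB and 88GB, and the larger figures in the example include extra headroom on top of that.

```python
import math


def container_disk_gb(compressed_image_gb, models_gb=0.0, on_network_volume=True,
                      unpack_factor=2.5, buffer_gb=20.0):
    """Return the minimum container disk size in whole GB for the sizing guide above."""
    unpacked_image_gb = compressed_image_gb * unpack_factor
    # Models only count against the container disk when there is no network volume.
    local_models_gb = 0.0 if on_network_volume else models_gb
    return math.ceil(unpacked_image_gb + local_models_gb + buffer_gb)


with_volume = container_disk_gb(16, models_gb=28, on_network_volume=True)
without_volume = container_disk_gb(16, models_gb=28, on_network_volume=False)
```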

When in doubt, use 200GB container disk. The cost difference is negligible for serverless (you only pay when running), and debugging disk space failures is painful because RunPod gives minimal error info.

Always prefer network volume for large models. Container disk alone is fragile for 20GB+ model downloads.

Updating Templates vs Releases

Updating a template via saveTemplate does NOT automatically update the active endpoint release. You must also call saveEndpoint (with the endpoint id) to trigger a new release that picks up the template changes. The RunPod UI shows a release history — check it to verify your changes propagated.

# Step 1: Update template
saveTemplate(input: { id: "TEMPLATE_ID", imageName: "new-image:latest", ... })

# Step 2: Trigger new release on endpoint  
saveEndpoint(input: { id: "ENDPOINT_ID", name: "...", templateId: "TEMPLATE_ID", ... })

Queue Management

# Purge all queued jobs
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
Response: {"removed": N, "status": "completed"}

# Cancel a specific job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/cancel/{JOB_ID}

Always purge stale jobs after fixing a worker issue — old jobs submitted under a broken config will likely fail anyway.

App-side job tracking pitfall: If your web app tracks jobs in-memory and you purge RunPod's queue via API, the app still shows jobs as IN_QUEUE. The status poll (GET /status/{JOB_ID}) will return {"error": "request does not exist"} for purged jobs. Handle this in your polling logic by marking such jobs as CANCELLED.
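
A minimal sketch of that polling-side handling follows. It assumes the error body shown above is what the status call returns after parsing (if the HTTP client raises on non-2xx responses, the body has to be read from the exception first); `normalize_status` and `refresh_jobs` are hypothetical names, and `fetch_status` stands in for whatever function the app uses to call GET /status/{JOB_ID}.

```python
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}


def normalize_status(response):
    """Translate a /status response into a status string, treating purged jobs as CANCELLED."""
    error = response.get("error")
    if error is not None and "does not exist" in str(error):
        return "CANCELLED"
    return response.get("status", "UNKNOWN")


def refresh_jobs(tracked, fetch_status):
    """Update `tracked` (job_id -> status) in place, polling only unfinished jobs."""
    for job_id, status in tracked.items():
        if status not in TERMINAL_STATUSES:
            tracked[job_id] = normalize_status(fetch_status(job_id))
    return tracked
```

Skipping jobs that are already terminal keeps the poll loop from re-querying finished work on every tick.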

Endpoint Goes 404 After Balance Runs Out

When the RunPod account balance goes negative, serverless endpoints become unreachable. The REST API (api.runpod.ai/v2/{ENDPOINT_ID}/*) returns {"message": "Not Found"} for ALL operations (health, run, status). The GraphQL API still shows the endpoint exists. This is NOT a misconfiguration — it's RunPod disabling the endpoint due to unpaid balance.

Even after adding funds, the old endpoint stays broken. The fix is to delete and recreate:

# Delete broken endpoint
{"query": "mutation { deleteEndpoint(id: \"OLD_ENDPOINT_ID\") }"}

# Create fresh one with same config
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO\", workersMin: 0, workersMax: 1, idleTimeout: 5 }) { id name } }"}

After recreating, update the endpoint ID everywhere it's referenced (web app config, environment variables, etc.) and restart the app.

Ongoing storage costs: Network volumes are billed even when no workers are running. RunPod lists network storage at about $0.07/GB/month, so a 50GB volume costs roughly $3.50/month (about $0.005/hr). Monitor currentSpendPerHr: if it's non-zero with no active pods, that's the volume storage cost slowly draining the balance.

Model Size Verification — ALWAYS Check Before Provisioning

NEVER trust model size estimates in comments or documentation. Always verify actual download size before choosing network volume or container disk sizes:

# Check actual model size via HuggingFace API (recursive=true also lists files in subfolders)
curl -s "https://huggingface.co/api/models/ORG/MODEL/tree/main?recursive=true" | python3 -c "
import sys, json
files = json.load(sys.stdin)
total = 0
for f in files:
    sz = f.get('size', 0)
    if sz > 100_000_000:
        print(f'  {f[\"path\"]:60s} {sz/1e9:.2f} GB')
    total += sz
print(f'Total: {total/1e9:.2f} GB ({len(files)} files)')
"

Real-world example: Wan2.1-T2V-14B was documented as "~26GB" but is actually 69GB. This caused a 50GB network volume to be insufficient, leading to silent crash loops.

HuggingFace download can double disk usage: older huggingface_hub releases (before 0.23) download into the HF_HOME cache first, then copy or symlink files into local_dir. If the files end up copied, you need 2x the model size in free space. On those versions, keep HF_HOME on the same volume as local_dir and call snapshot_download(local_dir=..., local_dir_use_symlinks=True) so local_dir holds symlinks instead of copies. From 0.23 onward, files are written directly to local_dir and local_dir_use_symlinks is deprecated and ignored.

Worker Crash Loop Detection

Symptom: Worker shows running=1 in health, but jobs never move from IN_QUEUE to IN_PROGRESS. The worker periodically flips between running and idle/ready.

What's happening: The handler's __main__ block runs ensure_models() (or similar startup code) BEFORE calling runpod.serverless.start(). If the startup code fails (disk full, OOM, download error), the process crashes. RunPod auto-restarts the worker, which tries again and fails again — infinite crash loop.

Key indicators:
- delayTime on jobs grows to hundreds of thousands of ms (minutes to hours)
- currentSpendPerHr stays at just the storage rate (no GPU billing = workers aren't running long enough to bill)
- Health periodically shows ready=1, idle=1 (worker restarted, hasn't crashed yet) then goes back to running=1 (trying startup code again)
- Zero completed AND zero failed jobs (handler never registered)

Diagnosis steps:
1. Check RunPod UI logs (the only way to see worker stdout/stderr)
2. Calculate total model size vs volume/disk size (most common cause)
3. If the models are larger than the volume, the download fills the volume and crashes, the partial download is cleaned up on restart, and the cycle repeats
4. Cancel all queued jobs; they'll never complete under a broken worker
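
One way to make this failure visible is a free-space check that runs before any download starts. The sketch below is an illustration, not part of the RunPod SDK: `check_free_space` and `InsufficientDiskError` are hypothetical names and the 10% safety margin is arbitrary. It does not stop RunPod from restarting the worker, but it puts the real cause in the logs on the first attempt.

```python
import shutil


class InsufficientDiskError(RuntimeError):
    """Raised when the target filesystem cannot hold the planned download."""


def check_free_space(path, required_bytes, safety_factor=1.1,
                     disk_usage=shutil.disk_usage):
    """Return the free bytes at `path`, or raise if they cannot cover the download."""
    free = disk_usage(path).free
    needed = int(required_bytes * safety_factor)
    if free < needed:
        raise InsufficientDiskError(
            f"{path}: need {needed / 1e9:.1f} GB free, only {free / 1e9:.1f} GB available"
        )
    return free
```

A call such as `check_free_space("/runpod-volume", total_model_bytes)` would go at the top of the startup code, before the first byte is downloaded; the `disk_usage` parameter is there only so the check can be tested without a real volume.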

Exit Code Reference

Code	Signal	Common Cause
1	-	Python exception, missing module
2	-	File not found (handler.py missing)
137	SIGKILL	OOM killed (system RAM; a CUDA out-of-memory error normally surfaces as a Python exception with exit code 1)
139	SIGSEGV	Segfault — usually disk full during large writes
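
The signal rows follow the usual shell convention that an exit code above 128 means "terminated by signal (code - 128)", so 137 = 128 + 9 and 139 = 128 + 11. A small decoder, with `describe_exit_code` as a hypothetical helper name:

```python
import signal


def describe_exit_code(code):
    """Describe a process exit code, decoding the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "clean exit"
    return f"exited with status {code}"
```

Codes at or below 128 are ordinary exit statuses chosen by the program itself, which is why 1 and 2 in the table have no signal.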

Debugging Workflow (recommended order)

1. Check endpoint logs in RunPod UI (Logs tab) for actual error messages
2. Check health endpoint for worker states: throttled = no GPU capacity, unhealthy = crashed
3. Check containerDiskInGb: if worker crashes during startup, it's almost always disk space
4. If workers never appear (all zeros), it's GPU availability: add more GPU tiers or remove network volume
5. If worker shows running but jobs stay IN_QUEUE for 15+ min, the startup code (model download) may be hanging or failing silently

Monitoring & Status Checks

Full Infrastructure Audit

To check everything at once, query these in parallel:

# 1. List all endpoints with config
{"query": "{ myself { endpoints { id name templateId workersMin workersMax gpuIds networkVolumeId idleTimeout } } }"}

# 2. List all pods (persistent)
{"query": "{ myself { pods { id name desiredStatus imageName volumeInGb volumeMountPath } } }"}

# 3. Check balance
{"query": "{ myself { clientBalance } }"}

GraphQL Field Gotchas

These fields DO NOT EXIST (despite seeming logical):
- myself.serverlessWorkers: no such field on User type
- endpoint.jobs: no such field on Endpoint type (use REST API for job status)
- endpoint.workers: no such field (use REST health endpoint for worker counts)
- myself.serverlessTemplates: no such field (templates are queried via endpoint.template)

To get endpoint template details:

{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}

REST API Endpoints (for jobs + health)

IMPORTANT: Job/health REST API uses api.runpod.ai, NOT api.runpod.io (GraphQL).

# Health check (worker counts)
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
Headers: Authorization: Bearer {API_KEY}

# Submit job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run

# Check job status  
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

# Purge queue
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue

Note: The health endpoint may return {"message": "Not Found"} if the endpoint has never had a successful worker start. This doesn't mean the endpoint is misconfigured — it means no worker has ever registered.

Checking Network Volume Contents

You cannot browse network volume contents remotely when all pods are stopped. The only ways to verify what's on a volume:
1. Start a pod with the volume attached and SSH in
2. Trigger a serverless cold start (submit a test job) and check the worker logs
3. The download_models.py pattern (check file existence → skip if present) serves as implicit verification on each cold start

Docker Hub Image Verification

Check if your worker image was pushed correctly:

curl -s "https://hub.docker.com/v2/repositories/{USER}/{REPO}/tags/?page_size=5" | python3 -m json.tool

Key fields: full_size (compressed), last_updated. Compare compressed size to expected — e.g., an 8.7GB compressed image with just code+deps (no baked models) vs ~30GB+ with models baked in.
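
To compare sizes at a glance, the tag listing can be reduced to one row per tag. The sketch below works on the already-parsed JSON from the command above and assumes the listing keeps its tags under a `results` array with `name`, `full_size`, and `last_updated` fields; `summarize_tags` is a hypothetical helper name.

```python
def summarize_tags(listing):
    """Return (tag name, compressed size in GB, last_updated) for each tag in the listing."""
    rows = []
    for tag in listing.get("results", []):
        size_gb = (tag.get("full_size") or 0) / 1e9  # full_size is in bytes
        rows.append((tag["name"], round(size_gb, 2), tag.get("last_updated")))
    return rows
```

A tag that reports a size far above what code plus dependencies should need is a hint that models were baked into the image by accident.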

Useful Queries

# Check balance
{"query": "{ myself { clientBalance } }"}

# List endpoints
{"query": "{ myself { endpoints { id name gpuIds workersMin workersMax } } }"}

# Update template (include id field)
{"query": "mutation { saveTemplate(input: { id: \"TEMPLATE_ID\", ... }) { id } }"}

Quick Reference: Management and Debugging Notes

Template updates

The mutation is saveTemplate, not updateTemplate.
Even partial template edits still require fields such as containerDiskInGb, volumeInGb, and env.
For custom images with a real CMD/ENTRYPOINT, set dockerArgs: "" so the image entrypoint is not overridden.
Updating a template alone is not enough; trigger a fresh endpoint release with saveEndpoint after template changes.

Crash-loop and storage diagnosis

If workers look alive but jobs stay IN_QUEUE, suspect startup code before runpod.serverless.start() — often model downloads, disk exhaustion, or failed repo/bootstrap logic.
Verify actual model sizes before provisioning. Documentation estimates were sometimes dramatically wrong.
Remember HuggingFace download/cache behavior can temporarily require close to 2x model size unless cache/local-dir choices avoid duplication.
Network volumes on serverless mount at /runpod-volume/, not /workspace/.
Container disk must fit unpacked image layers plus runtime writes; compressed image size is a misleading lower bound.

Operational failure modes worth checking early

Negative account balance can make serverless REST endpoints return Not Found even though GraphQL still shows the endpoint.
If GPU availability is poor in a pinned datacenter, adding more compatible GPU tiers is often the fastest fix.
When using runtime git clone in dockerArgs, private-repo auth failures can masquerade as missing-handler exit code 2 problems. Prefer baking the handler into the image.