---
name: runpod-serverless
description: Deploy GPU workloads on RunPod Serverless — create templates, endpoints, network volumes, and handler workers via GraphQL API. Scales to zero when idle. Use when user needs on-demand GPU for ML inference behind a web app.
version: 1.0.0
tags: [runpod, serverless, gpu, inference, deployment, api]
metadata:
  hermes:
    tags: [runpod, serverless, gpu, inference, deployment, api]
---

# RunPod Serverless Deployment

Deploy GPU inference workloads that scale to zero. Worker spins up on job submit, processes, scales back down.

## When to Use

- User needs GPU inference (image gen, video processing, LLM, etc.)
- On-demand GPU without always-on costs
- Wrapping a GitHub ML project as an API service

## Architecture Pattern

```
Web App (VPS) ──POST /run──> RunPod Endpoint ──> Worker (GPU)
     │                            │                    │
     │ <──poll /status/{id}──     │     downloads      │
     │                            │     from URLs       │
     │ <──result (video URL)──    │                    │
```

The web app on VPS handles file uploads, stores them locally (served via Express static), sends URLs to RunPod. Worker downloads files, processes, uploads result to S3 or returns base64.

## API Reference

Base: `https://api.runpod.io/graphql` (management) and `https://api.runpod.ai/v2/{ENDPOINT_ID}` (jobs)

Auth header: `Authorization: Bearer {API_KEY}`

### Job Submission
```
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
{"input": { ... }}

Response: {"id": "job-uuid", "status": "IN_QUEUE"}
```

### Job Status
```
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

Response: {"id": "...", "status": "COMPLETED", "output": { ... }}
```

Statuses: IN_QUEUE → IN_PROGRESS → COMPLETED | FAILED | CANCELLED | TIMED_OUT
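
The submit-then-poll flow above can be sketched in Python as follows. This is a minimal sketch, not an official client: `http_json` and `run_and_wait` are hypothetical helper names, the polling interval and timeout are arbitrary defaults, and `TIMED_OUT` is assumed to be an additional terminal status alongside the ones listed above.

```python
import json
import time
import urllib.request

# Terminal job states; TIMED_OUT is an assumption beyond the list in the text.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def http_json(method, url, api_key, body=None):
    """Send one JSON request with the Bearer auth header and decode the reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_and_wait(endpoint_id, api_key, job_input, request=http_json,
                 interval=5.0, timeout=1800.0, sleep=time.sleep):
    """Submit a job via /run, then poll /status/{id} until it reaches a terminal state."""
    base = f"https://api.runpod.ai/v2/{endpoint_id}"
    job = request("POST", f"{base}/run", api_key, {"input": job_input})
    deadline = time.monotonic() + timeout
    while True:
        status = request("GET", f"{base}/status/{job['id']}", api_key)
        if status.get("status") in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job['id']} still {status.get('status')}")
        sleep(interval)
```

The `request` and `sleep` parameters exist only so the loop can be exercised without network access; in a real app the defaults would be used.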

### Endpoint Health
```
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
```

## Setup Steps

### 1. Create Network Volume (for model caching)

```bash
# Write query to file to avoid shell escaping hell
cat > /tmp/rp_query.json << 'EOF'
{"query": "mutation { createNetworkVolume(input: { name: \"my-models\", size: 50, dataCenterId: \"US-TX-3\" }) { id name size } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_query.json
```

### 2. Create Template

**IMPORTANT**: `dockerArgs` is a REQUIRED field (not optional).

```bash
cat > /tmp/rp_template.json << 'EOF'
{"query": "mutation { saveTemplate(input: { name: \"My Worker\", imageName: \"runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04\", dockerArgs: \"bash -c 'if [ ! -d /runpod-volume/handler-repo ]; then git clone https://github.com/USER/REPO.git /runpod-volume/handler-repo; fi && bash /runpod-volume/handler-repo/start.sh'\", containerDiskInGb: 20, volumeInGb: 0, isServerless: true, startJupyter: false, startSsh: false, volumeMountPath: \"/runpod-volume\", env: [{key: \"MODEL_DIR\", value: \"/runpod-volume/models\"}, {key: \"HF_HOME\", value: \"/runpod-volume/huggingface\"}] }) { id name } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_template.json
```

### 3. Create Endpoint

```bash
cat > /tmp/rp_endpoint.json << 'EOF'
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48\", workersMin: 0, workersMax: 1, idleTimeout: 5, scalerType: \"QUEUE_DELAY\", scalerValue: 4 }) { id name gpuIds workersMin workersMax } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_endpoint.json
```
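
If the mutations are generated from a script rather than written by hand, letting a JSON encoder do the quoting avoids the same escaping problems. The sketch below builds the `saveEndpoint` payload from step 3; `gql_string` and `save_endpoint_payload` are hypothetical helpers, and using `json.dumps` to quote GraphQL string literals is an assumption that holds for ordinary text (quotes, backslashes, newlines).

```python
import json


def gql_string(value):
    """Quote a Python string as a GraphQL string literal (JSON-style escaping)."""
    return json.dumps(value)


def save_endpoint_payload(name, template_id, volume_id, gpu_ids,
                          workers_min=0, workers_max=1, idle_timeout=5):
    """Return the JSON request body for a saveEndpoint mutation."""
    mutation = (
        "mutation { saveEndpoint(input: { "
        f"name: {gql_string(name)}, "
        f"templateId: {gql_string(template_id)}, "
        f"networkVolumeId: {gql_string(volume_id)}, "
        f"gpuIds: {gql_string(gpu_ids)}, "
        f"workersMin: {int(workers_min)}, "
        f"workersMax: {int(workers_max)}, "
        f"idleTimeout: {int(idle_timeout)} "
        "}) { id name gpuIds workersMin workersMax } }"
    )
    # The outer json.dumps escapes the mutation a second time for the HTTP body.
    return json.dumps({"query": mutation})
```

Writing the returned string to a file and passing it to `curl -d @file` keeps the workflow identical to the steps above.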

### GPU Tiers
| Tier | VRAM | Example GPUs | Cost/sec (flex) |
|------|------|-------------|-----------------|
| ADA_24 | 24GB | RTX 4090 | ~$0.00031 |
| ADA_48_PRO | 48GB | L40S | ~$0.00053 |
| AMPERE_80 | 80GB | A100 | ~$0.00076 |

## Handler Pattern (handler.py)

```python
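import os
import subprocess

import requests
import runpod


def download(url, dest):
    """Stream a remote file to disk and return the local path."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    return dest


def ensure_models():
    """Placeholder: download models into MODEL_DIR unless already cached."""
    os.makedirs(os.environ.get("MODEL_DIR", "/runpod-volume/models"), exist_ok=True)


def handler(job):
    input_data = job["input"]

    # Download input files from URLs
    video_path = download(input_data["video_url"], "/tmp/input.mp4")

    # Run inference (the "..." are placeholders for your project's arguments)
    runpod.serverless.progress_update(job, "Processing...")
    result = subprocess.run(["python", "inference.py", ...], ...)

    # Return result (URL or base64)
    return {"status": "completed", "output_url": "..."}


if __name__ == "__main__":
    # Download models on cold start (cached on network volume)
    ensure_models()
    runpod.serverless.start({"handler": handler})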
```

## Bootstrap Script Pattern (start.sh)

Handles first-run setup on network volume:

```bash
#!/bin/bash
set -e
# System deps
if ! command -v ffmpeg &>/dev/null; then
    apt-get update && apt-get install -y ffmpeg libgl1 libglib2.0-0
fi
# Clone ML repo (cached on volume)
if [ ! -d /runpod-volume/MyProject ]; then
    git clone https://github.com/ORG/PROJECT.git /runpod-volume/MyProject
fi
# Install deps (cached via flag file)
if [ ! -f /runpod-volume/.deps_installed_v1 ]; then
    pip install -r /runpod-volume/MyProject/requirements.txt
    pip install runpod boto3 requests huggingface_hub
    touch /runpod-volume/.deps_installed_v1
fi
# Update handler (always fresh)
cd /runpod-volume/handler-repo && git pull 2>/dev/null || true
# Run
exec python -u /runpod-volume/handler-repo/handler.py
```
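
The flag-file caching used in `start.sh` carries over to Python startup code such as a model download step. A minimal sketch, assuming a hypothetical `run_once` helper and a flag path of your choosing on the volume:

```python
from pathlib import Path


def run_once(flag_path, action):
    """Run action() only if the flag file is absent, then create the flag.

    Returns True when the action ran, False when it was skipped.
    The flag is written only after action() returns, so a crash mid-way
    leaves no flag and the next cold start retries.
    """
    flag = Path(flag_path)
    if flag.exists():
        return False
    action()
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return True
```

Bumping the version suffix in the flag name (as with `.deps_installed_v1`) forces a fresh run after the setup steps change.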

## File Handling

RunPod does NOT support multipart uploads. Two approaches:
1. **URL-based (recommended)**: Web app stores files locally, serves via Express static, sends URLs to worker. Worker downloads via HTTP.
2. **Base64**: Encode files in request body. Bad for large files (video).

For results: upload to S3, or return base64 for small files.
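
One way to choose between the two approaches is by payload size, since base64 inflates data by about a third. The sketch below is illustrative only: `build_job_input`, the `file_b64` and `file_url` keys, and the 1 MB threshold are all assumptions, and the worker's handler would need to accept the same keys.

```python
import base64


def build_job_input(data: bytes, url: str, max_inline_bytes: int = 1_000_000):
    """Inline small files as base64; send a download URL for anything larger."""
    if len(data) <= max_inline_bytes:
        # base64 output is 4 bytes for every 3 input bytes (rounded up).
        return {"file_b64": base64.b64encode(data).decode("ascii")}
    return {"file_url": url}
```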

## Pitfalls

1. **ALWAYS write GraphQL queries to a JSON file** and use `curl -d @file`. Shell escaping of nested quotes + bash inside dockerArgs inside JSON is impossible to get right inline.
2. **dockerArgs is REQUIRED** in saveTemplate — omitting it causes a validation error.
3. **Field names differ from docs**: `flashBoot` is not a valid field (omit it), and `creditBalance` is actually `clientBalance`. Confirm field names with introspection or trial/error.
4. **Network volume + model caching**: Don't bake 30GB+ models into Docker images. Use a network volume with first-run download + flag file caching.
5. **Bootstrap chicken-and-egg**: dockerArgs must clone the handler repo if it doesn't exist on the volume yet. The start.sh lives IN the repo that gets cloned.
6. **Cold start**: First ever start downloads models (~15-20 min for large models). Subsequent cold starts: ~2-5 min (deps + repo already cached, just container startup).
7. **Can't build large Docker images on small VPS**: If disk < model size, use the network volume approach instead of baking models into the image.
8. **Certbot on VPS may fail first try**: `ConnectionResetError` is common. Retry after `sleep 3`.
9. **Nginx for large uploads**: Add `client_max_body_size 200M;` for video upload apps.
10. **Private repo cloning in dockerArgs**: If the handler repo is private, `git clone` fails, and `|| true` hides the failure. The worker starts but can't find handler.py → `exit code 2` + "No such file or directory". **NEVER hardcode GitHub tokens in dockerArgs**: they're visible in plaintext in the template config, and a leaked broad-scope token compromises the entire GitHub account. Best solutions (in order of preference):
    - **Build a Docker image with the handler baked in** (most robust: no cloning, no auth, faster cold starts)
    - Make the repo public if the handler code isn't sensitive
    - Add a fine-grained GitHub PAT (read-only, scoped to single repo) as a RunPod **environment variable** (e.g., `GH_TOKEN`) and reference it: `git clone https://$GH_TOKEN@github.com/USER/REPO.git ...`
11. **Serverless vs Pods**: For sporadic/bursty workloads (video processing, image gen), serverless is almost always the right choice (pay-per-job, auto-scale, no idle costs). Only use persistent pods for sustained inference or when you need SSH debugging access. Don't switch architectures just because serverless isn't working — diagnose the actual issue first.

## Debugging Worker Failures

Common error: `worker exited with exit code 2` — usually means the Python handler script wasn't found.

**Diagnosis checklist:**
1. Check RunPod endpoint logs for the actual error message
2. Verify the `dockerArgs` or Dockerfile copies/clones the handler to the right path
3. If using `git clone` in dockerArgs, test the clone URL manually (private repo? auth needed?)
4. The `|| true` pattern in bash masks clone failures — remove it temporarily to see errors
5. Compare the path in the CMD/exec line with where the handler actually ends up

**Querying template config via API** (to inspect dockerArgs remotely):
```bash
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
```

### Pre-built Docker Image vs Runtime dockerArgs

**Prefer building a Docker image** with the handler baked in over cloning repos at runtime in `dockerArgs`. The runtime approach is fragile:
- Private repos fail silently with `|| true`
- Runtime `pip install` and `git clone` add minutes to every cold start
- `dockerArgs` bash scripts are hard to debug and maintain

**Better approach:**
1. Write a proper Dockerfile that COPYs handler.py and installs deps at build time
2. Build and push to Docker Hub: `docker build -t user/worker:latest . && docker push user/worker:latest`
3. Update the RunPod template to use that image with empty `dockerArgs`
4. The Dockerfile CMD is the single source of truth for the entrypoint

This gives faster cold starts, no auth issues, and reproducible builds. Only public model repos are fetched at runtime; your handler code is baked into the image, so the worker never needs to clone a private repo.

### GPU Availability and Network Volumes

**Network volumes pin workers to a specific datacenter.** If the GPU tier you selected isn't available in that datacenter, workers will simply never start — the health endpoint shows 0 workers across all states (not even initializing).

**Symptoms of GPU unavailability:**
- Jobs sit IN_QUEUE indefinitely with massive `delayTime` (hours)
- Health shows: `workers: {idle: 0, initializing: 0, ready: 0, running: 0, unhealthy: 0}` despite queued jobs
- No error messages — just silent non-scheduling
- Workers may briefly show `throttled: 1` — this means RunPod tried to provision but no capacity exists

**Fixes:**
1. Add ALL compatible GPU tiers: `gpuIds: "AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO"` (comma-separated, dramatically increases availability)
2. Try removing the network volume temporarily to allow any datacenter: `networkVolumeId: ""` in saveEndpoint
3. Create a new network volume in a busier datacenter — use trial-and-error with `createNetworkVolume` mutation (not all datacenters have storage clusters)
4. Check GPU availability: `{ gpuTypes { id displayName memoryInGb secureCloud communityCloud } }`
5. Available GPU tiers for serverless: `AMPERE_16`, `AMPERE_24`, `ADA_24`, `AMPERE_48`, `ADA_48_PRO`, `AMPERE_80`, `ADA_80_PRO`

**Worker health states explained:**
- `initializing` → container starting, pulling image
- `running` → either processing a job OR still in startup code (model downloads happen BEFORE `runpod.serverless.start()`)
- `ready` → idle, waiting for jobs (handler started, model download complete)
- `unhealthy` → worker crashed. Common causes: OOM from too-small GPU, disk space exhaustion (exit code 139 = SIGSEGV, usually disk full)
- `throttled` → RunPod tried to provision a GPU but none available in the pinned datacenter

**CRITICAL: "running" does NOT mean "processing jobs"**
When a worker shows `running` but jobs stay `IN_QUEUE`, the worker is still in its startup/initialization phase (downloading models, installing deps). The job won't move to `IN_PROGRESS` until `runpod.serverless.start()` is called. For large model downloads (~28GB), this can take 10-15+ minutes.
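
The state descriptions above can be turned into a rough triage function for the health response. This is a sketch under assumptions: `diagnose` is a hypothetical helper, `workers` is the `workers` object from the health endpoint, and the queued-job count is passed in separately because the exact field name for it is not given in this document.

```python
WORKER_STATES = ("idle", "initializing", "ready", "running", "throttled", "unhealthy")


def diagnose(workers, queued_jobs):
    """Map worker counts plus a queued-job count to a likely explanation."""
    counts = {state: int(workers.get(state, 0)) for state in WORKER_STATES}
    if queued_jobs <= 0:
        return "nothing queued"
    if counts["unhealthy"]:
        return "worker crashed: check logs, disk size, and memory"
    if counts["throttled"]:
        return "no GPU capacity in the pinned datacenter"
    if not any(counts.values()):
        return "workers never scheduled: likely GPU unavailability"
    if counts["initializing"]:
        return "container starting or pulling image"
    if counts["running"] and not (counts["ready"] or counts["idle"]):
        return "running: processing a job or still in startup code"
    return "workers ready: jobs should be picked up shortly"
```

The result is a hint, not a verdict; the endpoint logs remain the authoritative source.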

### Network Volume Mount Path

**On RunPod Serverless, network volumes mount at `/runpod-volume/`, NOT `/workspace/`.** This is a critical difference from RunPod Pods where volumes mount at `/workspace/`. If your env vars point model storage to `/workspace/`, models will download to the container disk instead of the persistent volume, causing:
- Disk space exhaustion (container disk is typically 20-50GB, models can be 30GB+)
- Worker crashes with exit code 139 (SIGSEGV from disk full)
- Models re-downloading on every cold start (defeating the purpose of the volume)

**Always set storage paths to `/runpod-volume/` for serverless:**
```
MODEL_DIR=/runpod-volume/models
HF_HOME=/runpod-volume/huggingface
```

### Container Disk Sizing

The container disk must fit the **unpacked** Docker image layers (2-3x the compressed size) PLUS any runtime files. Docker's overlay2 filesystem needs space for all layers unpacked.

**Critical: compressed image size ≠ disk space needed.** A 16GB compressed image can need 30-40GB unpacked. The error `no space left on device` during container startup (before any code runs) means the container disk is too small for just the image layers.

Sizing guide:
- **With network volume**: container disk = `(compressed image × 2.5) + 20GB buffer`. Models go on the volume.
- **Without network volume**: container disk = `(compressed image × 2.5) + models + 20GB buffer`

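Example with 16GB compressed image + 28GB models:
- With volume: ~40GB unpacked image + 20GB buffer = 60GB minimum; a 100GB container disk leaves comfortable headroom
- Without volume: 40GB image + 28GB models + 20GB buffer = 88GB minimum; a 200GB container disk leaves room for download caches and temp files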

**When in doubt, use 200GB container disk.** The cost difference is negligible for serverless (you only pay when running), and debugging disk space failures is painful because RunPod gives minimal error info.

**Always prefer network volume for large models.** Container disk alone is fragile for 20GB+ model downloads.
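
The sizing guide reduces to a small calculation. The sketch below applies the formula literally and returns the minimum it implies; the function name and defaults are illustrative, and the result is a floor to round up from, not a recommendation.

```python
import math


def container_disk_gb(compressed_image_gb, models_gb=0.0, has_network_volume=True,
                      unpack_factor=2.5, buffer_gb=20.0):
    """Minimum container disk in GB per the sizing guide (rounded up)."""
    total = compressed_image_gb * unpack_factor + buffer_gb
    if not has_network_volume:
        # Without a volume the models also live on the container disk.
        total += models_gb
    return math.ceil(total)
```

For the 16GB image and 28GB of models used in the example, this yields 60GB with a volume and 88GB without, both below the round figures suggested in the text, which include extra headroom.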

### Updating Templates vs Releases

**Updating a template via `saveTemplate` does NOT automatically update the active endpoint release.** You must also call `saveEndpoint` (with the endpoint `id`) to trigger a new release that picks up the template changes. The RunPod UI shows a release history — check it to verify your changes propagated.

```bash
# Step 1: Update template
saveTemplate(input: { id: "TEMPLATE_ID", imageName: "new-image:latest", ... })

# Step 2: Trigger new release on endpoint  
saveEndpoint(input: { id: "ENDPOINT_ID", name: "...", templateId: "TEMPLATE_ID", ... })
```

### Queue Management

```bash
# Purge all queued jobs
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
Response: {"removed": N, "status": "completed"}

# Cancel a specific job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/cancel/{JOB_ID}
```

Always purge stale jobs after fixing a worker issue — old jobs submitted under a broken config will likely fail anyway.

**App-side job tracking pitfall:** If your web app tracks jobs in-memory and you purge RunPod's queue via API, the app still shows jobs as IN_QUEUE. The status poll (`GET /status/{JOB_ID}`) will return `{"error": "request does not exist"}` for purged jobs. Handle this in your polling logic by marking such jobs as CANCELLED.
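
A minimal sketch of that polling fix, assuming the app keeps an in-memory dict of job ID to last known status. `resolve_status` and `refresh_jobs` are hypothetical names, and the match on the error text is based on the message quoted above.

```python
def resolve_status(response):
    """Translate a /status response into a status string, treating purged jobs as cancelled."""
    if "does not exist" in str(response.get("error", "")).lower():
        return "CANCELLED"
    return response.get("status", "UNKNOWN")


def refresh_jobs(jobs, fetch_status):
    """Re-poll every job that is not yet finished and update the dict in place."""
    for job_id, state in jobs.items():
        if state in ("IN_QUEUE", "IN_PROGRESS"):
            jobs[job_id] = resolve_status(fetch_status(job_id))
    return jobs
```

Here `fetch_status` stands for whatever function the app already uses to call `GET /status/{JOB_ID}` and parse the JSON.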

### Endpoint Goes 404 After Balance Runs Out

**When the RunPod account balance goes negative, serverless endpoints become unreachable.** The REST API (`api.runpod.ai/v2/{ENDPOINT_ID}/*`) returns `{"message": "Not Found"}` for ALL operations (health, run, status). The GraphQL API still shows the endpoint exists. This is NOT a misconfiguration — it's RunPod disabling the endpoint due to unpaid balance.

**Even after adding funds, the old endpoint stays broken.** The fix is to delete and recreate:
```bash
# Delete broken endpoint
{"query": "mutation { deleteEndpoint(id: \"OLD_ENDPOINT_ID\") }"}

# Create fresh one with same config
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO\", workersMin: 0, workersMax: 1, idleTimeout: 5 }) { id name } }"}
```

**After recreating**, update the endpoint ID everywhere it's referenced (web app config, environment variables, etc.) and restart the app.

**Ongoing storage costs**: Network volumes cost ~$0.047/hr per 50GB ($34/month) even when no workers are running. Monitor `currentSpendPerHr` — if it's non-zero with no active pods, that's the volume storage cost slowly draining the balance.
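
The arithmetic behind those figures, as a small sketch for estimating how long an idle balance lasts. The helper names are illustrative and the hourly rate is taken from the text above, so check it against your account's `currentSpendPerHr` before relying on it.

```python
def monthly_cost(hourly_rate, days=30):
    """Convert an hourly spend rate into a cost over the given number of days."""
    return hourly_rate * 24 * days


def hours_until_empty(balance, hourly_rate):
    """Hours until the balance reaches zero at a constant idle spend rate."""
    if hourly_rate <= 0:
        return float("inf")
    return balance / hourly_rate
```

At the quoted rate of $0.047/hr, `monthly_cost(0.047)` comes to roughly $33.84, which matches the ~$34/month figure.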

### Model Size Verification — ALWAYS Check Before Provisioning

**NEVER trust model size estimates in comments or documentation.** Always verify actual download size before choosing network volume or container disk sizes:

```bash
# Check actual model size via HuggingFace API
curl -s "https://huggingface.co/api/models/ORG/MODEL/tree/main?recursive=true" | python3 -c "
import sys, json
files = json.load(sys.stdin)
total = 0
for f in files:
    sz = f.get('size', 0)
    if sz > 100_000_000:
        print(f'  {f[\"path\"]:60s} {sz/1e9:.2f} GB')
    total += sz
print(f'Total: {total/1e9:.2f} GB ({len(files)} files)')
"
```

**Real-world example**: Wan2.1-T2V-14B was documented as "~26GB" but is actually **69GB**. This caused a 50GB network volume to be insufficient, leading to silent crash loops.

**HuggingFace download can double disk usage**: older `huggingface_hub` releases download to the `HF_HOME` cache first, then copy or symlink the files into `local_dir`. When the files are copied rather than symlinked, you need **2x the model size** in free space. Keep `HF_HOME` on the same volume as `local_dir` and use `snapshot_download(local_dir=..., local_dir_use_symlinks=True)` to avoid doubling. Newer releases deprecate `local_dir_use_symlinks` and write directly into `local_dir`, so check which version the worker installs.
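
Once the real file sizes are known, the required free space follows directly. The sketch below is a rough planner, not a rule: `required_free_gb` is a hypothetical helper, the 10% headroom is an arbitrary default, and `cache_doubles` models the worst case where cache and `local_dir` each hold a full copy.

```python
def required_free_gb(file_sizes_bytes, cache_doubles=True, headroom=0.10):
    """Estimate free space needed (GB) to download a set of model files."""
    total_gb = sum(file_sizes_bytes) / 1e9
    factor = 2.0 if cache_doubles else 1.0
    return total_gb * factor * (1 + headroom)


def fits(volume_gb, file_sizes_bytes, **kwargs):
    """True if the volume can hold the download under the given assumptions."""
    return volume_gb >= required_free_gb(file_sizes_bytes, **kwargs)
```

With the 69GB model from the example above, a 50GB volume fails the check even with no doubling and no headroom.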

### Worker Crash Loop Detection

**Symptom**: Worker shows `running=1` in health, but jobs never move from `IN_QUEUE` to `IN_PROGRESS`. The worker periodically flips between `running` and `idle/ready`.

**What's happening**: The handler's `__main__` block runs `ensure_models()` (or similar startup code) BEFORE calling `runpod.serverless.start()`. If the startup code fails (disk full, OOM, download error), the process crashes. RunPod auto-restarts the worker, which tries again and fails again — infinite crash loop.

**Key indicators**:
- `delayTime` on jobs grows to hundreds of thousands of ms (minutes to hours)
- `currentSpendPerHr` stays at just the storage rate (no GPU billing = workers aren't running long enough to bill)
- Health periodically shows `ready=1, idle=1` (worker restarted, hasn't crashed yet) then goes back to `running=1` (trying startup code again)
- Zero completed AND zero failed jobs (handler never registered)

**Diagnosis steps**:
1. Check RunPod UI logs (the only way to see worker stdout/stderr)
2. Calculate total model size vs volume/disk size (most common cause)
3. If models > volume size, the download fills the volume, the process crashes, the partial download is cleaned up on restart, and the cycle repeats
4. Cancel all queued jobs — they'll never complete under a broken worker

### Exit Code Reference

| Code | Signal | Common Cause |
|------|--------|-------------|
| 1 | - | Python exception, missing module |
| 2 | - | File not found (handler.py missing) |
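| 137 | SIGKILL | OOM killed (system RAM); GPU VRAM exhaustion usually surfaces as a CUDA out-of-memory exception (exit code 1) instead |
| 139 | SIGSEGV | Segfault; in these deployments it usually coincided with a full disk during large writes |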
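
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is how 137 and 139 map to SIGKILL and SIGSEGV. A small sketch for decoding them with the standard library (`describe_exit_code` is an illustrative name):

```python
import signal


def describe_exit_code(code):
    """Return the signal name for exit codes of the form 128 + N, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None
```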

### Debugging Workflow (recommended order)

1. Check endpoint logs in RunPod UI (Logs tab) for actual error messages
2. Check health endpoint for worker states — `throttled` = no GPU capacity, `unhealthy` = crashed
3. Check `containerDiskInGb` — if worker crashes during startup, it's almost always disk space
4. If workers never appear (all zeros), it's GPU availability — add more GPU tiers or remove network volume
5. If worker shows `running` but jobs stay `IN_QUEUE` for 15+ min, the startup code (model download) may be hanging or failing silently

## Monitoring & Status Checks

### Full Infrastructure Audit

To check everything at once, query these in parallel:

```bash
# 1. List all endpoints with config
{"query": "{ myself { endpoints { id name templateId workersMin workersMax gpuIds networkVolumeId idleTimeout } } }"}

# 2. List all pods (persistent)
{"query": "{ myself { pods { id name desiredStatus imageName volumeInGb volumeMountPath } } }"}

# 3. Check balance
{"query": "{ myself { clientBalance } }"}
```

### GraphQL Field Gotchas

These fields DO NOT EXIST (despite seeming logical):
- `myself.serverlessWorkers` — no such field on User type
- `endpoint.jobs` — no such field on Endpoint type (use REST API for job status)
- `endpoint.workers` — no such field (use REST health endpoint for worker counts)
- `myself.serverlessTemplates` — no such field (templates are queried via `endpoint.template`)

To get endpoint template details:
```bash
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
```

### REST API Endpoints (for jobs + health)

**IMPORTANT**: Job/health REST API uses `api.runpod.ai`, NOT `api.runpod.io` (GraphQL).

```bash
# Health check (worker counts)
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
Headers: Authorization: Bearer {API_KEY}

# Submit job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run

# Check job status  
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

# Purge queue
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
```

**Note**: The health endpoint may return `{"message": "Not Found"}` if the endpoint has never had a successful worker start. This doesn't mean the endpoint is misconfigured — it means no worker has ever registered.

### Checking Network Volume Contents

**You cannot browse network volume contents remotely when all pods are stopped.** The only ways to verify what's on a volume:
1. Start a pod with the volume attached and SSH in
2. Trigger a serverless cold start (submit a test job) and check the worker logs
3. A check-then-download script (e.g. a `download_models.py` that checks file existence and skips files already present) serves as implicit verification on each cold start

### Docker Hub Image Verification

Check if your worker image was pushed correctly:
```bash
curl -s "https://hub.docker.com/v2/repositories/{USER}/{REPO}/tags/?page_size=5" | python3 -m json.tool
```

Key fields: `full_size` (compressed), `last_updated`. Compare the compressed size to what you expect: for example, an image with just code+deps (no baked models) might be around 8.7GB compressed, while one with models baked in would be ~30GB+.
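
To pull just those fields out of the tags response, something like the following works on the parsed JSON. It is a sketch that assumes the response carries a `results` list whose entries include `name`, `full_size` (bytes), and `last_updated`; `summarize_tags` is a hypothetical helper.

```python
def summarize_tags(payload):
    """Return (tag, compressed size in GB, last_updated) for each tag in the response."""
    return [
        (tag["name"], round(tag["full_size"] / 1e9, 2), tag["last_updated"])
        for tag in payload.get("results", [])
    ]
```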

## Useful Queries

```bash
# Check balance
{"query": "{ myself { clientBalance } }"}

# List endpoints
{"query": "{ myself { endpoints { id name gpuIds workersMin workersMax } } }"}

# Update template (include id field)
{"query": "mutation { saveTemplate(input: { id: \"TEMPLATE_ID\", ... }) { id } }"}
```

## Session-specific management/debugging notes absorbed from narrower RunPod skills

### Template updates
- The mutation is `saveTemplate`, not `updateTemplate`.
- Even partial template edits still require fields such as `containerDiskInGb`, `volumeInGb`, and `env`.
- For custom images with a real CMD/ENTRYPOINT, set `dockerArgs: ""` so the image entrypoint is not overridden.
- Updating a template alone is not enough; trigger a fresh endpoint release with `saveEndpoint` after template changes.

### Crash-loop and storage diagnosis
- If workers look alive but jobs stay `IN_QUEUE`, suspect startup code before `runpod.serverless.start()` — often model downloads, disk exhaustion, or failed repo/bootstrap logic.
- Verify actual model sizes before provisioning. Documentation estimates were sometimes dramatically wrong.
- Remember HuggingFace download/cache behavior can temporarily require close to 2x model size unless cache/local-dir choices avoid duplication.
- Network volumes on serverless mount at `/runpod-volume/`, not `/workspace/`.
- Container disk must fit unpacked image layers plus runtime writes; compressed image size is a misleading lower bound.

### Operational failure modes worth checking early
- Negative account balance can make serverless REST endpoints return `Not Found` even though GraphQL still shows the endpoint.
- If GPU availability is poor in a pinned datacenter, adding more compatible GPU tiers is often the fastest fix.
- When using runtime `git clone` in `dockerArgs`, private-repo auth failures can masquerade as missing-handler exit code 2 problems. Prefer baking the handler into the image.