---
name: runpod-serverless
description: Deploy GPU workloads on RunPod Serverless - create templates, endpoints, network volumes, and handler workers via GraphQL API. Scales to zero when idle. Use when user needs on-demand GPU for ML inference behind a web app.
version: 1.0.0
tags: [runpod, serverless, gpu, inference, deployment, api]
metadata:
  hermes:
    tags: [runpod, serverless, gpu, inference, deployment, api]
---

# RunPod Serverless Deployment

Deploy GPU inference workloads that scale to zero. A worker spins up when a job is submitted, processes it, then scales back down.

## When to Use

- User needs GPU inference (image gen, video processing, LLM, etc.)
- On-demand GPU without always-on costs
- Wrapping a GitHub ML project as an API service

## Architecture Pattern

```
Web App (VPS) ──POST /run──> RunPod Endpoint ──> Worker (GPU)
     │                              │                 │
     │ <──poll /status/{id}──       │    downloads    │
     │                              │    from URLs    │
     │ <──result (video URL)──      │                 │
```

The web app on the VPS handles file uploads, stores them locally (served via Express static), and sends URLs to RunPod. The worker downloads the files, processes them, and uploads the result to S3 or returns base64.

## API Reference

Base: `https://api.runpod.io/graphql` (management) and `https://api.runpod.ai/v2/{ENDPOINT_ID}` (jobs)

Auth header: `Authorization: Bearer {API_KEY}`

### Job Submission

```
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run
{"input": { ... }}

Response: {"id": "job-uuid", "status": "IN_QUEUE"}
```

### Job Status

```
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

Response: {"id": "...", "status": "COMPLETED", "output": { ... }}
```

Statuses: IN_QUEUE → IN_PROGRESS → COMPLETED | FAILED | CANCELLED

### Endpoint Health

```
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
```

## Setup Steps

### 1. Create Network Volume (for model caching)

```bash
# Write query to file to avoid shell escaping hell
cat > /tmp/rp_query.json << 'EOF'
{"query": "mutation { createNetworkVolume(input: { name: \"my-models\", size: 50, dataCenterId: \"US-TX-3\" }) { id name size } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_query.json
```

### 2. Create Template

**IMPORTANT**: `dockerArgs` is a REQUIRED field (not optional).

```bash
cat > /tmp/rp_template.json << 'EOF'
{"query": "mutation { saveTemplate(input: { name: \"My Worker\", imageName: \"runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04\", dockerArgs: \"bash -c 'if [ ! -d /runpod-volume/handler-repo ]; then git clone https://github.com/USER/REPO.git /runpod-volume/handler-repo; fi && bash /runpod-volume/handler-repo/start.sh'\", containerDiskInGb: 20, volumeInGb: 0, isServerless: true, startJupyter: false, startSsh: false, volumeMountPath: \"/runpod-volume\", env: [{key: \"MODEL_DIR\", value: \"/runpod-volume/models\"}, {key: \"HF_HOME\", value: \"/runpod-volume/huggingface\"}] }) { id name } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_template.json
```

### 3. Create Endpoint

```bash
cat > /tmp/rp_endpoint.json << 'EOF'
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48\", workersMin: 0, workersMax: 1, idleTimeout: 5, scalerType: \"QUEUE_DELAY\", scalerValue: 4 }) { id name gpuIds workersMin workersMax } }"}
EOF

curl -s -H "Authorization: Bearer $RUNPOD_KEY" \
  https://api.runpod.io/graphql \
  -H "Content-Type: application/json" \
  -d @/tmp/rp_endpoint.json
```

### GPU Tiers

| Tier | VRAM | Example GPUs | Cost/sec (flex) |
|------|------|-------------|-----------------|
| AMPERE_24 | 24GB | RTX 4090 | ~$0.00031 |
| AMPERE_48 | 48GB | L40S | ~$0.00053 |
| AMPERE_80 | 80GB | A100 | ~$0.00076 |

## Handler Pattern (handler.py)

```python
import os
import subprocess

import requests
import runpod


def download(url, dest):
    # Stream a remote file to disk and return the local path
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    return dest


def ensure_models():
    # Project-specific stub: download models into MODEL_DIR if they are missing
    os.makedirs(os.environ.get("MODEL_DIR", "/runpod-volume/models"), exist_ok=True)


def handler(job):
    input_data = job["input"]

    # Download input files from URLs
    video_path = download(input_data["video_url"], "/tmp/input.mp4")

    # Run inference (placeholder arguments; adapt to your project's CLI)
    runpod.serverless.progress_update(job, "Processing...")
    subprocess.run(["python", "inference.py", video_path], check=True)

    # Return result (URL or base64)
    return {"status": "completed", "output_url": "..."}


if __name__ == "__main__":
    # Download models on cold start (cached on network volume)
    ensure_models()
    runpod.serverless.start({"handler": handler})
```

## Bootstrap Script Pattern (start.sh)

Handles first-run setup on the network volume:

```bash
#!/bin/bash
set -e

# System deps
if ! command -v ffmpeg &>/dev/null; then
  apt-get update && apt-get install -y ffmpeg libgl1 libglib2.0-0
fi

# Clone ML repo (cached on volume)
if [ ! -d /runpod-volume/MyProject ]; then
  git clone https://github.com/ORG/PROJECT.git /runpod-volume/MyProject
fi

# Install deps (cached via flag file)
# Note: pip installs into the container while the flag file lives on the volume.
# If a fresh container reports missing modules, install into a venv on the volume instead.
if [ ! -f /runpod-volume/.deps_installed_v1 ]; then
  pip install -r /runpod-volume/MyProject/requirements.txt
  pip install runpod boto3 requests huggingface_hub
  touch /runpod-volume/.deps_installed_v1
fi

# Update handler (always fresh)
cd /runpod-volume/handler-repo && git pull 2>/dev/null || true

# Run
exec python -u /runpod-volume/handler-repo/handler.py
```

## File Handling

RunPod does NOT support multipart uploads. Two approaches:

1. **URL-based (recommended)**: The web app stores files locally, serves them via Express static, and sends URLs to the worker. The worker downloads them via HTTP.
2. **Base64**: Encode files in the request body. Bad for large files (video).

For results: upload to S3, or return base64 for small files.

## Pitfalls

1. **ALWAYS write GraphQL queries to a JSON file** and use `curl -d @file`. Shell escaping of nested quotes plus bash inside dockerArgs inside JSON is nearly impossible to get right inline.
2. **dockerArgs is REQUIRED** in saveTemplate. Omitting it causes a validation error.
3. **Field names differ from docs**: `flashBoot` is not a valid field (omit it), and `creditBalance` is actually `clientBalance`. Verify field names with introspection or trial and error.
4. **Network volume + model caching**: Don't bake 30GB+ models into Docker images. Use a network volume with first-run download + flag file caching.
5. **Bootstrap chicken-and-egg**: dockerArgs must clone the handler repo if it doesn't exist on the volume yet. The start.sh lives IN the repo that gets cloned.
6. **Cold start**: The first ever start downloads models (~15-20 min for large models). Subsequent cold starts: ~2-5 min (deps + repo already cached, just container startup).
7. **Can't build large Docker images on small VPS**: If disk < model size, use the network volume approach instead of baking models into the image.
8. **Certbot on VPS may fail first try**: `ConnectionResetError` is common. Retry after `sleep 3`.
9. **Nginx for large uploads**: Add `client_max_body_size 200M;` for video upload apps.
10. **Private repo cloning in dockerArgs**: If the handler repo is private, `git clone` fails silently (especially with `|| true`). The worker starts but can't find handler.py → `exit code 2` + "No such file or directory". **NEVER hardcode GitHub tokens in dockerArgs**: they are visible in plaintext in the template config, and a leaked token can compromise the entire GitHub account. Best solutions (in order of preference):
    - **Build a Docker image with the handler baked in** (most robust: no cloning, no auth, faster cold starts)
    - Make the repo public if the handler code isn't sensitive
    - Add a fine-grained GitHub PAT (read-only, scoped to a single repo) as a RunPod **environment variable** (e.g., `GH_TOKEN`) and reference it: `git clone https://$GH_TOKEN@github.com/USER/REPO.git ...`
11. **Serverless vs Pods**: For sporadic/bursty workloads (video processing, image gen), serverless is almost always the right choice (pay-per-job, auto-scale, no idle costs). Only use persistent pods for sustained inference or when you need SSH debugging access. Don't switch architectures just because serverless isn't working; diagnose the actual issue first.

## Debugging Worker Failures

Common error: `worker exited with exit code 2`. This usually means the Python handler script wasn't found.

**Diagnosis checklist:**

1. Check RunPod endpoint logs for the actual error message
2. Verify the `dockerArgs` or Dockerfile copies/clones the handler to the right path
3. If using `git clone` in dockerArgs, test the clone URL manually (private repo? auth needed?)
4. The `|| true` pattern in bash masks clone failures; remove it temporarily to see errors
5. Compare the path in the CMD/exec line with where the handler actually ends up

**Querying template config via API** (to inspect dockerArgs remotely):

```bash
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
```

### Pre-built Docker Image vs Runtime dockerArgs

**Prefer building a Docker image** with the handler baked in over cloning repos at runtime in `dockerArgs`. The runtime approach is fragile:

- Private repos fail silently with `|| true`
- Runtime `pip install` and `git clone` add minutes to every cold start
- `dockerArgs` bash scripts are hard to debug and maintain

**Better approach:**

1. Write a proper Dockerfile that COPYs handler.py and installs deps at build time
2. Build and push to Docker Hub: `docker build -t user/worker:latest . && docker push user/worker:latest`
3. Update the RunPod template to use that image with empty `dockerArgs`
4. The Dockerfile CMD is the single source of truth for the entrypoint

This gives faster cold starts, no auth issues, and reproducible builds. The worker then only needs the public model repos (e.g. MoCha); your handler code is baked into the image, so no private repo cloning is needed on the worker.

### GPU Availability and Network Volumes

**Network volumes pin workers to a specific datacenter.** If the GPU tier you selected isn't available in that datacenter, workers will simply never start. The health endpoint shows 0 workers across all states (not even initializing).

**Symptoms of GPU unavailability:**

- Jobs sit IN_QUEUE indefinitely with massive `delayTime` (hours)
- Health shows: `workers: {idle: 0, initializing: 0, ready: 0, running: 0, unhealthy: 0}` despite queued jobs
- No error messages, just silent non-scheduling
- Workers may briefly show `throttled: 1`, which means RunPod tried to provision but no capacity exists

**Fixes:**

1. Add ALL compatible GPU tiers: `gpuIds: "AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO"` (comma-separated, dramatically increases availability)
2. Try removing the network volume temporarily to allow any datacenter: `networkVolumeId: ""` in saveEndpoint
3. Create a new network volume in a busier datacenter. Use trial and error with the `createNetworkVolume` mutation (not all datacenters have storage clusters)
4. Check GPU availability: `{ gpuTypes { id displayName memoryInGb secureCloud communityCloud } }`
5. Available GPU tiers for serverless: `AMPERE_16`, `AMPERE_24`, `ADA_24`, `AMPERE_48`, `ADA_48_PRO`, `AMPERE_80`, `ADA_80_PRO`

**Worker health states explained:**

- `initializing` → container starting, pulling image
- `running` → either processing a job OR still in startup code (model downloads happen BEFORE `runpod.serverless.start()`)
- `ready` → idle, waiting for jobs (handler started, model download complete)
- `unhealthy` → worker crashed. Common causes: OOM from too-small GPU, disk space exhaustion (exit code 139 = SIGSEGV, usually disk full)
- `throttled` → RunPod tried to provision a GPU but none was available in the pinned datacenter

**CRITICAL: "running" does NOT mean "processing jobs"**

When a worker shows `running` but jobs stay `IN_QUEUE`, the worker is still in its startup/initialization phase (downloading models, installing deps). The job won't move to `IN_PROGRESS` until `runpod.serverless.start()` is called. For large model downloads (~28GB), this can take 10-15+ minutes.

### Network Volume Mount Path

**On RunPod Serverless, network volumes mount at `/runpod-volume/`, NOT `/workspace/`.** This is a critical difference from RunPod Pods, where volumes mount at `/workspace/`.
If your env vars point model storage to `/workspace/`, models will download to the container disk instead of the persistent volume, causing:

- Disk space exhaustion (container disk is typically 20-50GB, models can be 30GB+)
- Worker crashes with exit code 139 (SIGSEGV from disk full)
- Models re-downloading on every cold start (defeating the purpose of the volume)

**Always set storage paths to `/runpod-volume/` for serverless:**

```
MODEL_DIR=/runpod-volume/mocha-models
HF_HOME=/runpod-volume/huggingface
```

### Container Disk Sizing

The container disk must fit the **unpacked** Docker image layers (2-3x the compressed size) PLUS any runtime files. Docker's overlay2 filesystem needs space for all layers unpacked.

**Critical: compressed image size ≠ disk space needed.** A 16GB compressed image can need 30-40GB unpacked. The error `no space left on device` during container startup (before any code runs) means the container disk is too small for just the image layers.

Sizing guide:

- **With network volume**: container disk = `(compressed image × 2.5) + 20GB buffer`. Models go on the volume.
- **Without network volume**: container disk = `(compressed image × 2.5) + models + 20GB buffer`

Example with 16GB compressed image + 28GB models:

- With volume: the formula gives about 60GB (40GB unpacked image + 20GB buffer); 100GB leaves comfortable headroom
- Without volume: the formula gives about 88GB (40GB image + 28GB models + 20GB buffer); 200GB leaves comfortable headroom

**When in doubt, use 200GB container disk.** The cost difference is negligible for serverless (you only pay when running), and debugging disk space failures is painful because RunPod gives minimal error info.

**Always prefer a network volume for large models.** Container disk alone is fragile for 20GB+ model downloads.

### Updating Templates vs Releases

**Updating a template via `saveTemplate` does NOT automatically update the active endpoint release.** You must also call `saveEndpoint` (with the endpoint `id`) to trigger a new release that picks up the template changes.
The RunPod UI shows a release history. Check it to verify your changes propagated.

```bash
# Step 1: Update template
saveTemplate(input: { id: "TEMPLATE_ID", imageName: "new-image:latest", ... })

# Step 2: Trigger new release on endpoint
saveEndpoint(input: { id: "ENDPOINT_ID", name: "...", templateId: "TEMPLATE_ID", ... })
```

### Queue Management

```bash
# Purge all queued jobs
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
Response: {"removed": N, "status": "completed"}

# Cancel a specific job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/cancel/{JOB_ID}
```

Always purge stale jobs after fixing a worker issue. Old jobs submitted under a broken config will likely fail anyway.

**App-side job tracking pitfall:** If your web app tracks jobs in-memory and you purge RunPod's queue via API, the app still shows jobs as IN_QUEUE. The status poll (`GET /status/{JOB_ID}`) will return `{"error": "request does not exist"}` for purged jobs. Handle this in your polling logic by marking such jobs as CANCELLED.

### Endpoint Goes 404 After Balance Runs Out

**When the RunPod account balance goes negative, serverless endpoints become unreachable.** The REST API (`api.runpod.ai/v2/{ENDPOINT_ID}/*`) returns `{"message": "Not Found"}` for ALL operations (health, run, status). The GraphQL API still shows the endpoint exists. This is NOT a misconfiguration; it is RunPod disabling the endpoint due to unpaid balance.
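
Both failure responses described above (a purged job and a disabled endpoint) can be folded into the app's polling logic. The following is a minimal sketch, assuming the response bodies look exactly as quoted in this document. `classify_status_response` and the `ENDPOINT_UNAVAILABLE` / `UNKNOWN` labels are hypothetical app-side names, not part of the RunPod API.

```python
from typing import Any, Mapping

# Statuses listed in the Job Status section of this document.
KNOWN_STATUSES = {"IN_QUEUE", "IN_PROGRESS", "COMPLETED", "FAILED", "CANCELLED"}


def classify_status_response(body: Mapping[str, Any]) -> str:
    """Map a parsed /status/{JOB_ID} response body onto an app-side job state."""
    # Purged job: RunPod no longer tracks it, so treat it as cancelled.
    if "request does not exist" in str(body.get("error", "")):
        return "CANCELLED"
    # Whole endpoint unreachable (e.g. negative balance): not a per-job state
    # (ENDPOINT_UNAVAILABLE is a hypothetical label).
    if body.get("message") == "Not Found":
        return "ENDPOINT_UNAVAILABLE"
    status = body.get("status")
    if status in KNOWN_STATUSES:
        return status
    # Anything unrecognised is surfaced rather than guessed at.
    return "UNKNOWN"
```

Keeping the classification separate from the HTTP call makes it easy to unit test, and lets the app stop polling every job at once when the endpoint itself is gone.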
**Even after adding funds, the old endpoint stays broken.** The fix is to delete and recreate:

```bash
# Delete broken endpoint
{"query": "mutation { deleteEndpoint(id: \"OLD_ENDPOINT_ID\") }"}

# Create fresh one with same config
{"query": "mutation { saveEndpoint(input: { name: \"my-endpoint\", templateId: \"TEMPLATE_ID\", networkVolumeId: \"VOLUME_ID\", gpuIds: \"AMPERE_48,ADA_48_PRO,AMPERE_80,ADA_80_PRO\", workersMin: 0, workersMax: 1, idleTimeout: 5 }) { id name } }"}
```

**After recreating**, update the endpoint ID everywhere it's referenced (web app config, environment variables, etc.) and restart the app.

**Ongoing storage costs**: Network volumes cost ~$0.047/hr per 50GB ($34/month) even when no workers are running. Monitor `currentSpendPerHr`: if it's non-zero with no active pods, that's the volume storage cost slowly draining the balance.

### Model Size Verification: ALWAYS Check Before Provisioning

**NEVER trust model size estimates in comments or documentation.** Always verify the actual download size before choosing network volume or container disk sizes:

```bash
# Check actual model size via HuggingFace API
# (lists the top level of the repo only; files inside subfolders are not counted)
curl -s "https://huggingface.co/api/models/ORG/MODEL/tree/main" | python3 -c "
import sys, json
files = json.load(sys.stdin)
total = 0
for f in files:
    sz = f.get('size', 0)
    if sz > 100_000_000:
        print(f'  {f[\"path\"]:60s} {sz/1e9:.2f} GB')
    total += sz
print(f'Total: {total/1e9:.2f} GB ({len(files)} files)')
"
```

**Real-world example**: Wan2.1-T2V-14B was documented as "~26GB" but is actually **69GB**. This made a 50GB network volume insufficient, leading to silent crash loops.

**HuggingFace download doubles disk usage**: `huggingface_hub` downloads to the `HF_HOME` cache first, then copies/symlinks to `local_dir`. If `HF_HOME` and `local_dir` are on the same volume AND the FS doesn't support reflinks, you need **2x the model size** in free space.
Set `HF_HOME` to the same volume as `local_dir` and use `snapshot_download(local_dir=..., local_dir_use_symlinks=True)` to avoid doubling. (Recent `huggingface_hub` releases deprecate `local_dir_use_symlinks` and write directly into `local_dir`, so check the behavior of your installed version.)

### Worker Crash Loop Detection

**Symptom**: The worker shows `running=1` in health, but jobs never move from `IN_QUEUE` to `IN_PROGRESS`. The worker periodically flips between `running` and `idle/ready`.

**What's happening**: The handler's `__main__` block runs `ensure_models()` (or similar startup code) BEFORE calling `runpod.serverless.start()`. If the startup code fails (disk full, OOM, download error), the process crashes. RunPod auto-restarts the worker, which tries again and fails again: an infinite crash loop.

**Key indicators**:

- `delayTime` on jobs grows to hundreds of thousands or millions of ms (minutes to hours)
- `currentSpendPerHr` stays at just the storage rate (no GPU billing = workers aren't running long enough to bill)
- Health periodically shows `ready=1, idle=1` (worker restarted, hasn't crashed yet) then goes back to `running=1` (trying startup code again)
- Zero completed AND zero failed jobs (handler never registered)

**Diagnosis steps**:

1. Check RunPod UI logs (the only way to see worker stdout/stderr)
2. Calculate total model size vs volume/disk size (most common cause)
3. If models > volume size, the download fills the volume, the worker crashes, the volume is cleaned up on restart, and the cycle repeats
4. Cancel all queued jobs. They'll never complete under a broken worker

### Exit Code Reference

| Code | Signal | Common Cause |
|------|--------|-------------|
| 1 | - | Python exception, missing module |
| 2 | - | File not found (handler.py missing) |
| 137 | SIGKILL | OOM killed (GPU VRAM or system RAM) |
| 139 | SIGSEGV | Segfault, usually disk full during large writes |

### Debugging Workflow (recommended order)

1. Check endpoint logs in RunPod UI (Logs tab) for actual error messages
2. Check health endpoint for worker states: `throttled` = no GPU capacity, `unhealthy` = crashed
3. Check `containerDiskInGb`: if the worker crashes during startup, it's almost always disk space
4. If workers never appear (all zeros), it's GPU availability. Add more GPU tiers or remove the network volume
5. If the worker shows `running` but jobs stay `IN_QUEUE` for 15+ min, the startup code (model download) may be hanging or failing silently

## Monitoring & Status Checks

### Full Infrastructure Audit

To check everything at once, query these in parallel:

```bash
# 1. List all endpoints with config
{"query": "{ myself { endpoints { id name templateId workersMin workersMax gpuIds networkVolumeId idleTimeout } } }"}

# 2. List all pods (persistent)
{"query": "{ myself { pods { id name desiredStatus imageName volumeInGb volumeMountPath } } }"}

# 3. Check balance
{"query": "{ myself { clientBalance } }"}
```

### GraphQL Field Gotchas

These fields DO NOT EXIST (despite seeming logical):

- `myself.serverlessWorkers`: no such field on User type
- `endpoint.jobs`: no such field on Endpoint type (use REST API for job status)
- `endpoint.workers`: no such field (use REST health endpoint for worker counts)
- `myself.serverlessTemplates`: no such field (templates are queried via `endpoint.template`)

To get endpoint template details:

```bash
{"query": "{ myself { endpoints { id name template { id name imageName dockerArgs env { key value } } } } }"}
```

### REST API Endpoints (for jobs + health)

**IMPORTANT**: The job/health REST API uses `api.runpod.ai`, NOT `api.runpod.io` (GraphQL).

```bash
# Health check (worker counts)
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/health
Headers: Authorization: Bearer {API_KEY}

# Submit job
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run

# Check job status
GET https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{JOB_ID}

# Purge queue
POST https://api.runpod.ai/v2/{ENDPOINT_ID}/purge-queue
```

**Note**: The health endpoint may return `{"message": "Not Found"}` if the endpoint has never had a successful worker start.
This doesn't mean the endpoint is misconfigured. It means no worker has ever registered.

### Checking Network Volume Contents

**You cannot browse network volume contents remotely when all pods are stopped.** The only ways to verify what's on a volume:

1. Start a pod with the volume attached and SSH in
2. Trigger a serverless cold start (submit a test job) and check the worker logs
3. The download_models.py pattern (check file existence → skip if present) serves as implicit verification on each cold start

### Docker Hub Image Verification

Check if your worker image was pushed correctly:

```bash
curl -s "https://hub.docker.com/v2/repositories/{USER}/{REPO}/tags/?page_size=5" | python3 -m json.tool
```

Key fields: `full_size` (compressed), `last_updated`. Compare the compressed size to what you expect, e.g. an 8.7GB compressed image with just code+deps (no baked models) vs ~30GB+ with models baked in.

## Useful Queries

```bash
# Check balance
{"query": "{ myself { clientBalance } }"}

# List endpoints
{"query": "{ myself { endpoints { id name gpuIds workersMin workersMax } } }"}

# Update template (include id field)
{"query": "mutation { saveTemplate(input: { id: \"TEMPLATE_ID\", ... }) { id } }"}
```

## Session-specific management/debugging notes absorbed from narrower RunPod skills

### Template updates

- The mutation is `saveTemplate`, not `updateTemplate`.
- Even partial template edits still require fields such as `containerDiskInGb`, `volumeInGb`, and `env`.
- For custom images with a real CMD/ENTRYPOINT, set `dockerArgs: ""` so the image entrypoint is not overridden.
- Updating a template alone is not enough; trigger a fresh endpoint release with `saveEndpoint` after template changes.

### Crash-loop and storage diagnosis

- If workers look alive but jobs stay `IN_QUEUE`, suspect startup code before `runpod.serverless.start()`: often model downloads, disk exhaustion, or failed repo/bootstrap logic.
- Verify actual model sizes before provisioning.
Documentation estimates were sometimes dramatically wrong.
- Remember that HuggingFace download/cache behavior can temporarily require close to 2x model size unless cache/local-dir choices avoid duplication.
- Network volumes on serverless mount at `/runpod-volume/`, not `/workspace/`.
- Container disk must fit unpacked image layers plus runtime writes; compressed image size is a misleading lower bound.

### Operational failure modes worth checking early

- Negative account balance can make serverless REST endpoints return `Not Found` even though GraphQL still shows the endpoint.
- If GPU availability is poor in a pinned datacenter, adding more compatible GPU tiers is often the fastest fix.
- When using runtime `git clone` in `dockerArgs`, private-repo auth failures can masquerade as missing-handler exit code 2 problems. Prefer baking the handler into the image.
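
The health-based checks above can be turned into a first-pass triage helper. This is a minimal sketch, assuming the `workers` counts from the health endpoint have the keys shown in this document and that the number of queued jobs is tracked by the caller. `triage_health` and its return labels are hypothetical names, and the result is only a hint for where to look next.

```python
from typing import Mapping


def triage_health(workers: Mapping[str, int], queued_jobs: int) -> str:
    """Return a first guess at the failure mode from health worker counts."""

    def count(key: str) -> int:
        return int(workers.get(key, 0))

    # A crashed worker: check logs, container disk size, and GPU memory.
    if count("unhealthy") > 0:
        return "CRASHED"
    # RunPod tried to provision but found no capacity in the pinned datacenter.
    if count("throttled") > 0:
        return "NO_GPU_CAPACITY"
    # Jobs are waiting but no worker exists in any state: silent non-scheduling.
    if queued_jobs > 0 and not any(int(v) for v in workers.values()):
        return "NOT_SCHEDULED"
    # "running" with queued jobs may mean startup code is still executing;
    # only suspicious if it persists (the document suggests 15+ minutes).
    if queued_jobs > 0 and count("running") > 0:
        return "CHECK_STARTUP"
    return "OK"
```

For example, `triage_health({"idle": 0, "initializing": 0, "ready": 0, "running": 0, "unhealthy": 0}, queued_jobs=3)` yields `NOT_SCHEDULED`, which points at GPU availability rather than handler code.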