RunPod Model Storage Debugging

Trigger

Use when a RunPod serverless endpoint shows unhealthy workers, jobs stuck in queue, or workers that start then immediately exit/crash.

Common Causes

1. Insufficient Account Balance

Symptoms:
- Workers fail to initialize
- Jobs stay IN_QUEUE until they expire
- Account balance is negative or near zero

Fix:

# Check balance on RunPod dashboard
# Add funds (even $5-10 for testing)

2. Network Volume Too Small for Models

Symptoms:
- Worker starts, begins downloading models, then crashes
- Restart loop (worker exits, respawns, repeats)
- Logs show download progress, then sudden termination

Critical Fact: Model sizes quoted in documentation are often understated. Always verify the actual download size on HuggingFace:
- Wan2.1-T2V-14B: ~69GB (NOT 26GB as some docs claim)
- T5-XXL encoder: ~11GB (separate download)
- MoCha checkpoint: ~2GB
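One way to verify the actual download size is to sum the per-file sizes that the HuggingFace Hub reports for a repo. The sketch below assumes the `huggingface_hub` package is installed; the helper names `total_size_gb` and `repo_files` are hypothetical, and the repo ID in the usage note is an assumption to replace with the repo you actually deploy.

```python
from typing import Iterable, List, Optional, Tuple


def total_size_gb(files: Iterable[Tuple[str, Optional[int]]]) -> float:
    """Sum file sizes given in bytes and return the total in GB (10**9 bytes)."""
    # Files with no reported size are counted as zero.
    return sum(size or 0 for _, size in files) / 1e9


def repo_files(repo_id: str) -> List[Tuple[str, Optional[int]]]:
    """List (filename, size_in_bytes) for every file in a HuggingFace model repo."""
    # Imported here so total_size_gb() works without the package installed.
    from huggingface_hub import HfApi

    # files_metadata=True asks the Hub to include per-file sizes.
    info = HfApi().model_info(repo_id, files_metadata=True)
    return [(s.rfilename, s.size) for s in info.siblings]
```

For example, `total_size_gb(repo_files("Wan-AI/Wan2.1-T2V-14B"))` would return the full repo size in GB. Gated or private repos additionally need a HuggingFace token configured in the environment.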

Diagnosis:
1. Create a temporary pod attached to the network volume
2. Check available space: df -h /workspace
3. Compare to the total size of the models needed
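The comparison in the last diagnosis step can be sketched in Python and run from the temporary pod. The sizes are the ones listed in this document; the 20% headroom factor and the `volume_report` helper name are assumptions, not RunPod recommendations.

```python
import shutil

# Sizes in GB as listed in this document; verify them against the actual repos.
MODEL_SIZES_GB = {
    "Wan2.1-T2V-14B": 69,
    "T5-XXL encoder": 11,
    "MoCha checkpoint": 2,
}


def volume_report(path: str, required_gb: float, headroom: float = 1.2) -> dict:
    """Compare free space at `path` with the space the models need.

    `headroom` is an assumed safety factor for temporary files created
    during download. Sizes use GB = 10**9 bytes, so the numbers differ
    slightly from `df -h`, which reports powers of 1024.
    """
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    needed_gb = required_gb * headroom
    return {
        "total_gb": round(usage.total / 1e9, 1),
        "free_gb": round(free_gb, 1),
        "needed_gb": round(needed_gb, 1),
        "fits": free_gb >= needed_gb,
    }
```

Calling `volume_report("/workspace", sum(MODEL_SIZES_GB.values()))` checks all three models (82 GB in total, or 71 GB without the T5 encoder). If some models are already on the volume, subtract their size from `required_gb` first, since free space already accounts for them.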

Solutions:
- Increase network volume size (100GB+ for 71GB models)
- Use a smaller model variant (e.g., Wan2.1-T2V-1.3B instead of 14B)
- Bake models into the Docker image (no volume needed, but ~80GB image)
- Skip unnecessary model components (e.g., T5 encoder if not used)

3. Docker Image Issues

Check:
- Image exists on Docker Hub: docker pull firemountain/your-image:latest
- Image size matches expectations (code+deps only vs. models baked in)
- ENTRYPOINT and CMD are correct
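Pulling a large image just to confirm it exists can be slow. As a lighter alternative, the sketch below shells out to `docker manifest inspect`, which queries the registry without downloading layers. It assumes a recent Docker CLI on the PATH; `image_exists` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without Docker.

```python
import subprocess
from typing import Callable


def image_exists(
    image: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True if the registry serves a manifest for `image`."""
    # The command exits non-zero when the tag is missing or access is denied.
    result = run(
        ["docker", "manifest", "inspect", image],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

For example, `image_exists("firemountain/your-image:latest")` returns False both for a missing tag and for a private image when `docker login` has not been run, so a False result is a prompt to check credentials as well as the tag name.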

Debugging Workflow

Check account balance - RunPod Dashboard → Billing
Check endpoint config - verify network volume attached, GPU types, min/max workers
Check worker logs - Serverless → Endpoint → Logs tab
Check network volume - create temp pod, inspect /workspace contents
Verify model sizes - check HuggingFace repo files, sum actual sizes
Test with minimal job - submit hello-world to verify infrastructure works
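The last step can be sketched with the serverless REST endpoints (`/health`, `/run`, `/status/{id}`) using only the standard library. The helper names and the `{"prompt": "hello"}` payload are hypothetical; the input must match whatever the deployed handler expects.

```python
import json
import time
import urllib.request

API_BASE = "https://api.runpod.ai/v2"


def build_request(endpoint_id, path, api_key, payload=None):
    """Build a GET request, or a POST request when a JSON payload is given."""
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/{path}",
        data=data,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST" if data is not None else "GET",
    )


def call(endpoint_id, path, api_key, payload=None, timeout=30):
    """Send the request and decode the JSON response."""
    req = build_request(endpoint_id, path, api_key, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def smoke_test(endpoint_id, api_key, poll_seconds=5, max_polls=60):
    """Print worker and queue counts, submit one small job, poll until it leaves the queue."""
    print("health:", call(endpoint_id, "health", api_key))
    # Hypothetical input; replace with what the handler accepts.
    status = call(endpoint_id, "run", api_key, {"input": {"prompt": "hello"}})
    job_id = status["id"]
    for _ in range(max_polls):
        status = call(endpoint_id, f"status/{job_id}", api_key)
        if status["status"] not in ("IN_QUEUE", "IN_PROGRESS"):
            break
        time.sleep(poll_seconds)
    return status
```

A job that never leaves IN_QUEUE while `/health` reports no running or idle workers suggests the workers themselves are failing to start, which points back to balance, volume size, or the image rather than to the handler code.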

Cost Notes

Network volume: ~$0.047/hr per 50GB = ~$34/month (charged even when idle)
GPU workers: $0.69-2.49/hr depending on GPU type (only charged when running)
Serverless idle timeout: 5s default; raise it, or set active (min) workers above 0, to keep a worker warm (warm time is billed)
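The hourly-to-monthly conversion in these notes is simple arithmetic. The sketch below reproduces it using the hourly figures quoted above as inputs, which should be checked against current RunPod pricing. The 30-day month and the `monthly_cost` helper name are assumptions.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost in dollars for `hours` of usage at `hourly_rate` dollars per hour."""
    return hourly_rate * hours


# Network volume: billed for every hour of the month, idle or not.
volume_cost = monthly_cost(0.047)

# GPU worker: billed only for hours actually running (10 hours as an example).
gpu_cost_low = monthly_cost(0.69, hours=10)
gpu_cost_high = monthly_cost(2.49, hours=10)
```

With the quoted rate, `volume_cost` comes to roughly $34, matching the figure above, while ten GPU hours range from about $7 to about $25 depending on GPU type.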

Pitfalls

Never assume model file sizes from README comments - verify on HuggingFace
Network volumes are region-locked (US-TX-3 can only attach to US-TX-3 pods)
Serverless endpoints scale to 0 when idle, so the first cold start includes model download time unless the models are already on the network volume
An unhealthy worker flag often means a single cold start failed, not a persistent issue