---
name: runpod-model-storage-debugging
category: mlops
description: Debug RunPod serverless endpoint crash loops caused by model download failures and storage constraints.
---

# RunPod Model Storage Debugging

## Trigger

Use when a RunPod serverless endpoint shows unhealthy workers, jobs stuck in queue, or workers that start and then immediately exit or crash.

## Common Causes

### 1. Insufficient Account Balance

**Symptoms:**
- Workers fail to initialize
- Jobs stay IN_QUEUE until they expire
- Account balance is negative or near zero

**Fix:**
```bash
# Check balance on RunPod dashboard
# Add funds (even $5-10 for testing)
```

### 2. Network Volume Too Small for Models

**Symptoms:**
- Worker starts, begins downloading models, then crashes
- Restart loop (worker exits, respawns, repeats)
- Logs show download progress, then sudden termination

**Critical Fact:** Model sizes on HuggingFace are often understated in documentation. Always verify the actual download size:
- Wan2.1-T2V-14B: ~69GB (NOT 26GB as some docs claim)
- T5-XXL encoder: ~11GB (separate download)
- MoCha checkpoint: ~2GB

**Diagnosis:**
1. Create a temporary pod attached to the network volume
2. Check available space: `df -h /workspace`
3. Compare the free space to the total size of the models you need

**Solutions:**
- Increase network volume size (100GB+ for 71GB models)
- Use a smaller model variant (e.g., Wan2.1-T2V-1.3B instead of 14B)
- Bake models into the Docker image (no volume needed, but ~80GB image)
- Skip unnecessary model components (e.g., T5 encoder if not used)

### 3. Docker Image Issues

**Check:**
- Image exists on Docker Hub: `docker pull firemountain/your-image:latest`
- Image size matches expectations (code+deps only vs. models baked in)
- ENTRYPOINT and CMD are correct

## Debugging Workflow

1. **Check account balance** - RunPod Dashboard → Billing
2. **Check endpoint config** - verify network volume attached, GPU types, min/max workers
3. **Check worker logs** - Serverless → Endpoint → Logs tab
4. **Check network volume** - create temp pod, inspect `/workspace` contents
5. **Verify model sizes** - check HuggingFace repo files, sum actual sizes
6. **Test with minimal job** - submit hello-world to verify infrastructure works

## Cost Notes

- Network volume: ~$0.047/hr per 50GB = ~$34/month (charged even when idle)
- GPU workers: $0.69-2.49/hr depending on GPU type (only charged when running)
- Serverless idle timeout: 5s default; it controls how long a worker stays up after finishing a job, so raise it (or set min workers above 0) to keep a worker warm, and expect to pay for that idle time

## Pitfalls

- Never assume model file sizes from README comments - verify on HuggingFace
- Network volumes are region-locked (a US-TX-3 volume can only attach to US-TX-3 pods)
- Serverless endpoints scale to 0 when idle - the first cold start includes model download time
- An unhealthy worker flag often means a cold start failed, not a persistent issue
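
The diagnosis under "Network Volume Too Small for Models" comes down to arithmetic: sum the real model sizes, add some headroom, and compare the result with the free space on the volume. A minimal sketch using the sizes listed in that section; the 20% headroom, the helper names, and the `/workspace` default are illustrative assumptions, not RunPod recommendations.

```python
import shutil

# Approximate download sizes in GB, copied from the list in this document.
MODEL_SIZES_GB = {
    "Wan2.1-T2V-14B": 69,
    "T5-XXL encoder": 11,
    "MoCha checkpoint": 2,
}


def required_gb(sizes_gb, headroom=0.2):
    """Total model size plus a safety margin (20% is an assumed default)."""
    return sum(sizes_gb.values()) * (1 + headroom)


def fits(free_gb, sizes_gb, headroom=0.2):
    """True if the free space covers the models plus headroom."""
    return free_gb >= required_gb(sizes_gb, headroom)


def free_space_gb(path="/workspace"):
    """Free space at `path` in decimal GB (run this inside the temporary pod)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    need = required_gb(MODEL_SIZES_GB)
    print(f"Need about {need:.0f} GB including headroom")
```

With the three sizes above and the assumed headroom, the requirement comes to roughly 98 GB, which is in line with the 100GB+ suggestion.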
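
For the "Verify model sizes" step, per-file sizes can be read from the Hub instead of trusting a README. The sketch below assumes the `huggingface_hub` package is installed and uses `Wan-AI/Wan2.1-T2V-14B` only as an example repo id; substitute the repo you actually deploy.

```python
def total_gb(file_sizes_bytes):
    """Sum file sizes in bytes and convert to decimal GB, skipping unknowns."""
    return sum(s for s in file_sizes_bytes if s is not None) / 1e9


def repo_size_gb(repo_id):
    """Total size of all files in a Hub model repo (requires network access)."""
    # Imported lazily so total_gb() works without the package installed.
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id, files_metadata=True)
    return total_gb(f.size for f in (info.siblings or []))


# Example (needs network; the repo id is an assumption for illustration):
# print(f"{repo_size_gb('Wan-AI/Wan2.1-T2V-14B'):.1f} GB")
```

The total covers every file in the repo, so it is an upper bound if your download step filters files.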
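
The "Test with minimal job" step can be scripted against the endpoint's HTTP API. This is a sketch using only the standard library; it assumes the `https://api.runpod.ai/v2/<endpoint_id>/run` route with a bearer API key, and the endpoint id, key, and `{"prompt": "hello"}` payload are placeholders that your handler may not accept as-is.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # assumed base URL of the serverless API


def build_run_request(endpoint_id, api_key, job_input):
    """Build (but do not send) an async /run request for a serverless endpoint."""
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/run",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(endpoint_id, api_key, job_input, timeout=30):
    """Send the job and return the parsed JSON response (requires network)."""
    req = build_run_request(endpoint_id, api_key, job_input)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Example (placeholders; needs a real endpoint id and API key):
# print(submit("YOUR_ENDPOINT_ID", "YOUR_API_KEY", {"prompt": "hello"}))
```

If the submitted job never leaves IN_QUEUE, go back to the account balance and worker log checks before suspecting the handler code.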