RunPod Model Storage Debugging

Trigger

Use when a RunPod serverless endpoint shows unhealthy workers, jobs stuck in queue, or workers that start then immediately exit/crash.

Common Causes

1. Insufficient Account Balance

Symptoms:
- Workers fail to initialize
- Jobs stay IN_QUEUE indefinitely, then expire
- Account balance shows negative or near-zero

Fix:

# Check balance on RunPod dashboard
# Add funds (even $5-10 for testing)

2. Network Volume Too Small for Models

Symptoms:
- Worker starts, begins downloading models, then crashes
- Restart loop (worker exits, respawns, repeats)
- Logs show download progress, then sudden termination

Critical Fact: Model sizes on HuggingFace are often understated in documentation. Always verify the actual download size:
- Wan2.1-T2V-14B: ~69GB (NOT 26GB as some docs claim)
- T5-XXL encoder: ~11GB (separate download)
- MoCha checkpoint: ~2GB
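One way to verify the real download size is to sum the per-file sizes that the Hub reports, rather than trusting a README. The sketch below assumes the `huggingface_hub` package is installed and that `model_info(..., files_metadata=True)` returns a size for each file; the helper names `total_size_gb` and `repo_files` are made up for this example, and the repo id in the usage note is an assumption to replace with the repo you actually pull from.

```python
from typing import Iterable, List, Optional, Tuple

GB = 1_000_000_000  # decimal gigabytes, matching the sizes quoted in this document


def total_size_gb(files: Iterable[Tuple[str, Optional[int]]]) -> float:
    """Sum (filename, size_in_bytes) pairs and return the total in GB.

    Files whose size is unknown (None) are counted as 0 bytes.
    """
    return sum(size or 0 for _name, size in files) / GB


def repo_files(repo_id: str) -> List[Tuple[str, Optional[int]]]:
    """List (filename, size_in_bytes) for every file in a Hub model repo.

    Requires network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import HfApi  # imported lazily so the summing helper works offline

    info = HfApi().model_info(repo_id, files_metadata=True)
    return [(s.rfilename, s.size) for s in info.siblings]
```

Usage, assuming the repo id is correct for your model: `print(round(total_size_gb(repo_files("Wan-AI/Wan2.1-T2V-14B")), 1))`. Run it once per repo you download and add the results together before sizing the volume.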

Diagnosis:
  1. Create a temporary pod attached to the network volume
  2. Check available space: df -h /workspace
  3. Compare to the total size of the models needed
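The comparison in the last diagnosis step can be scripted from inside the temporary pod. This is a minimal sketch, not a RunPod tool: `free_gb` and `fits` are hypothetical helpers, and the 20% headroom for temporary download files is an assumption you may want to adjust.

```python
import shutil

GB = 1_000_000_000  # decimal GB; note that `df -h` reports in powers of 1024


def free_gb(path: str) -> float:
    """Return free space on the filesystem containing `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def fits(free: float, required: float, headroom: float = 0.2) -> bool:
    """True if `free` GB covers `required` GB plus a fractional safety margin."""
    return free >= required * (1 + headroom)
```

Usage inside the pod: `fits(free_gb("/workspace"), 71)` tells you whether roughly 71GB of models would fit with the assumed margin.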

Solutions:
- Increase network volume size (100GB+ for 71GB models)
- Use a smaller model variant (e.g., Wan2.1-T2V-1.3B instead of 14B)
- Bake models into the Docker image (no volume needed, but ~80GB image)
- Skip unnecessary model components (e.g., T5 encoder if not used)

3. Docker Image Issues

Check:
- Image exists on Docker Hub: docker pull firemountain/your-image:latest
- Image size matches expectations (code+deps only vs. models baked in)
- ENTRYPOINT and CMD are correct

Debugging Workflow

  1. Check account balance - RunPod Dashboard → Billing
  2. Check endpoint config - verify network volume attached, GPU types, min/max workers
  3. Check worker logs - Serverless → Endpoint → Logs tab
  4. Check network volume - create temp pod, inspect /workspace contents
  5. Verify model sizes - check HuggingFace repo files, sum actual sizes
  6. Test with minimal job - submit hello-world to verify infrastructure works
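The minimal job in the last step can be submitted over HTTP. The sketch below assumes the serverless REST route `https://api.runpod.ai/v2/<endpoint_id>/run` with a Bearer API key, and a handler that accepts a `prompt` field; the endpoint id, key, and payload shape are placeholders, so check them against your own endpoint before relying on this.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # assumed base URL for serverless endpoints


def build_run_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an async /run request wrapping `payload` as the job input."""
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/run",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response (needs network access)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Usage: `submit(build_run_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", {"prompt": "hello"}))`. If even this trivial job never leaves the queue, the problem is more likely balance, endpoint config, or the image than the model itself.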

Cost Notes

Pitfalls