Use when a RunPod serverless endpoint shows unhealthy workers, jobs stuck in queue, or workers that start then immediately exit/crash.
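A quick way to tell these cases apart is to look at the endpoint's health summary before digging further. The sketch below is illustrative only: it assumes the endpoint exposes a `/health` route at `https://api.runpod.ai/v2/<endpoint_id>/health` and that the response carries `jobs` and `workers` counters named as shown. Those field names, and the `fetch_health` and `triage` helpers, are assumptions to check against the current RunPod API reference.

```python
import json
import urllib.request


def fetch_health(endpoint_id, api_key):
    """Fetch the endpoint health summary (URL pattern is an assumption)."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def triage(health):
    """Map a health payload to the symptoms this guide covers.

    The keys read here ("inQueue", "idle", "running", "unhealthy")
    are assumed field names; missing keys are treated as zero.
    """
    jobs = health.get("jobs", {})
    workers = health.get("workers", {})
    queued = jobs.get("inQueue", 0)
    live = workers.get("idle", 0) + workers.get("running", 0)

    findings = []
    if workers.get("unhealthy", 0) > 0:
        findings.append("unhealthy workers")
    if queued > 0 and live == 0:
        findings.append("jobs queued with no live workers")
    return findings
```

Calling `triage(fetch_health(endpoint_id, api_key))` returns an empty list when nothing stands out, so it can be run repeatedly while changing one setting at a time.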
Symptoms:
- Workers fail to initialize
- Jobs stay IN_QUEUE forever, then expire
- Account balance shows negative or near-zero
Fix:
1. Check the balance on the RunPod dashboard
2. Add funds (even $5-10 is enough for testing)
Symptoms:
- Worker starts, begins downloading models, then crashes
- Restart loop (worker exits, respawns, repeats)
- Logs show download progress, then sudden termination
Critical Fact: Documentation often understates model sizes on HuggingFace. Always verify the actual download size:
- Wan2.1-T2V-14B: ~69GB (NOT 26GB as some docs claim)
- T5-XXL encoder: ~11GB (separate download)
- MoCha checkpoint: ~2GB
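One way to verify a download size is to sum the per-file sizes the Hub reports for a repository. This is a minimal sketch, assuming the third-party `huggingface_hub` package is installed; `total_size_gb` and `repo_size_gb` are hypothetical helper names, and the repo ID is whatever repository the worker actually pulls from.

```python
def total_size_gb(file_sizes_bytes):
    """Sum file sizes in bytes and return decimal gigabytes.

    Entries that are None (size not reported) are skipped.
    """
    return sum(s for s in file_sizes_bytes if s is not None) / 1e9


def repo_size_gb(repo_id):
    """Total size of all files in a Hub model repo, in GB."""
    # Imported here so the pure helper above works without the package.
    from huggingface_hub import HfApi

    # files_metadata=True asks the Hub to include per-file sizes.
    info = HfApi().model_info(repo_id, files_metadata=True)
    return total_size_gb(f.size for f in info.siblings)
```

The result counts every file in the repo, so it is an upper bound when the worker downloads only a subset of files.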
Diagnosis:
1. Create a temporary pod attached to the network volume
2. Check available space: df -h /workspace
3. Compare to total model sizes needed
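Steps 2 and 3 can be sketched as a small script run on the temporary pod. The sizes are the approximate figures listed in this guide; the 10% headroom factor and the helper names are assumptions, not RunPod requirements.

```python
import shutil

# Approximate sizes from this guide, in GB.
MODEL_SIZES_GB = {
    "Wan2.1-T2V-14B": 69,
    "T5-XXL encoder": 11,
    "MoCha checkpoint": 2,
}


def free_gb(path="/workspace"):
    """Free space on the filesystem holding `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def space_report(free, sizes_gb, headroom=1.1):
    """Compare free space against total model size plus headroom.

    headroom=1.1 reserves an extra 10% for temporary download
    files; this margin is an assumption.
    """
    needed = sum(sizes_gb.values()) * headroom
    return {
        "needed_gb": round(needed, 1),
        "free_gb": round(free, 1),
        "fits": free >= needed,
    }
```

On the pod, `space_report(free_gb(), MODEL_SIZES_GB)` reports whether the volume can hold everything. Note that `df -h` prints binary units (GiB), so its numbers read slightly lower than the decimal GB used here.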
Solutions:
- Increase network volume size (100GB+ for 71GB models)
- Use a smaller model variant (e.g., Wan2.1-T2V-1.3B instead of 14B)
- Bake models into the Docker image (no volume needed, but ~80GB image)
- Skip unnecessary model components (e.g., T5 encoder if not used)
Check:
- Image exists on Docker Hub: docker pull firemountain/your-image:latest
- Image size matches expectations (code+deps only vs. models baked in)
- ENTRYPOINT and CMD are correct
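The size and ENTRYPOINT/CMD checks can be read from the image metadata in one pass. This sketch assumes the Docker CLI is installed and the image has already been pulled locally; `inspect_image` and `summarize` are hypothetical helper names.

```python
import json
import subprocess


def inspect_image(image):
    """Return the metadata dict from `docker image inspect <image>`."""
    out = subprocess.run(
        ["docker", "image", "inspect", image],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # The command prints a JSON array with one entry per image.
    return json.loads(out)[0]


def summarize(meta):
    """Pull out the fields worth checking: size, ENTRYPOINT, CMD."""
    config = meta.get("Config") or {}
    return {
        "size_gb": round(meta.get("Size", 0) / 1e9, 2),
        "entrypoint": config.get("Entrypoint"),
        "cmd": config.get("Cmd"),
    }
```

For example, `summarize(inspect_image("firemountain/your-image:latest"))` shows at a glance whether the image is a few GB (code and dependencies only) or tens of GB (models baked in), and which command the worker will run at startup.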
/workspace contents