---
name: runpod-model-storage-debugging
category: mlops
description: Debug RunPod serverless endpoint crash loops caused by model download failures and storage constraints.
---

# RunPod Model Storage Debugging

## Trigger
Use when a RunPod serverless endpoint shows unhealthy workers, jobs stuck in queue, or workers that start then immediately exit/crash.

## Common Causes

### 1. Insufficient Account Balance
**Symptoms:**
- Workers fail to initialize
- Jobs stay `IN_QUEUE` until they expire
- Account balance shows negative or near-zero

**Fix:**
```bash
# Check balance on RunPod dashboard
# Add funds (even $5-10 for testing)
```

### 2. Network Volume Too Small for Models
**Symptoms:**
- Worker starts, begins downloading models, then crashes
- Restart loop (worker exits, respawns, repeats)
- Logs show download progress then sudden termination

**Critical Fact:** Model sizes quoted in READMEs and other documentation are often understated. Always verify the actual download size from the file listing of the HuggingFace repo:
- Wan2.1-T2V-14B: ~69GB (NOT 26GB as some docs claim)
- T5-XXL encoder: ~11GB (separate download)
- MoCha checkpoint: ~2GB
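
One way to verify a repo's real size is to sum the per-file sizes reported by the Hub rather than trusting a README. The sketch below assumes the `huggingface_hub` package is installed and that the repo is public. The helper names `total_size_gb` and `repo_size_gb` are illustrative, not part of any library.

```python
from typing import Iterable, Optional

GB = 10**9  # decimal gigabytes, as shown on most download pages


def total_size_gb(file_sizes: Iterable[Optional[int]]) -> float:
    """Sum per-file sizes given in bytes, skipping files whose size is unknown (None)."""
    return sum(size for size in file_sizes if size is not None) / GB


def repo_size_gb(repo_id: str) -> float:
    """Return the total size of all files in a Hub model repo, in GB.

    Needs `pip install huggingface_hub` and network access.
    """
    # Imported lazily so total_size_gb() stays usable without the package.
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id, files_metadata=True)
    return total_size_gb(sibling.size for sibling in info.siblings)
```

Calling `repo_size_gb("Wan-AI/Wan2.1-T2V-14B")` (the repo id here is an assumption; use the id of the repo you actually download) gives the figure to compare against the volume size.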

**Diagnosis:**
1. Create a temporary pod attached to the network volume
2. Check available space: `df -h /workspace`
3. Compare to total model sizes needed
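
The comparison in step 3 can be scripted from inside the temporary pod. This is a minimal sketch: the sizes are the figures listed above, the 10GB headroom for temporary download files is an assumption, and `free_gb` and `shortfall_gb` are hypothetical helper names.

```python
import shutil

GB = 10**9

# Sizes in GB, taken from the list above; drop any component you do not download.
MODELS_GB = {"Wan2.1-T2V-14B": 69, "T5-XXL encoder": 11, "MoCha checkpoint": 2}


def free_gb(path: str) -> float:
    """Free space on the filesystem that contains `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def shortfall_gb(free: float, model_sizes_gb: dict, headroom_gb: float = 10.0) -> float:
    """Return how many GB are missing for the models plus headroom (0.0 if they fit)."""
    needed = sum(model_sizes_gb.values()) + headroom_gb
    return max(0.0, needed - free)
```

On the pod, `shortfall_gb(free_gb("/workspace"), MODELS_GB)` returns 0.0 when the volume has room and the missing amount otherwise.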

**Solutions:**
- Increase network volume size (e.g., 100GB+ for ~71GB of models)
- Use smaller model variant (e.g., Wan2.1-T2V-1.3B instead of 14B)
- Bake models into Docker image (no volume needed, but ~80GB image)
- Skip unnecessary model components (e.g., T5 encoder if not used)

### 3. Docker Image Issues
**Check:**
- Image exists on Docker Hub: `docker pull firemountain/your-image:latest`
- Image size matches expectations (code+deps only vs. models baked in)
- ENTRYPOINT and CMD are correct

## Debugging Workflow

1. **Check account balance** - RunPod Dashboard → Billing
2. **Check endpoint config** - verify network volume attached, GPU types, min/max workers
3. **Check worker logs** - Serverless → Endpoint → Logs tab
4. **Check network volume** - create temp pod, inspect `/workspace` contents
5. **Verify model sizes** - check HuggingFace repo files, sum actual sizes
6. **Test with minimal job** - submit hello-world to verify infrastructure works
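
For step 6, a minimal job can be submitted over HTTP with only the standard library. The sketch below assumes the endpoint's handler accepts an arbitrary `input` object. The endpoint id, API key, and payload are placeholders, and the function names are illustrative.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"


def build_runsync_request(endpoint_id: str, api_key: str, job_input: dict) -> urllib.request.Request:
    """Build a POST request that runs one job synchronously on a serverless endpoint."""
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{endpoint_id}/runsync",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

If `submit(build_runsync_request(endpoint_id, api_key, {"prompt": "hello"}))` returns a result, the infrastructure works and the remaining problem is likely in model loading. If the call times out while the job sits in the queue, go back to the balance and worker-log checks.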

## Cost Notes
- Network volume: ~$0.047/hr per 50GB = ~$34/month (charged even when idle)
- GPU workers: $0.69-2.49/hr depending on GPU type (only charged when running)
- Serverless idle timeout: 5s default. To keep a worker warm, raise the timeout or set active (min) workers to 1 or more; a timeout of 0 does not keep a worker warm
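
The monthly figure above follows from simple arithmetic, sketched here so other volume sizes can be estimated. The rates are the ones quoted in these notes, treated as inputs rather than authoritative prices, and a 30-day month is assumed. Substitute current rates from the RunPod pricing page.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def monthly_volume_cost(size_gb: float, rate_per_50gb_hour: float = 0.047) -> float:
    """Monthly cost of a network volume, billed whether or not workers run."""
    return size_gb / 50 * rate_per_50gb_hour * HOURS_PER_MONTH


def gpu_cost(hours_running: float, rate_per_hour: float) -> float:
    """Cost of GPU worker time, billed only while a worker is running."""
    return hours_running * rate_per_hour


# 50GB at the quoted rate: 0.047 * 720 = 33.84, i.e. the ~$34/month above.
print(f"50GB volume:  ${monthly_volume_cost(50):.2f}/month")
print(f"100GB volume: ${monthly_volume_cost(100):.2f}/month")
```

At the quoted rate, growing the volume from 50GB to 100GB doubles the idle cost, which is worth weighing against a smaller model variant.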

## Pitfalls
- Never assume model file sizes from README comments - verify on HuggingFace
- Network volumes are region-locked (a volume in US-TX-3 can only attach to pods and workers in US-TX-3)
- Serverless endpoints scale to 0 when idle - first cold start includes model download time
- An unhealthy worker flag often means a cold start failed, not that the issue is persistent