A 6-part series investigating why Docker containers use 20-30 GB more memory than native execution on Grace Blackwell’s unified memory architecture.
Overview
When YouTube reviewers complained about DGX Spark performance, they blamed the hardware. I dug deeper and found the real culprit: Docker’s cgroups double-counting unified memory.
Key Findings
- Container overhead: 20-30 GB more memory usage in Docker
- KV cache reduction: 40-63% less KV cache in containers (1.7-2.7x reduction)
- Performance: Identical throughput in both environments; the cost is memory capacity, not speed
- Root cause: Docker’s cgroups double-count unified memory on Grace Blackwell
- Solution: Use native execution for large models on unified memory architectures
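The two KV cache figures above are the same measurement expressed two ways: a cache that is p% smaller holds 1/(1 − p/100) times less context. A quick arithmetic check (a standalone sketch, not code from the benchmark repo) confirms that 40% and 63% correspond to the 1.7x and 2.7x factors:

```python
def reduction_factor(percent_less: float) -> float:
    """Convert 'p% less KV cache' into an 'Nx reduction' factor."""
    return 1.0 / (1.0 - percent_less / 100.0)

# The endpoints reported in the findings:
for p in (40, 63):
    print(f"{p}% less KV cache -> {reduction_factor(p):.1f}x reduction")
# 40% less -> 1.7x, 63% less -> 2.7x
```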
The Series
- The Mystery - YouTubers blamed NVIDIA hardware without technical analysis
- MPI and Chroot Nightmare - Setting up proper test environments
- The Unified Memory Revelation - Why Docker’s cgroups double-count unified memory
- The Data: 60 Runs Don’t Lie - Comprehensive benchmark results across 3 models
- What I Learned (And What’s Next) - Conclusions and Phase 2 preview
- KV Cache Deep Dive - The 2x reduction mystery explained
Related Resources
- Benchmark Code & Data: benchmark-spark
- Interactive Results: brandonrc.github.io/benchmark-spark
Impact
This investigation demonstrates the importance of understanding the full stack, from hardware architecture to kernel subsystems to container runtimes. What looked like a hardware problem was actually a software architecture mismatch.