DGX Spark Deep Dive
A 6-part series investigating why Docker containers use 20-30 GB more memory than native execution on Grace Blackwell’s unified memory architecture.

Overview

When YouTube reviewers complained about DGX Spark performance, they blamed the hardware. I dug deeper and found the real culprit: Docker’s cgroups double-counting unified memory.

Key Findings

- Container overhead: 20-30 GB more memory usage in Docker
- KV cache reduction: 40-63% less KV cache in containers (a 1.7-2.7x reduction)
- Performance: identical throughput - no speed penalty
- Root cause: Docker’s cgroups double-count unified memory on Grace Blackwell
- Solution: use native execution for large models on unified memory architectures

The Series

1. The Mystery - YouTubers blamed NVIDIA hardware without technical analysis
2. MPI and Chroot Nightmare - Setting up proper test environments
3. The Unified Memory Revelation - Why Docker’s cgroups double-count unified memory
4. The Data: 60 Runs Don’t Lie - Comprehensive benchmark results across 3 models
5. What I Learned (And What’s Next) - Conclusions and a Phase 2 preview
6. KV Cache Deep Dive - The 2x reduction mystery explained

Related Resources

- Benchmark Code & Data: benchmark-spark
- Interactive Results: brandonrc.github.io/benchmark-spark

Impact

This investigation demonstrates the importance of understanding the full stack - from hardware architecture to kernel subsystems to container runtimes. What looked like a hardware problem was actually a software architecture mismatch.
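One way to see the accounting gap described above for yourself is to compare what the cgroup thinks the workload uses against system-wide usage. The sketch below is a minimal illustration, assuming a Linux host with cgroup v2 mounted at `/sys/fs/cgroup`; run it once natively and once inside a container and compare the cgroup figure in each case. The function names are mine, not from the benchmark code.

```python
# Minimal sketch: compare cgroup-reported memory with system-wide usage.
# Assumes Linux with cgroup v2; paths and helper names are illustrative.
from pathlib import Path


def cgroup_current_bytes():
    """Memory charged to the current cgroup, or None if cgroup v2 is absent."""
    p = Path("/sys/fs/cgroup/memory.current")
    return int(p.read_text()) if p.exists() else None


def meminfo_used_kb():
    """System-wide used memory (kB), derived from /proc/meminfo."""
    fields = {}
    for line in Path("/proc/meminfo").read_text().splitlines():
        key, value = line.split(":")
        fields[key] = int(value.split()[0])  # values are reported in kB
    return fields["MemTotal"] - fields["MemAvailable"]


if __name__ == "__main__":
    cg = cgroup_current_bytes()
    print(f"cgroup memory.current: {cg if cg is not None else 'n/a'} bytes")
    print(f"system-wide used:      {meminfo_used_kb()} kB")
```

On a unified-memory system like Grace Blackwell, a large divergence between the two numbers inside a container (but not natively) is the double-counting symptom this series investigates.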