A 6-part series investigating why Docker containers use 20-30 GB more memory than native execution on Grace Blackwell’s unified memory architecture.
Overview
When YouTube reviewers complained about DGX Spark performance, they blamed the hardware. I dug deeper and found the real culprit: Docker’s cgroups double-counting unified memory.
Key Findings
- Container overhead: 20-30 GB more memory usage in Docker
- KV cache reduction: 40-63% less KV cache in containers (1.7-2.7x reduction)
- Performance: Identical throughput in both environments; the cost is memory capacity, not speed
- Root cause: Docker’s cgroups double-count unified memory on Grace Blackwell
- Solution: Use native execution for large models on unified memory architectures
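The two KV cache figures above are the same measurement expressed two ways: a cache that is p% smaller holds 1/(1 − p/100) times less context. A quick arithmetic check (a standalone sketch, not code from the benchmark repo) confirms that 40% and 63% correspond to the 1.7x and 2.7x factors:

```python
def reduction_factor(percent_less: float) -> float:
    """Convert 'p% less KV cache' into an 'Nx reduction' factor."""
    return 1.0 / (1.0 - percent_less / 100.0)

# The endpoints reported in the findings:
for p in (40, 63):
    print(f"{p}% less KV cache -> {reduction_factor(p):.1f}x reduction")
# 40% less -> 1.7x, 63% less -> 2.7x
```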
The Series
- The Mystery - YouTubers blamed NVIDIA hardware without technical analysis
- MPI and Chroot Nightmare - Setting up proper test environments
- The Unified Memory Revelation - Why Docker’s cgroups double-count unified memory
- The Data: 60 Runs Don’t Lie - Comprehensive benchmark results across 3 models
- What I Learned (And What’s Next) - Conclusions and Phase 2 preview
- KV Cache Deep Dive - The 2x reduction mystery explained
Related Resources
- Benchmark Code & Data: benchmark-spark
- Interactive Results: brandonrc.github.io/benchmark-spark
Impact
This investigation demonstrates the importance of understanding the full stack, from hardware architecture to kernel subsystems to container runtimes. What looked like a hardware problem was actually a software architecture mismatch.