# KV Cache Deep Dive: The 2x Reduction Mystery
## The Pattern That Demands Explanation

In Part 4, I showed you 60 benchmark runs that revealed a consistent pattern. But one table in particular kept me up at night:

| Model | Native KV | Container KV | Reduction Factor |
|---|---|---|---|
| DeepSeek-7B | 44.31 GiB | 16.57 GiB | 2.7x less |
| GPT-OSS-120B | 43.19 GiB | 23.65 GiB | 1.8x less |
| Qwen2.5-72B | 44.71 GiB | 26.72 GiB | 1.7x less |

Two questions immediately jump out:

1. Why do containers consistently allocate ~2x less KV cache? (The main mystery)
2. Why do all native runs converge to ~44 GiB? (The secondary puzzle)

Let's answer both. ...
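To make the numbers in the table concrete, here is a minimal sketch of how KV cache size is typically computed for a transformer: two tensors (K and V) per layer, one vector per KV head per token. The function and the example config values below are illustrative assumptions, not the actual configs of the models in the table.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, dtype_bytes: int = 2) -> int:
    """Theoretical KV cache size in bytes.

    The factor of 2 accounts for storing both K and V; dtype_bytes=2
    assumes fp16/bf16 entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Hypothetical GQA-style config, purely for illustration:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32768)
print(f"{size / 2**30:.2f} GiB")  # → 10.00 GiB
```

The point of the sketch: for a fixed model, KV cache size scales linearly with sequence length and batch size, so a ~2x gap between two runs of the same model points at the serving configuration (how much context or how many slots the engine pre-allocates), not at the model weights.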