The Question That Started It All

After getting both Docker and native environments working, I could finally run proper benchmarks. But I kept asking myself:

“Where is the 26GB going?”

It wasn’t container process overhead - a container runtime doesn’t add 26GB of memory on its own.
It wasn’t the Docker daemon - that’s tiny.
It wasn’t duplicate libraries - bind mounts prevent that.

So… where?

Traditional GPU Systems (The Old Way)

Let’s start with how most GPU systems work. Take an NVIDIA H100 or A100:

┌────────────────┐      ┌────────────────┐
│   CPU (Host)   │      │   GPU (Device) │
│                │      │                │
│  DDR RAM       │◄────►│   HBM (VRAM)   │
│  64-512 GB     │ PCIe │   40-80 GB     │
└────────────────┘      └────────────────┘

Key points:

  • CPU has its own RAM (DDR)
  • GPU has its own VRAM (HBM)
  • Data moves between them over PCIe bus
  • They’re separate memory spaces

When Docker runs on these systems:

  • Docker’s cgroups manage CPU RAM only
  • GPU VRAM is outside Docker’s control
  • nvidia-docker just passes through GPU access
  • No double-counting because they’re separate
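
You can see this split for yourself on any discrete-GPU box: host RAM and VRAM are reported by different tools, and a container’s memory stats only ever cover the CPU side. A minimal sketch (the container name is just a placeholder):

# Host DDR RAM and GPU VRAM are reported by different tools
free -g                                                        # RAM that cgroups account for
nvidia-smi --query-gpu=memory.total,memory.used --format=csv   # VRAM, invisible to cgroups

# A container's memory stats only cover the CPU side
docker stats --no-stream my-llm-container                      # "my-llm-container" is a placeholder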

Grace Blackwell: The Game Changer

Now look at Grace Blackwell (our DGX Spark):

┌─────────────────────────────────────┐
│       Grace Blackwell System       │
│                                     │
│  ┌────────────────────────────────┐ │
│  │   UNIFIED MEMORY (128 GB)      │ │
│  │   ┌─────────┐    ┌──────────┐  │ │
│  │   │ ARM CPU │◄──►│ GB10 GPU │  │ │
│  │   └─────────┘    └──────────┘  │ │
│  │   Cache-coherent shared memory │ │
│  └────────────────────────────────┘ │
└─────────────────────────────────────┘

Revolutionary differences:

  • One memory pool for both CPU and GPU
  • No PCIe transfers - both access the same RAM
  • Coherent at the hardware level
  • The CPU “sees” GPU allocations and vice versa

This is elegant! No more copying between CPU and GPU. It’s all one big shared memory space.
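
A rough way to convince yourself of this on a DGX Spark: the CPU-side tools and the GPU-side tools are both looking at slices of the same 128GB pool. This is a sanity check, not a precise measurement:

# Both of these describe the SAME physical memory pool
grep MemTotal /proc/meminfo   # the unified pool shows up as ordinary "system RAM"
nvidia-smi                    # GPU allocations are carved out of that same pool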

The Docker cgroup Problem

Here’s where things go sideways.

Docker uses Linux cgroups (control groups) to isolate and track container resources:

# Docker creates a memory cgroup for each container
/sys/fs/cgroup/docker/<container_id>/memory.max
/sys/fs/cgroup/docker/<container_id>/memory.current
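
If you want to peek at this accounting yourself, a minimal sketch looks like this (the container name is a placeholder, and with the systemd cgroup driver the path is /sys/fs/cgroup/system.slice/docker-<id>.scope/ instead):

CID=$(docker inspect --format '{{.Id}}' my-llm-container)   # placeholder container name
cat /sys/fs/cgroup/docker/${CID}/memory.current             # bytes Docker thinks the container uses
cat /sys/fs/cgroup/docker/${CID}/memory.max                 # the limit ("max" = unlimited)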

On a traditional GPU system, cgroups see:

  • CPU RAM: Managed by cgroup ✓
  • GPU VRAM: Outside cgroup (invisible) ✓

On Grace Blackwell unified memory, cgroups see:

  • The entire 128GB pool as “system RAM” ✗

The Double-Counting

Here’s what happens when you run a model in a Docker container on Grace Blackwell:

  1. Model loads into unified memory (let’s say 70GB)
  2. CUDA driver records this as GPU allocation
  3. Docker’s cgroup sees 70GB of “container RAM” used
  4. TensorRT-LLM tries to allocate KV cache
  5. Docker thinks: “Container is using 70GB already, only X GB left”
  6. Reality: That 70GB is THE SAME MEMORY, just counted twice!

Result: Docker reserves extra headroom because it thinks GPU memory is separate “container RAM”, even though it’s not.
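
A rough way to spot the double-counting while a model is loaded: compare what the cgroup reports against the resident set size of the inference process. This is a sanity check rather than a precise measurement, and the container and process names below are placeholders for whatever you’re running:

CID=$(docker inspect --format '{{.Id}}' my-llm-container)            # placeholder container name
CGROUP_BYTES=$(cat /sys/fs/cgroup/docker/${CID}/memory.current)
PID=$(pgrep -f trtllm-serve | head -n1)                              # placeholder process name
RSS_KB=$(ps -o rss= -p ${PID})
awk -v c="$CGROUP_BYTES" -v r="$RSS_KB" \
    'BEGIN { printf "cgroup says: %.1f GB, process RSS: %.1f GB\n", c/2^30, r*1024/2^30 }'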

The Evidence

Let’s look at what we saw:

Native (Chroot)

Total unified memory: 119.64 GB
Model peak usage:     78.31 GB
Leaves:               41.33 GB
KV cache allocated:   37.26 GB (90% of available)

Container (Docker)

Total unified memory: 119.64 GB
Model peak usage:     104.92 GB  ← What?!
Leaves:               14.72 GB
KV cache allocated:   13.31 GB (90% of available)

Docker’s cgroup sees 104.92GB used, but the actual model only needs 78GB. The difference (26.6GB) is phantom overhead from Docker’s memory accounting trying to “reserve” space that’s already in use.
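
Those KV cache numbers fall out of the peak-usage figures almost directly: the engine hands roughly 90% of whatever memory is left to the KV cache (0.9 is, as far as I can tell, TensorRT-LLM’s default free-memory fraction; adjust if your config differs). The arithmetic lands within a few hundred MB of the logged values:

awk 'BEGIN {
  total = 119.64
  native_peak = 78.31;  container_peak = 104.92
  printf "native KV cache:    ~%.1f GB\n", (total - native_peak)    * 0.9
  printf "container KV cache: ~%.1f GB\n", (total - container_peak) * 0.9
  printf "phantom overhead:    %.1f GB\n", container_peak - native_peak
}'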

Why Isn’t This a Problem on Discrete GPUs?

On H100/A100 systems, cgroups can’t even see GPU VRAM. It’s on a separate PCIe device. So there’s no double-counting:

Discrete GPU:
- cgroup tracks: CPU RAM only (e.g., 64GB)
- CUDA tracks: GPU VRAM separately (e.g., 80GB)
- Total: 64 + 80 = 144GB
- No overlap ✓

Unified Memory (Grace Blackwell):
- cgroup tracks: "System RAM" (128GB)
- CUDA tracks: Part of that same 128GB
- Docker's accounting: Confused!
- Creates overhead ✗

The KV Cache Impact

This double-counting has a massive effect on KV cache:

Available for KV cache:
- Native:    37.26 GB
- Container: 13.31 GB
- Difference: 2.8x MORE in native mode

More KV cache means:

  • Longer context windows
  • Higher batch sizes
  • More concurrent requests
  • Better overall throughput scaling

That 26GB overhead isn’t just wasted RAM - it’s stolen capacity for serving workloads.
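
To put those gigabytes in terms of tokens, here’s a back-of-envelope sketch. The model dimensions below are hypothetical (a generic 70B-class model with grouped-query attention and an FP8 KV cache); plug in your own model’s numbers:

awk 'BEGIN {
  layers = 80; kv_heads = 8; head_dim = 128; bytes_per_elem = 1   # hypothetical dims, FP8 cache
  per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V, per token
  printf "KV cache per token:  %d KB\n", per_token / 1024
  printf "tokens in 37.26 GB: ~%d\n", 37.26 * 2^30 / per_token
  printf "tokens in 13.31 GB: ~%d\n", 13.31 * 2^30 / per_token
}'

Whether or not these particular numbers match your model, the ratio is the point: the same 2.8x gap shows up as context length and batch size you can no longer serve.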

Why Performance Is Still The Same

You might wonder: “If Docker has less KV cache, why is throughput identical?”

Good question! For our specific benchmark (50 requests, 128 output tokens):

  • Even 13GB of KV cache was enough
  • We weren’t hitting the cache limit
  • Throughput was compute-bound, not memory-bound

But in production with:

  • Longer contexts (8k, 32k, 128k tokens)
  • Higher batch sizes
  • Many concurrent users

That reduced KV cache would absolutely become a bottleneck.

Can Docker Be Fixed?

Maybe? Potential solutions:

  1. Update nvidia-container-toolkit for unified memory awareness
  2. Skip the --memory limit entirely so the cgroup’s memory.max stays at “max” (see the quick check after this list)
  3. Special cgroup configuration for Grace Blackwell
  4. Wait for Docker/kernel patches that understand unified memory
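
For option 2, the quick check looks something like this (the container name is a placeholder; whether lifting the limit actually removes the phantom overhead on your setup is exactly the open question):

docker inspect --format '{{.HostConfig.Memory}}' my-llm-container   # 0 means no --memory limit was set
CID=$(docker inspect --format '{{.Id}}' my-llm-container)
cat /sys/fs/cgroup/docker/${CID}/memory.max                         # "max" means the cgroup limit is off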

But for now, the simplest solution: Use native execution for large models on Grace Blackwell.

The Takeaway

This isn’t Docker being “bad” or Grace Blackwell being “broken.” It’s a mismatch between technology generations:

  • Docker’s cgroups: Designed for discrete GPU era
  • Grace Blackwell: Next-gen unified memory architecture
  • Result: Software assumptions don’t match hardware reality

And that’s why you can’t just blame the hardware. The entire stack matters.

In the next post, I’ll show you the data: 60 comprehensive benchmark runs across 3 different models, proving this pattern holds consistently.


Previous: ← Part 2: MPI and Chroot Nightmare
Next: Part 4: The Data - 60 Runs Don’t Lie →

GitHub Repo: benchmark-spark