The Pattern That Demands Explanation
In Part 4, I showed you 60 benchmark runs that revealed a consistent pattern. But one table in particular kept me up at night:
| Model | Native KV | Container KV | Reduction Factor |
|---|---|---|---|
| DeepSeek-7B | 44.31 GiB | 16.57 GiB | 2.7x less |
| GPT-OSS-120B | 43.19 GiB | 23.65 GiB | 1.8x less |
| Qwen2.5-72B | 44.71 GiB | 26.72 GiB | 1.7x less |
Two questions immediately jump out:
- Why do containers consistently allocate ~2x less KV cache? (The main mystery)
- Why do all native runs converge to ~44 GiB? (The secondary puzzle)
Let’s answer both.
Question 1: Why ~2x Less KV Cache in Containers?
The High-Level Answer
From Part 3, you know Docker’s cgroups double-count unified memory. But how does that translate into 40-60% less KV cache?
The answer lies in how TensorRT-LLM allocates KV cache memory.
TensorRT-LLM’s Memory Budget Calculation
When TensorRT-LLM starts up, it performs a memory budget calculation:
1. Query total available memory
2. Load model weights into unified memory
3. Allocate space for activations and temporary buffers
4. Calculate remaining "free" memory
5. Allocate KV cache = f(remaining memory)
Step 1 is where Docker breaks everything.
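Here’s a minimal sketch of that budget logic, just to make the five steps concrete. This is not TensorRT-LLM’s actual source; the function name, the workspace estimate, and the 90% fraction are illustrative assumptions, but the shape of the calculation matches the steps above.

```python
import torch

def kv_cache_budget_bytes(weights_bytes: int, workspace_bytes: int,
                          free_fraction: float = 0.90) -> int:
    """Illustrative budget calculation -- not TensorRT-LLM's real code."""
    # Step 1: ask the CUDA runtime how much memory is currently free.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    # Steps 2-4: account for weights, activations, and temporary buffers.
    remaining = free_bytes - weights_bytes - workspace_bytes
    # Step 5: hand most of what's left to the KV cache pool.
    return max(0, int(remaining * free_fraction))
```

Everything downstream hinges on step 1: whatever the runtime reports as “free” caps the entire budget.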
Native Execution: Clean Memory View
In native/chroot execution, TensorRT-LLM queries memory directly via CUDA:
┌─────────────────────────────────────────────┐
│ Total Grace Blackwell Unified Memory: 128 GB │
│ │
│ ┌─────────────┐ ┌──────────┐ ┌────────┐ │
│ │ Model │ │ System │ │ FREE │ │
│ │ Weights │ │ Reserve │ │ │ │
│ │ ~15 GB │ │ ~8 GB │ │ ~105GB │ │
│ └─────────────┘ └──────────┘ └────────┘ │
│ │
│ TensorRT-LLM sees: 105 GB available │
└─────────────────────────────────────────────┘
Native allocation logic:
- Total available: ~105 GB
- Model + activations: ~60-65 GB
- Remaining: ~40-45 GB
- KV cache allocated: ~44 GB ✓
Clean, simple, accurate.
Container Execution: Docker’s Interference
In Docker containers, TensorRT-LLM still uses CUDA APIs, but Docker’s cgroup accounting pollutes the results:
┌─────────────────────────────────────────────┐
│ Total Unified Memory: 128 GB │
│ │
│ ┌─────────────┐ ┌──────────┐ ┌────────┐ │
│ │ Model │ │ System │ │ FREE │ │
│ │ Weights │ │ Reserve │ │ │ │
│ │ ~15 GB │ │ ~8 GB │ │ ~105GB │ │
│ └─────────────┘ └──────────┘ └────────┘ │
│ ↓ │
│ Docker cgroup sees this as │
│ "container RAM usage" │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ Docker's view: │ │
│ │ "Container using 70GB RAM" │ │
│ │ "Only 58GB left for alloc" │ │
│ └──────────────────────────────┘ │
│ │
│ TensorRT-LLM sees: ~60 GB available │
│ (Reduced by Docker's accounting!) │
└─────────────────────────────────────────────┘
Container allocation logic:
- Docker reports: ~60 GB available (wrong!)
- Model + activations: ~60-65 GB
- Remaining: ~0-5 GB (uh oh…)
- KV cache allocated: ~16-26 GB (conservative!)
TensorRT-LLM plays it safe and allocates less KV cache to avoid OOM errors.
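You can watch the two conflicting views from inside a running container. A small diagnostic along these lines (assuming cgroup v2 and a PyTorch build with CUDA in the image; adjust paths for your setup) prints the cgroup’s “usage” next to what the CUDA runtime reports as free:

```python
from pathlib import Path
import torch

GIB = 1024 ** 3

def cgroup_usage_gib() -> float:
    # cgroup v2 path; on cgroup v1 read /sys/fs/cgroup/memory/memory.usage_in_bytes
    return int(Path("/sys/fs/cgroup/memory.current").read_text()) / GIB

def cuda_free_gib() -> float:
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / GIB

if __name__ == "__main__":
    print(f"cgroup reports container usage : {cgroup_usage_gib():6.1f} GiB")
    print(f"CUDA runtime reports free      : {cuda_free_gib():6.1f} GiB")
    # When unified-memory allocations are double-counted, the first number is
    # inflated -- and anything budgeting against it will under-allocate.
```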
The Math: Where Does the Overhead Come From?
Look at this correlation from Part 5:
| Model | Container Overhead | KV Cache Reduction |
|---|---|---|
| DeepSeek-7B | +30.83 GiB | -27.74 GiB |
| Qwen-72B | +19.99 GiB | -17.99 GiB |
| GPT-OSS-120B | +21.71 GiB | -19.54 GiB |
Notice: Container overhead ≈ KV cache reduction (a gap of only ~2-3 GiB)
This isn’t a coincidence. Here’s what’s happening:
Native memory allocation:
Total: 128 GB
System: ~8 GB (OS, drivers, etc.)
Model: ~15 GB (weights)
Workspace: ~45 GB (activations, temp buffers)
KV Cache: ~44 GB
Available: ~16 GB (safety margin)
────────────────
Total: 128 GB ✓
Container memory allocation:
Total: 128 GB
System: ~8 GB (OS, drivers, etc.)
Model: ~15 GB (weights)
Docker: +20-30 GB (phantom overhead from double-counting!)
Workspace: ~45 GB (activations, temp buffers)
KV Cache: ~16-26 GB (reduced!)
Available: ~14-24 GB (same safety margin)
────────────────
Total: 128 GB ✓
Docker’s double-counting creates phantom overhead that steals memory from the KV cache budget.
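The near 1:1 relationship is easy to check against the Part 5 numbers above:

```python
# Container overhead vs. KV cache reduction, in GiB (from the table above).
pairs = {
    "DeepSeek-7B":  (30.83, 27.74),
    "Qwen-72B":     (19.99, 17.99),
    "GPT-OSS-120B": (21.71, 19.54),
}
for model, (overhead, reduction) in pairs.items():
    print(f"{model:13s} gap = {overhead - reduction:4.2f} GiB")
# DeepSeek-7B   gap = 3.09 GiB
# Qwen-72B      gap = 2.00 GiB
# GPT-OSS-120B  gap = 2.17 GiB
```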
Why Not 50% Exactly?
You might expect exactly a 50% reduction if Docker were reporting half the memory. But the reduction varies (1.7x-2.7x) because:
- Base model size differs - Smaller models have proportionally more KV cache
- Docker overhead scales - Larger allocations = more double-counting
- TensorRT-LLM is conservative - It leaves safety margins
The pattern: smaller models suffer a worse reduction ratio because the Docker overhead consumes a larger share of their KV cache budget.
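Expressed as a share of each model’s native KV budget (numbers from the tables above), the same overhead bites the smallest model hardest:

```python
# Fraction of the native KV cache lost to containerization.
native_kv = {"DeepSeek-7B": 44.31, "Qwen-72B": 44.71, "GPT-OSS-120B": 43.19}
reduction = {"DeepSeek-7B": 27.74, "Qwen-72B": 17.99, "GPT-OSS-120B": 19.54}
for model in native_kv:
    print(f"{model:13s} loses {reduction[model] / native_kv[model]:.0%} of its KV cache")
# DeepSeek-7B   loses 63% of its KV cache
# Qwen-72B      loses 40% of its KV cache
# GPT-OSS-120B  loses 45% of its KV cache
```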
Question 2: Why Do All Natives Converge to ~44 GiB?
This is the secondary mystery that caught my attention:
| Model Size | Parameters | Native KV Cache |
|---|---|---|
| DeepSeek | 7B | 44.31 GiB |
| Qwen | 72B | 44.71 GiB |
| GPT-OSS | 120B | 43.19 GiB |
Wait… the 7B model uses essentially the same KV cache as the 120B model?
Why This Seems Wrong
Intuitively, you’d expect:
- Larger models → More parameters → More layers → More KV cache needed
But that’s not what the data shows!
The Explanation: Available Memory Ceiling
TensorRT-LLM doesn’t allocate KV cache based on model size. It allocates based on available memory after loading the model.
Let’s look at the native memory breakdown:
DeepSeek-7B (Small model):
Total memory: 128 GB
Model weights: ~7 GB (small!)
Activations: ~10 GB
System overhead: ~8 GB
Available: ~103 GB
────────────────────────────
KV cache allocated: 44.31 GB (43% of available)
Safety margin: ~58 GB
Qwen2.5-72B (Large model):
Total memory: 128 GB
Model weights: ~45 GB (large!)
Activations: ~35 GB
System overhead: ~8 GB
Available: ~40 GB
────────────────────────────
KV cache allocated: 44.71 GB (112% of calculated available!)
Safety margin: Minimal
Wait, that math doesn’t work…
The Real Answer: TensorRT-LLM’s Allocation Strategy
After analyzing the pattern, I believe TensorRT-LLM uses a ceiling-based allocation strategy:
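In pseudocode, the hypothesis looks roughly like this. It’s my reconstruction from the observed pattern, not TensorRT-LLM source, and the ~44 GiB constant is inferred from the table above:

```python
KV_CEILING_GIB = 44.0  # inferred from the convergence across 7B-120B models

def hypothesized_kv_cache_gib(total_gib: float, weights_gib: float,
                              runtime_gib: float, system_gib: float) -> float:
    """Ceiling-based hypothesis: take what's free, but never more than the cap."""
    free_gib = total_gib - weights_gib - runtime_gib - system_gib
    return min(free_gib, KV_CEILING_GIB)
```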
This would explain why:
- Small models: Have tons of available memory, but hit the 44 GB ceiling
- Large models: Have less available memory, but still allocate close to 44 GB
Why a Ceiling Exists
There are good engineering reasons for a KV cache ceiling:
- Prevents memory thrashing - Too large a cache can hurt performance
- Reserves memory for batching - Need room for concurrent requests
- Conservative defaults - Better to be safe than OOM
- Hardware limitations - Memory bandwidth or cache line considerations
The 72B Anomaly
Notice Qwen-72B has the highest native KV cache (44.71 GB). This might be because:
- Its model architecture is more memory-efficient
- Quantization or sparsity reduces weight memory
- TensorRT-LLM optimizations for this model family
But it’s still very close to the ~44 GB ceiling.
What This Means for Production
Understanding these two patterns has huge implications:
1. Container KV Cache Reduction Matters More at Scale
For small workloads (like my benchmark: 50 requests, 128 tokens), even 16 GB of KV cache is enough.
But in production:
- 8k context window: You’ll hit the limit fast
- 32k context window: Containers will OOM or refuse requests
- 128k context window: Forget it - you need native execution
The 2x KV cache reduction becomes a 2x throughput bottleneck at scale.
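To make that concrete, here’s a back-of-envelope using an illustrative Llama-style 70B geometry (80 layers, 8 KV heads of dimension 128, FP16 cache). These architecture numbers are assumptions for illustration, not measurements from the benchmark:

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V, for every layer, KV head, and head dimension
    return 2 * layers * kv_heads * head_dim * dtype_bytes

GIB = 1024 ** 3
per_token = kv_bytes_per_token()               # 327,680 bytes ≈ 320 KiB/token
per_32k_request = 32_768 * per_token / GIB     # 10.0 GiB per 32k-token request

print(f"32k requests in flight with 16 GiB KV cache: {16 / per_32k_request:.1f}")
print(f"32k requests in flight with 44 GiB KV cache: {44 / per_32k_request:.1f}")
# ~1.6 vs ~4.4 concurrent requests -- the KV cache budget is the concurrency ceiling.
```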
2. Model Size Doesn’t Predict KV Cache Availability
The ~44 GB convergence means:
- Small models (7B) don’t get “extra” KV cache just because they’re small
- Large models (72B+) aren’t starved of KV cache just because they’re large
Recommendation: Choose models based on:
- Accuracy requirements (obviously)
- Weight memory vs KV cache tradeoff
- Don’t assume bigger = less memory for serving
3. Docker is OK for Specific Use Cases
If you’re running:
- Short contexts (≤2k tokens)
- Low concurrency (1-4 simultaneous requests)
- Small batches
Then Docker’s reduced KV cache might be acceptable. But know you’re leaving 40-60% of serving capacity on the table.
4. Native Execution for Maximum Throughput
For production serving with:
- Long contexts (8k+)
- High concurrency (10+ simultaneous users)
- Large batch sizes
Use native/chroot execution. The 2x KV cache advantage translates directly to 2x serving capacity.
Visual Summary
Let me show you the full picture:
┌────────────────────────────────────────────────────────┐
│ NATIVE EXECUTION (Optimal) │
│ ┌──────────┬─────────────┬──────────────────────┐ │
│ │ Model │ Activations │ KV Cache (~44 GB) │ │
│ │ 7-45 GB │ 10-35 GB │ ████████████████ │ │
│ └──────────┴─────────────┴──────────────────────┘ │
│ Clean memory view = Maximum KV cache │
└────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────┐
│ CONTAINER EXECUTION (Reduced) │
│ ┌──────────┬──────────┬─────────────┬──────────┐ │
│ │ Model │ Docker │ Activations │ KV Cache │ │
│ │ 7-45 GB │ Phantom │ 10-35 GB │ (~20 GB) │ │
│ │ │ +20-30GB │ │ ████ │ │
│ └──────────┴──────────┴─────────────┴──────────┘ │
│ Docker overhead = Stolen KV cache budget │
└────────────────────────────────────────────────────────┘
Key Takeaways
On the ~2x container reduction:
- Docker’s cgroup double-counts unified memory
- TensorRT-LLM sees “less available” memory
- Conservatively allocates smaller KV cache
- Container overhead ≈ KV cache reduction (almost 1:1)
- Smaller models suffer worse reduction ratios
On the ~44 GiB native convergence:
- TensorRT-LLM likely has a KV cache ceiling (~44 GB)
- Small models hit the ceiling despite having more free memory
- Large models allocate near the ceiling despite tight budgets
- This is good engineering: prevents pathological cases
- Model size doesn’t predict KV cache availability
What’s Next?
This deep dive answered the “what” and “why” of the KV cache patterns. But there are still open questions:
- Can we configure TensorRT-LLM’s KV cache ceiling? (see the sketch after this list)
- Can Docker be patched to handle unified memory correctly?
- Do other unified memory systems (AMD MI300X) show the same pattern?
- What’s the optimal KV cache size for different workloads?
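On the first question, one knob worth checking is the KV cache configuration exposed by TensorRT-LLM’s high-level LLM API, which includes the fraction of free memory handed to the KV pool. The sketch below reflects my reading of the docs; the import path and field names may differ between releases, and the model name is just a placeholder:

```python
# Sketch only -- verify the exact API against your installed TensorRT-LLM version.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_config = KvCacheConfig(free_gpu_memory_fraction=0.80)  # share of *free* memory for KV
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_config=kv_config)
```

Note that this is a fraction of whatever the runtime reports as free, so it tunes the budget but doesn’t fix the container-side accounting itself.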
These are questions for Phase 2 of the investigation. For now, the message is clear:
If you’re running large language models on Grace Blackwell in production, understand your KV cache allocation. It’s the difference between maximal throughput and leaving half your serving capacity on the table.
Previous: ← Part 5: What I Learned (And What’s Next)
GitHub Repo: benchmark-spark
Interactive Charts: Results Dashboard
Got questions or observations? Open an issue or discussion on the GitHub repo!