The Comprehensive Test Plan
Anecdotes are interesting. Single data points are suggestive. But 60 benchmark runs across multiple models? That’s science.
Here’s what I did:
Test Matrix
- 3 Models:
- DeepSeek-R1-Distill-Qwen-7B (7 billion parameters)
- Qwen2.5-72B-Instruct (72 billion parameters)
- GPT-OSS-120B (120 billion parameters, MXFP4 quantized)
- 2 Environments: Native (chroot) vs Container (Docker)
- 10 Iterations per model per environment
- Total: 60 benchmark runs
Methodology
- Framework: TensorRT-LLM (trtllm-bench CLI)
- Workload: 50 requests, 128 output tokens per request
- Cooldown: 5 minutes + GPU temp check (< 45°C) between runs
- Duration: ~14 hours total
- Metrics: Peak memory, KV cache, throughput, latency, temperature
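For anyone reproducing this, the orchestration around each run is simple: sit through the fixed cooldown, poll the GPU temperature until it drops below 45°C, then fire the next benchmark. Here's a minimal sketch of that loop; the actual trtllm-bench arguments are elided (see the repo for the real invocations), and the polling interval is my own choice.

```python
import subprocess
import time

COOLDOWN_SECONDS = 5 * 60   # fixed 5-minute cooldown between runs
TEMP_THRESHOLD_C = 45       # next run waits until the GPU is below this
ITERATIONS = 10             # per model, per environment

def gpu_temperature() -> int:
    """Read the current GPU temperature (°C) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip().splitlines()[0])

def wait_for_cooldown() -> None:
    """Sleep through the fixed cooldown, then poll until the GPU is cool."""
    time.sleep(COOLDOWN_SECONDS)
    while gpu_temperature() >= TEMP_THRESHOLD_C:
        time.sleep(30)          # polling interval is arbitrary

def run_benchmark(run_id: int) -> None:
    """Launch one benchmark iteration via the trtllm-bench CLI.
    The real flag set (model, requests, output tokens) is elided here;
    see the repo for the exact invocations."""
    args = ["trtllm-bench"]     # + the actual benchmark flags
    subprocess.run(args, check=True)

for i in range(ITERATIONS):
    wait_for_cooldown()
    run_benchmark(i)
```

The temperature gate matters more than the fixed sleep: it's what keeps run N+1 from paying a thermal penalty for run N.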
DeepSeek-7B Results
My smallest model, but the most shocking results:
| Metric | Native | Container | Difference |
|---|---|---|---|
| Peak Memory | 70.47 GiB | 101.30 GiB | +30.83 GiB (44% overhead!) |
| KV Cache | 44.31 GiB | 16.57 GiB | -27.74 GiB (63% less!) |
| Throughput | 119.79 tok/s | 119.40 tok/s | 0.3% difference |
| Throughput Std Dev (σ, tok/s) | 0.55 | 0.16 | Very stable |
Analysis: The 7B model shows the largest overhead at 30.8 GiB, and the container's KV cache shrinks to just 37% of the native allocation. That's massive.
GPT-OSS-120B Results
The largest model (120B params, though MoE means 5.1B active):
| Metric | Native | Container | Difference |
|---|---|---|---|
| Peak Memory | 71.72 GiB | 93.43 GiB | +21.71 GiB (30% overhead) |
| KV Cache | 43.19 GiB | 23.65 GiB | -19.54 GiB (45% less) |
| Throughput | 120.26 tok/s | 120.41 tok/s | -0.1% difference |
| Throughput Std Dev (σ, tok/s) | 0.32 | 0.54 | Very stable |
Analysis: Despite being the largest model, GPT-OSS shows moderate overhead due to MXFP4 quantization. Native still has 1.8x more KV cache.
Qwen2.5-72B Results
The 72B parameter model - the sweet spot:
| Metric | Native | Container | Difference |
|---|---|---|---|
| Peak Memory | 70.03 GiB | 90.02 GiB | +19.99 GiB (29% overhead) |
| KV Cache | 44.71 GiB | 26.72 GiB | -17.99 GiB (40% less) |
| Throughput | 119.33 tok/s | 119.51 tok/s | -0.2% difference |
| Throughput Std Dev (σ, tok/s) | 0.28 | 0.20 | Very stable |
Analysis: Qwen shows the most efficient memory usage of all models. Native mode has the highest KV cache at 44.71 GiB - even more than the 7B model!
The Pattern: Container Overhead Scales
Look at the overhead across models:
| Model | Size | Overhead | Pattern |
|---|---|---|---|
| DeepSeek | 7B | +30.83 GiB (44%) | Highest % |
| Qwen | 72B | +19.99 GiB (29%) | Lowest % |
| GPT-OSS | 120B | +21.71 GiB (30%) | Middle |
Insight: Overhead appears proportional to base memory usage, not model size. This suggests Docker’s cgroup accounting scales with allocation size, confirming our unified memory double-counting theory.
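If you want to see the double-counting for yourself, one sanity check is to compare what the container's cgroup thinks it has been charged against what the inference process itself reports as resident memory. The sketch below assumes cgroup v2 (the memory.current file) and takes the server PID as an argument; it's a diagnostic aid, not part of the benchmark harness.

```python
import sys
from pathlib import Path

def read_int(path: str) -> int:
    """Read a single integer from a cgroup/procfs file."""
    return int(Path(path).read_text().strip())

# What the container's cgroup has been charged (cgroup v2 layout assumed).
cgroup_current = read_int("/sys/fs/cgroup/memory.current")

# What the inference process itself holds resident, from /proc/<pid>/status.
pid = sys.argv[1]
rss_kib = 0
for line in Path(f"/proc/{pid}/status").read_text().splitlines():
    if line.startswith("VmRSS:"):
        rss_kib = int(line.split()[1])   # value is reported in KiB

print(f"cgroup memory.current: {cgroup_current / 2**30:.2f} GiB")
print(f"process VmRSS:         {rss_kib / 2**20:.2f} GiB")
# On a unified-memory machine, a large gap between these two numbers is
# consistent with GPU allocations being charged to the cgroup on top of
# the process's ordinary host-side memory.
```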
Performance Parity: The Good News
Across all 60 runs, performance was virtually identical:
Throughput Range:
- Native: 119.33 - 120.26 tokens/sec
- Container: 119.40 - 120.41 tokens/sec
- Difference: < 0.3% (within margin of error)
Standard deviations were incredibly low (σ < 0.6 tokens/sec), showing:
- Excellent thermal management
- Consistent GPU performance
- No thermal throttling
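If you pull the raw per-run numbers from the repo, the figures above are just a mean and sample standard deviation over the ten throughput values for each model/environment pair; a quick sketch with illustrative values (not the actual measurements):

```python
import statistics

# Ten throughput samples (tok/s) for one model/environment pair.
# These values are illustrative, not the actual runs from the repo.
samples = [119.8, 119.7, 120.1, 119.9, 119.6,
           120.0, 119.8, 119.9, 120.2, 119.7]

mean = statistics.mean(samples)
sigma = statistics.stdev(samples)   # sample standard deviation, as quoted above
print(f"mean = {mean:.2f} tok/s, sigma = {sigma:.2f} tok/s")
```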
The KV Cache Revelation
This is the real story. Across all models:
| Model | Native KV | Container KV | Ratio |
|---|---|---|---|
| DeepSeek | 44.31 GiB | 16.57 GiB | 2.7x more |
| GPT-OSS | 43.19 GiB | 23.65 GiB | 1.8x more |
| Qwen | 44.71 GiB | 26.72 GiB | 1.7x more |
Native mode provides 1.7-2.7x more KV cache across the board. This is huge for:
- Longer context windows
- Higher batch sizes
- Better concurrent request handling
- Overall throughput scaling
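To turn those gibibytes into something tangible, here's a back-of-the-envelope for how a KV-cache budget converts into cacheable tokens, using the standard per-token KV sizing (2 × layers × KV heads × head dim × bytes per element). The layer/head/dimension numbers below are stand-ins for a 7B-class GQA model, not the exact configuration TensorRT-LLM used in these runs.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: keys + values, across every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class model with grouped-query attention, FP16 KV cache.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)

# Container vs native KV pools for the DeepSeek runs (GiB), from the table above.
for budget_gib in (16.57, 44.31):
    tokens = int(budget_gib * 2**30 / per_token)
    print(f"{budget_gib:.2f} GiB -> ~{tokens:,} cacheable tokens, "
          f"shared across all concurrent sequences")
```

Whatever the exact per-token size, the native pool holds roughly 2.7x as many tokens for DeepSeek, which is exactly the headroom for longer contexts and bigger batches that the list above is pointing at.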
Temperature Analysis
Interestingly, containers ran slightly cooler:
| Environment | Avg End Temp | Range |
|---|---|---|
| Native | 60.6°C | 59-61°C |
| Container | 58.6°C | 57-59°C |
Why? The container's extra memory management stretches each run's wall time without adding compute, so the GPU spends proportionally more time idle (and in cooldown between runs) and finishes each run a couple of degrees cooler.
Data Quality and Reproducibility
All 60 runs completed successfully with:
- ✅ 0 failures - Every benchmark completed
- ✅ Consistent results - Very low standard deviations
- ✅ Proper thermal management - No throttling
- ✅ Comprehensive logging - Full metadata saved
Raw data available: GitHub - benchmark-spark/results
Interactive Charts
Want to explore the data visually? Check out our interactive results page with:
- Peak memory comparisons
- KV cache allocation charts
- Throughput performance graphs
What This Proves
After 60 runs across 3 models, the findings are clear:
- Container overhead is real: 20-31 GiB, consistently
- KV cache reduction is significant: 1.7-2.7x less in containers
- Performance is identical: No speed penalty for native execution
- Pattern is consistent: Happens across all model sizes
- Root cause confirmed: Unified memory + cgroup double-counting
This isn’t a fluke. This isn’t a configuration error. This is a fundamental architectural mismatch.
Next up: What we learned, practical recommendations, and what we’re investigating next (spoiler: that KV cache scaling is fascinating).
Previous: ← Part 3: The Unified Memory Revelation
Next: Part 5: What We Learned (And What’s Next) →
GitHub Repo: benchmark-spark
Interactive Charts: Results Dashboard