The Comprehensive Test Plan

Anecdotes are interesting. Single data points are suggestive. But 60 benchmark runs across multiple models? That’s science.

Here’s what I did:

Test Matrix

  • 3 Models:
    • DeepSeek-R1-Distill-Qwen-7B (7 billion parameters)
    • Qwen2.5-72B-Instruct (72 billion parameters)
    • GPT-OSS-120B (120 billion parameters, MXFP4 quantized)
  • 2 Environments: Native (chroot) vs Container (Docker)
  • 10 Iterations per model per environment
  • Total: 60 benchmark runs
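The matrix above is small enough to enumerate directly. A minimal sketch (model names as used in this series; the variable names are illustrative, not from the actual harness):

```python
from itertools import product

MODELS = ["DeepSeek-R1-Distill-Qwen-7B", "Qwen2.5-72B-Instruct", "GPT-OSS-120B"]
ENVIRONMENTS = ["native", "container"]
ITERATIONS = range(1, 11)  # 10 iterations per (model, environment) pair

# Every (model, environment, iteration) combination is one benchmark run
runs = list(product(MODELS, ENVIRONMENTS, ITERATIONS))
print(len(runs))  # 3 models x 2 environments x 10 iterations = 60 runs
```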

Methodology

  • Framework: TensorRT-LLM (trtllm-bench CLI)
  • Workload: 50 requests, 128 output tokens per request
  • Cooldown: 5 minutes + GPU temp check (< 45°C) between runs
  • Duration: ~14 hours total
  • Metrics: Peak memory, KV cache, throughput, latency, temperature
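The cooldown step between runs can be sketched roughly as below. The `nvidia-smi` temperature query is a standard invocation; the function names, polling interval, and structure are illustrative assumptions, not the actual benchmark harness:

```python
import subprocess
import time

def gpu_temperature() -> int:
    """Read the current GPU temperature in Celsius via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"]
    )
    return int(out.decode().strip().splitlines()[0])

def wait_for_cooldown(read_temp=gpu_temperature, threshold_c=45,
                      min_wait_s=300, poll_s=30) -> int:
    """Wait the fixed cooldown, then poll until the GPU drops below threshold_c."""
    time.sleep(min_wait_s)          # fixed 5-minute cooldown
    temp = read_temp()
    while temp >= threshold_c:      # extra temperature check before the next run
        time.sleep(poll_s)
        temp = read_temp()
    return temp
```

Passing the temperature reader as a parameter keeps the wait logic testable without a GPU present.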

DeepSeek-7B Results

My smallest model, but the most shocking results:

Metric       Native         Container      Difference
Peak Memory  70.47 GiB      101.30 GiB     +30.83 GiB (44% overhead!)
KV Cache     44.31 GiB      16.57 GiB      -27.74 GiB (63% less!)
Throughput   119.79 tok/s   119.40 tok/s   0.3% difference
Std Dev (σ)  0.55           0.16           Very stable

Analysis: The 7B model shows the largest overhead at 30.83 GiB, and the container KV cache shrinks to only 37% of the native allocation. That's massive.

GPT-OSS-120B Results

The largest model (120B total parameters, though as an MoE only ~5.1B are active per token):

Metric       Native         Container      Difference
Peak Memory  71.72 GiB      93.43 GiB      +21.71 GiB (30% overhead)
KV Cache     43.19 GiB      23.65 GiB      -19.54 GiB (45% less)
Throughput   120.26 tok/s   120.41 tok/s   -0.1% difference
Std Dev (σ)  0.32           0.54           Very stable

Analysis: Despite being the largest model by parameter count, GPT-OSS shows only moderate overhead; MXFP4 quantization keeps its weight footprint close to the smaller models'. Native still gets 1.8x more KV cache.

Qwen2.5-72B Results

The 72B parameter model - the sweet spot:

Metric       Native         Container      Difference
Peak Memory  70.03 GiB      90.02 GiB      +19.99 GiB (29% overhead)
KV Cache     44.71 GiB      26.72 GiB      -17.99 GiB (40% less)
Throughput   119.33 tok/s   119.51 tok/s   -0.2% difference
Std Dev (σ)  0.28           0.20           Very stable

Analysis: Qwen shows the most efficient memory usage of all models. Native mode has the highest KV cache at 44.71 GiB - even more than the 7B model!

The Pattern: Container Overhead Scales

Look at the overhead across models:

Model     Size   Overhead           Pattern
DeepSeek  7B     +30.83 GiB (44%)   Highest %
Qwen      72B    +19.99 GiB (29%)   Middle
GPT-OSS   120B   +21.71 GiB (30%)   Middle

Insight: Overhead tracks the size of the memory allocation, not the parameter count: the 7B model requests the largest KV cache pool and pays the largest penalty. This suggests Docker's cgroup accounting scales with allocation size, consistent with the unified memory double-counting theory from Part 3.
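You can check this directly from the peak-memory numbers in the tables above. This is just arithmetic on the reported values, not a new measurement:

```python
# Peak memory (GiB) from the result tables: (native, container)
peak = {
    "DeepSeek-7B":  (70.47, 101.30),
    "Qwen2.5-72B":  (70.03, 90.02),
    "GPT-OSS-120B": (71.72, 93.43),
}

# Absolute overhead (GiB) and overhead relative to the native footprint
overheads = {m: (c - n, (c - n) / n) for m, (n, c) in peak.items()}

for model, (gib, pct) in overheads.items():
    print(f"{model}: +{gib:.2f} GiB ({pct:.0%} of native)")
```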

Performance Parity: The Good News

Across all 60 runs, performance was virtually identical:

Throughput Range:
- Native:    119.33 - 120.26 tokens/sec
- Container: 119.40 - 120.41 tokens/sec
- Difference: within ~0.3% (inside run-to-run noise)

Standard deviations were incredibly low (σ < 0.6 tokens/sec), showing:

  • Excellent thermal management
  • Consistent GPU performance
  • No thermal throttling
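The parity claim is easy to verify from the per-model mean throughputs (values copied from the result tables; a quick sanity check, not part of the benchmark harness):

```python
# Mean throughput (tokens/sec) per model: (native, container)
throughput = {
    "DeepSeek-7B":  (119.79, 119.40),
    "GPT-OSS-120B": (120.26, 120.41),
    "Qwen2.5-72B":  (119.33, 119.51),
}

# Relative difference between environments, per model
rel_diff = {m: abs(n - c) / n for m, (n, c) in throughput.items()}

for model, rel in rel_diff.items():
    print(f"{model}: {rel:.2%} relative difference")
```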

The KV Cache Revelation

This is the real story. Across all models:

Model     Native KV   Container KV   Ratio
DeepSeek  44.31 GiB   16.57 GiB      2.7x more
GPT-OSS   43.19 GiB   23.65 GiB      1.8x more
Qwen      44.71 GiB   26.72 GiB      1.7x more

Native mode provides 1.7-2.7x more KV cache across the board. This is huge for:

  • Longer context windows
  • Higher batch sizes
  • Better concurrent request handling
  • Overall throughput scaling

Temperature Analysis

Interestingly, containers ran slightly cooler:

Environment  Avg End Temp  Range
Native       60.6°C        59-61°C
Container    58.6°C        57-59°C

Why? Container overhead means more wall time spent on memory management rather than compute. The actual compute time is the same, but the longer total wall time gives the GPU more opportunity to shed heat between bursts of work.

Data Quality and Reproducibility

All 60 runs completed successfully with:

  • 0 failures - Every benchmark completed
  • Consistent results - Very low standard deviations
  • Proper thermal management - No throttling
  • Comprehensive logging - Full metadata saved

Raw data available: GitHub - benchmark-spark/results

Interactive Charts

Want to explore the data visually? Check out our interactive results page with:

  • Peak memory comparisons
  • KV cache allocation charts
  • Throughput performance graphs

What This Proves

After 60 runs across 3 models, the findings are clear:

  1. Container overhead is real: 20-31 GiB consistently
  2. KV cache reduction is significant: 1.7-2.7x less in containers
  3. Performance is identical: No speed penalty for native execution
  4. Pattern is consistent: Happens across all model sizes
  5. Root cause confirmed: Unified memory + cgroup double-counting

This isn’t a fluke. This isn’t a configuration error. This is a fundamental architectural mismatch.

Next up: What we learned, practical recommendations, and what we’re investigating next (spoiler: that KV cache scaling is fascinating).


Previous: ← Part 3: The Unified Memory Revelation
Next: Part 5: What We Learned (And What’s Next) →

GitHub Repo: benchmark-spark
Interactive Charts: Results Dashboard