# KV Cache Deep Dive: The 2x Reduction Mystery
## The Pattern That Demands Explanation

In Part 4, I showed you 60 benchmark runs that revealed a consistent pattern. But one table in particular kept me up at night:

| Model | Native KV | Container KV | Reduction Factor |
|---|---|---|---|
| DeepSeek-7B | 44.31 GiB | 16.57 GiB | 2.7x less |
| GPT-OSS-120B | 43.19 GiB | 23.65 GiB | 1.8x less |
| Qwen2.5-72B | 44.71 GiB | 26.72 GiB | 1.7x less |

Two questions immediately jump out:

1. Why do containers consistently allocate ~2x less KV cache? (The main mystery)
2. Why do all native runs converge to ~44 GiB? (The secondary puzzle)

Let's answer both. ...
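To make the numbers in the table concrete, here is a minimal sketch of how KV cache size is typically computed for a transformer: two tensors (K and V) per layer, one vector per KV head per token. The function and the example config values below are illustrative assumptions, not the actual configs of the models in the table.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, dtype_bytes: int = 2) -> int:
    """Theoretical KV cache size in bytes.

    The factor of 2 accounts for storing both K and V; dtype_bytes=2
    assumes fp16/bf16 entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Hypothetical GQA-style config, purely for illustration:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=32768)
print(f"{size / 2**30:.2f} GiB")  # → 10.00 GiB
```

The point of the sketch: for a fixed model, KV cache size scales linearly with sequence length and batch size, so a ~2x gap between two runs of the same model points at the serving configuration (how much context or how many slots the engine pre-allocates), not at the model weights.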