The substantial memory demands of model weights and key-value (KV) caches often lead to severe memory bottlenecks in LLM serving. Existing systems address this by offloading KV caches to host memory and rapidly restoring them on demand before decoding. However, these approaches are too coarse-grained and fail to fully exploit the combined computational and storage capabilities of GPUs. In this paper, we introduce eLLM, a novel LLM serving system designed to achieve high throughput and low latency through fine-grained KV caching. The core innovation is to adaptively cache the KV entries of only a subset of tokens while dynamically recomputing the uncached tokens in parallel with decoding, thereby balancing memory usage and computational efficiency. This mechanism enables optimizations at two levels: at the request level, eLLM employs token-wise caching to adaptively adjust batch sizes and uncached-token ratios in real time; at the layer level, eLLM overlaps communication with computation and fuses kernels for resource-complementary operations to further increase throughput and reduce latency. Experiments demonstrate that eLLM achieves 3.03× higher throughput than state-of-the-art systems while satisfying strict per-output-token latency SLOs, and reduces first-token latency by 2.63×.
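To make the request-level policy concrete, the sketch below shows one plausible way a scheduler could jointly pick a batch and a cached-token ratio under a fixed GPU memory budget; it is a minimal illustration, not eLLM's actual algorithm. The names (`Request`, `kv_bytes`, `choose_plan`), the default model dimensions, and the linear KV-size cost model are assumptions introduced here for exposition.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int  # number of prompt tokens in this request

def kv_bytes(num_tokens: int, num_layers: int = 32, num_heads: int = 32,
             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Approximate size of the K and V tensors for `num_tokens` tokens
    (factor of 2 accounts for both K and V). Hypothetical cost model."""
    return 2 * num_tokens * num_layers * num_heads * head_dim * dtype_bytes

def choose_plan(pending: list[Request], gpu_budget_bytes: int,
                min_cache_ratio: float = 0.25) -> tuple[list[Request], float]:
    """Greedily grow the batch; when full KV caches no longer fit, lower the
    cached-token ratio (i.e. recompute more tokens during decoding) instead of
    shrinking the batch, down to `min_cache_ratio`."""
    batch: list[Request] = []
    cache_ratio = 1.0
    for req in pending:
        candidate = batch + [req]
        total_tokens = sum(r.prompt_len for r in candidate)
        # Largest fraction of tokens whose KV entries still fit in the budget.
        fit_ratio = min(1.0, gpu_budget_bytes / max(1, kv_bytes(total_tokens)))
        if fit_ratio >= min_cache_ratio:
            batch = candidate
            cache_ratio = fit_ratio
        else:
            break  # admitting more requests would force too much recomputation
    return batch, cache_ratio

# Example: admit requests against a 4 GiB KV budget.
batch, ratio = choose_plan([Request(2048), Request(4096), Request(8192)],
                           gpu_budget_bytes=4 * 1024**3)
print(len(batch), f"requests admitted, cache ratio {ratio:.2f}")
```

The key trade-off the sketch captures is the one the abstract describes: rather than treating KV-cache residency as all-or-nothing per request, the scheduler can trade cached memory for recomputation at token granularity, so batch size and the uncached-token ratio are tuned together.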