Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

Abstract

In this paper, we propose OmniServe, a novel LLM serving system that harnesses both CPU and GPU resources to mitigate interference between latency-sensitive (LS) and best-effort (BE) services and to improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also enables asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy that adapts to fluctuating request arrivals and performs Dense module computation with layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate of LS services by up to 1.48x while increasing BE serving throughput by up to 9.85x compared to existing state-of-the-art systems.
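The core idea of Attention Piggybacking, as described above, is to hand the attention computation of best-effort requests to CPU workers without blocking the stream serving latency-sensitive requests, and to collect the CPU results asynchronously. The following is a minimal, illustrative sketch of that scheduling pattern only; it is not OmniServe's implementation, and it uses a plain thread pool in place of real CPU/GPU streams (all function and variable names are hypothetical).

```python
# Illustrative sketch only (NOT OmniServe's code): offload attention for
# best-effort (BE) requests to CPU workers, keep serving latency-sensitive
# (LS) requests on the main stream, then aggregate BE results when ready.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def attention(q, k, v):
    # Standard scaled dot-product attention with a stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
ls_batch = [tuple(rng.normal(size=(4, d)) for _ in range(3)) for _ in range(2)]
be_batch = [tuple(rng.normal(size=(4, d)) for _ in range(3)) for _ in range(4)]

pool = ThreadPoolExecutor(max_workers=4)
# 1. Launch BE attention on CPU workers; the main stream is not blocked.
be_futures = [pool.submit(attention, q, k, v) for q, k, v in be_batch]
# 2. The main stream (standing in for the GPU path) serves LS attention now.
ls_out = [attention(q, k, v) for q, k, v in ls_batch]
# 3. "Piggyback" the BE results: aggregate them only once they are ready.
be_out = [f.result() for f in be_futures]
print(len(ls_out), len(be_out))
```

In a real system the aggregation step would overlap with subsequent GPU kernels rather than join at a fixed point; the sketch only shows the non-blocking submit/aggregate structure.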

Publication
In ACM SIGMOD International Conference on Management of Data 2026
Zizhao Mo
2025 - Current

Junlin Chen
2025 - Current

Huanle Xu
2021 - Current

I am currently an assistant professor in the Department of Computer and Information Science, University of Macau.

Chengzhong Xu
2019 - Current

I am currently a Chair Professor in the Department of Computer and Information Science and serve as the Dean of the Faculty of Science and Technology at the University of Macau.