Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

Abstract

In this paper, we propose OmniServe, a novel LLM serving system that harnesses both CPU and GPU resources to mitigate interference between latency-sensitive (LS) and best-effort (BE) services and to improve throughput. Central to OmniServe is the Attention Piggybacking mechanism, which offloads the Attention computation of BE services to CPUs on the fly. This mechanism also enables asynchronous communication between CPU and GPU streams, preventing GPUs from being blocked while aggregating Attention results. Additionally, OmniServe incorporates a dynamic batching control policy that adapts to fluctuating request arrivals and performs Dense module computation with layer-wise batching. Experimental results show that OmniServe improves the SLO attainment rate of LS services by up to 1.48x while increasing BE serving throughput by up to 9.85x compared to existing state-of-the-art systems.
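The core idea of Attention Piggybacking, as described above, is to hand the attention computation of best-effort requests to CPU workers without blocking the stream serving latency-sensitive requests, and to collect the CPU results asynchronously. The following is a minimal, illustrative sketch of that scheduling pattern only; it is not OmniServe's implementation, and it uses a plain thread pool in place of real CPU/GPU streams (all function and variable names are hypothetical).

```python
# Illustrative sketch only (NOT OmniServe's code): offload attention for
# best-effort (BE) requests to CPU workers, keep serving latency-sensitive
# (LS) requests on the main stream, then aggregate BE results when ready.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def attention(q, k, v):
    # Standard scaled dot-product attention with a stable softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
ls_batch = [tuple(rng.normal(size=(4, d)) for _ in range(3)) for _ in range(2)]
be_batch = [tuple(rng.normal(size=(4, d)) for _ in range(3)) for _ in range(4)]

pool = ThreadPoolExecutor(max_workers=4)
# 1. Launch BE attention on CPU workers; the main stream is not blocked.
be_futures = [pool.submit(attention, q, k, v) for q, k, v in be_batch]
# 2. The main stream (standing in for the GPU path) serves LS attention now.
ls_out = [attention(q, k, v) for q, k, v in ls_batch]
# 3. "Piggyback" the BE results: aggregate them only once they are ready.
be_out = [f.result() for f in be_futures]
print(len(ls_out), len(be_out))
```

In a real system the aggregation step would overlap with subsequent GPU kernels rather than join at a fixed point; the sketch only shows the non-blocking submit/aggregate structure.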

Publication
In ACM SIGMOD International Conference on Management of Data 2026
Zizhao Mo
2025 - Current

Junlin Chen
2025 - Current

Huanle Xu
2021 - Current

I am currently an assistant professor in the Department of Computer and Information Science, University of Macau.

Chengzhong Xu
2019 - Current

I am currently a Chair Professor in the Department of Computer and Information Science and serve as the Dean of the Faculty of Science and Technology at the University of Macau.