Abstract
This disclosure describes techniques for managing per-user attention key-value (KV) caches when serving large language model (LLM) recommendations at very large user scale. A cache manager maintains activity windows that bound the number of users whose KV cache tensors are retained. KV caches for users active within a hot window are stored in GPU high-bandwidth memory at higher precision, while KV caches for users active within a warm window are stored in host RAM in a more compact quantized format. Entries are demoted from the hot tier to the warm tier, with quantization, once a hot inactivity threshold elapses; promoted back to the hot tier, with dequantization, on reuse; and evicted to a cold state once a warm inactivity threshold elapses, after which a later request recomputes the KV cache. An activity-aware eviction policy may use a Poisson return model to prioritize which entries to retain. Sticky routing via consistent hashing maps repeat requests from the same user to the same serving instance, improving the cache hit rate.
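The tier transitions summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: plain lists of floats stand in for KV cache tensors, two dictionaries stand in for GPU high-bandwidth memory and host RAM, and the window lengths, the symmetric 8-bit quantization scheme, the SHA-256 hash ring, and every name (`TieredKVCache`, `HashRing`, `return_probability`, `HOT_WINDOW_S`, `WARM_WINDOW_S`) are assumptions chosen for illustration. The Poisson helper only shows the standard return-probability formula; how a per-user rate would be estimated or used for eviction ordering is not specified here.

```python
import bisect
import hashlib
import math

HOT_WINDOW_S = 300.0    # assumed hot inactivity threshold (seconds)
WARM_WINDOW_S = 3600.0  # assumed warm inactivity threshold (seconds)


def quantize(values):
    """Symmetric 8-bit quantization: returns (scale, ints in [-127, 127])."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    return scale, [round(v / scale) for v in values]


def dequantize(scale, quantized):
    """Approximate inverse of quantize()."""
    return [scale * q for q in quantized]


def return_probability(rate_per_s, horizon_s):
    """Poisson arrivals: P(at least one return within the horizon)."""
    return 1.0 - math.exp(-rate_per_s * horizon_s)


class TieredKVCache:
    """Toy two-tier cache keyed by user id (hypothetical structure)."""

    def __init__(self):
        self.hot = {}   # user_id -> (last_seen, values); stands in for GPU HBM
        self.warm = {}  # user_id -> (last_seen, scale, quantized); host RAM

    def sweep(self, now):
        # Demote hot entries idle longer than the hot window (quantize).
        for uid in [u for u, e in self.hot.items() if now - e[0] > HOT_WINDOW_S]:
            last_seen, values = self.hot.pop(uid)
            self.warm[uid] = (last_seen, *quantize(values))
        # Evict warm entries idle longer than the warm window (cold state).
        for uid in [u for u, e in self.warm.items() if now - e[0] > WARM_WINDOW_S]:
            del self.warm[uid]

    def get(self, uid, now, recompute):
        """Return (tier, values) and leave the entry in the hot tier."""
        self.sweep(now)  # full scan on each call, for simplicity only
        if uid in self.hot:
            tier, values = "hot", self.hot[uid][1]
        elif uid in self.warm:
            _, scale, quantized = self.warm.pop(uid)  # promote on reuse
            tier, values = "warm", dequantize(scale, quantized)
        else:
            tier, values = "cold", recompute(uid)  # cold: recompute the cache
        self.hot[uid] = (now, values)
        return tier, values


class HashRing:
    """Consistent-hash ring mapping a user id to a serving instance."""

    def __init__(self, instances, vnodes=64):
        points = sorted(
            (self._hash(f"{name}#{i}"), name)
            for name in instances
            for i in range(vnodes)
        )
        self._keys = [k for k, _ in points]
        self._names = [n for _, n in points]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def route(self, user_id):
        # First ring point clockwise from the user's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._keys)
        return self._names[idx]
```

In this sketch, a request first calls `HashRing.route(user_id)` to pick an instance, and that instance calls `TieredKVCache.get(user_id, now, recompute)`. A warm hit returns values that differ from the originals by at most half a quantization step, which is the precision traded for the smaller host-RAM footprint.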
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Per-User KV Cache with Active User Windowing and Tiered Memory Placement for Billion-Scale LLM Recommendation", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10713