Abstract
Techniques are described for managing SSD-backed embedding tables with a compression-aware, asymmetric-precision pipeline. Embedding rows are stored on SSD in reduced-precision integer form (e.g., INT8 or INT4) together with per-row affine quantization parameters, namely a scale and a zero-point. Requested rows are transferred in compressed form into a GPU-accessible staging buffer, and a GPU kernel dequantizes them into a full-precision representation (e.g., FP16 or FP32) held in a high-bandwidth-memory cache. A double-buffer schedule overlaps SSD reads with GPU dequantization across iterations. When cached embeddings that have been modified are evicted, a GPU kernel re-quantizes the updated full-precision values, updates the per-row quantization parameters, and writes the reduced-precision rows back to SSD. A codebook manager may track distribution drift and adapt quantization policies over time.
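The per-row affine quantization named in the abstract can be illustrated with a minimal sketch. It is not the disclosed GPU kernel: it runs on the CPU in plain Python and only shows the arithmetic of a round trip. The function names `quantize_row` and `dequantize_row`, the unsigned 8-bit code range, the min/max calibration, and the choice to widen the range so that 0.0 is always representable are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed code range: unsigned 8-bit. The disclosure also mentions INT4,
# which would use a narrower range under the same arithmetic.
QMIN, QMAX = 0, 255


def quantize_row(row: List[float]) -> Tuple[List[int], float, int]:
    """Quantize one embedding row; return (codes, scale, zero_point)."""
    # Widen the range to include 0.0 so the zero-point falls inside
    # [QMIN, QMAX] (an assumed convention, not stated in the source).
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    # Guard against a constant all-zero row, where hi == lo.
    scale = (hi - lo) / (QMAX - QMIN) if hi > lo else 1.0
    zero_point = max(QMIN, min(QMAX, round(QMIN - lo / scale)))
    codes = [
        max(QMIN, min(QMAX, round(x / scale) + zero_point))
        for x in row
    ]
    return codes, scale, zero_point


def dequantize_row(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate full-precision values from the stored codes."""
    return [(q - zero_point) * scale for q in codes]


if __name__ == "__main__":
    row = [-0.5, 0.0, 0.25, 1.0]
    codes, scale, zero_point = quantize_row(row)
    restored = dequantize_row(codes, scale, zero_point)
    print(codes, scale, zero_point)
    print(restored)
```

Under these assumptions, each restored value differs from the original by at most about half a quantization step (`scale / 2`), and only the integer codes plus the two per-row parameters would need to be stored and transferred.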
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Compression-Aware Asymmetric-Precision Pipeline for SSD-Backed Embedding Tables", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10660