Inventor(s)

Abstract

Techniques are described for SSD-backed embedding tables using a compression-aware, asymmetric-precision pipeline. Embedding rows are stored on SSD in reduced-precision integer form (e.g., INT8 or INT4) together with per-row affine quantization parameters including a scale and a zero-point. Requested rows are transferred in compressed form into a GPU-accessible staging buffer, and a GPU kernel dequantizes the rows into a full-precision representation (e.g., FP16/FP32) within a high-bandwidth-memory cache. A double-buffer schedule overlaps SSD reads with GPU dequantization across iterations. When cached embeddings are modified and evicted, a GPU kernel re-quantizes the updated full-precision values, updates per-row quantization parameters, and writes the reduced-precision rows back to SSD. A codebook manager may track distribution drift and adapt quantization policies over time.
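The per-row affine scheme mentioned above can be illustrated with a minimal host-side sketch. It is not the GPU kernel from the disclosure. The function names `quantize_row` and `dequantize_row` are hypothetical, and two choices here are assumptions rather than details from the source: codes are stored as unsigned integers, and the range is widened to include zero so that the zero-point falls inside the code range.

```python
from typing import List, Tuple


def quantize_row(row: List[float], bits: int = 8) -> Tuple[List[int], float, int]:
    """Affine-quantize one non-empty embedding row.

    Returns (codes, scale, zero_point), where each code is in [0, 2**bits - 1].
    Hypothetical helper: the disclosure does not specify this exact formula.
    """
    qmax = (1 << bits) - 1
    # Assumption: widen the range to include 0.0 so zero_point lies in [0, qmax].
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    # A constant all-zero row has no spread; fall back to a scale of 1.0.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(x / scale) + zero_point)) for x in row]
    return codes, scale, zero_point


def dequantize_row(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to floating point: x ~ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


# Read path: a compressed row is expanded to full precision for the cache.
row = [-1.0, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_row(row, bits=8)
restored = dequantize_row(codes, scale, zero_point)

# Write-back path: a modified cached row is re-quantized on eviction,
# which recomputes its scale and zero-point from the updated values.
updated = [x + 0.25 for x in restored]
codes2, scale2, zero_point2 = quantize_row(updated, bits=8)
```

Calling `quantize_row(row, bits=4)` gives the INT4 variant with codes in the range 0 to 15. Packing two such codes per byte for storage is omitted here.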

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
