Inventor(s)

Abstract

Techniques are disclosed for tiled attention computation using reduced-precision quantization with fixed scaling and sigma-delta error feedback. A fixed quantization scale is determined from model configuration parameters and a bound on activation magnitude, so attention-score tiles can be quantized without a runtime maximum-magnitude reduction for each tile. During the tile loop, an error accumulator with the same shape as an attention-score tile is maintained. For each tile, a corrected tile is formed by adding the accumulator to the tile, the corrected tile is quantized to a reduced-precision floating-point format such as FP8 using the fixed scale, the quantized tile is dequantized, and the accumulator is updated to the difference between the corrected and dequantized tiles. The output is accumulated using a low-precision matrix multiply between the quantized tile and the corresponding value tile, with the scale applied to the product. The approach bounds error growth across tiles while reducing quantization overhead in GPU kernels.
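The per-tile loop described in the abstract can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the disclosed kernel: FP8 is emulated in NumPy as E4M3 (3 mantissa bits, maximum 448), the fixed scale is assumed to be the activation bound divided by the FP8 maximum, and the names `fixed_scale`, `fake_fp8_e4m3`, and `sigma_delta_attention` are hypothetical.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def fixed_scale(activation_bound):
    """Assumed rule: map the known magnitude bound onto the FP8 range."""
    return activation_bound / FP8_E4M3_MAX


def fake_fp8_e4m3(x):
    """Emulate round-to-nearest E4M3 on an already-scaled array."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # saturate instead of overflow
    mag = np.abs(x)
    # Binade exponent per element; values below 2**-6 share the subnormal step.
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** -6)))
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    return np.round(x / step) * step


def sigma_delta_attention(score_tiles, value_tiles, scale):
    """Accumulate sum_j Q(S_j) @ V_j with error feedback between tiles."""
    err = np.zeros_like(score_tiles[0], dtype=np.float32)
    rows, cols = score_tiles[0].shape[0], value_tiles[0].shape[1]
    out = np.zeros((rows, cols), dtype=np.float32)
    for s, v in zip(score_tiles, value_tiles):
        corrected = s.astype(np.float32) + err   # add carried error
        q = fake_fp8_e4m3(corrected / scale)     # quantize with the fixed scale
        deq = q * scale                          # dequantize
        err = corrected - deq                    # carry residual to next tile
        # Stand-in for the low-precision matmul; scale applied to the product.
        out += (q @ v.astype(np.float32)) * scale
    return out, err
```

Because each tile's residual is carried into the next tile, the dequantized tiles sum to the original tiles minus only the final residual. In this emulation the residual magnitude stays at or below 16 times the scale (half of the largest E4M3 step) as long as the input tiles respect the assumed bound. No per-tile maximum is computed anywhere in the loop.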

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
