Inventor(s)

N/A

Abstract

This publication describes a method for accelerating All-Gather and Reduce-Scatter collective operations in large language models (LLMs) by introducing a new 12-bit floating-point (fp12) datatype, e4m7 (1 sign bit, 4 exponent bits, 7 mantissa bits). The approach aims to reduce network communication overhead by quantizing the tensors exchanged in these collectives, shrinking the network payload while preserving model quality. The document details the problem of communication bandwidth limitations in distributed LLM training and serving, surveys related work, and presents experimental results demonstrating the efficacy of the fp12 (e4m7) datatype across various model-parallelism configurations.
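The disclosure does not include an implementation, but a minimal sketch may help make the e4m7 encoding concrete. The following Python/NumPy snippet rounds float32 values to the nearest value representable in a hypothetical e4m7 format; the function name, the exponent bias of 7, and the saturation value are illustrative assumptions, not details taken from the publication.

```python
import numpy as np

# Hypothetical e4m7 parameters: 1 sign bit, 4 exponent bits, 7 mantissa bits.
# The bias (7) and the largest finite value below are assumptions; the
# disclosure does not specify the exact encoding conventions.
E4M7_MANT_BITS = 7
FP32_MANT_BITS = 23
E4M7_MAX = np.float32(2.0**8 * (2.0 - 2.0**-E4M7_MANT_BITS))


def quantize_fp32_to_e4m7(x: np.ndarray) -> np.ndarray:
    """Round float32 values to the nearest e4m7-representable value,
    returned as float32 for inspection. Subnormal handling is omitted
    for brevity."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    shift = FP32_MANT_BITS - E4M7_MANT_BITS  # 16 mantissa bits to drop
    # Round-to-nearest-even: add half an ulp (minus 1) plus the kept LSB
    # to break ties toward even, then clear the dropped bits.
    lsb = (bits >> shift) & np.uint32(1)
    rounded = (bits + lsb + np.uint32((1 << (shift - 1)) - 1)) & ~np.uint32(
        (1 << shift) - 1
    )
    y = rounded.view(np.float32)
    # Saturate to the assumed e4m7 dynamic range (exponent bias 7).
    return np.clip(y, -E4M7_MAX, E4M7_MAX)


# Example: quantize a small tensor as it might be before an All-Gather.
x = np.array([0.1234, -3.75, 1000.0], dtype=np.float32)
print(quantize_fp32_to_e4m7(x))
```

In terms of payload, sending 12 bits per element instead of 16 (bf16) reduces per-element collective traffic by 16/12 ≈ 1.33x, and instead of 32 (fp32) by 32/12 ≈ 2.67x; a real implementation would additionally need to bit-pack the 12-bit values across byte boundaries before transmission.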

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
