Abstract
This publication describes a method for accelerating All-Gather and Reduce-Scatter collective operations in large language models (LLMs) by introducing a new 12-bit floating-point (fp12) datatype, specifically e4m7. The approach reduces network communication overhead by quantizing the tensors exchanged in these collectives, shrinking the network payload while maintaining model quality. The document details the problem of communication bandwidth limitations in distributed LLM training and serving, reviews related work, and presents experimental results demonstrating the efficacy of the fp12 (e4m7) datatype across various model parallelism configurations.
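As context for the datatype name, e4m7 denotes one sign bit, four exponent bits, and seven mantissa bits (12 bits total). A minimal sketch of how such a quantizer might be simulated is shown below; the exponent bias of 7, the round-to-nearest rounding, and the function name quantize_e4m7 are illustrative assumptions rather than details taken from the disclosure.

import numpy as np

# Sketch of e4m7 quantization: 1 sign bit, 4 exponent bits, 7 mantissa bits.
# The exponent bias of 7 and round-to-nearest rounding are assumptions.
E_BITS, M_BITS, BIAS = 4, 7, 7
MAX_NORMAL = (2.0 - 2.0 ** -M_BITS) * 2.0 ** ((2 ** E_BITS - 2) - BIAS)  # largest finite e4m7 value

def quantize_e4m7(x):
    """Round a float32 array to the nearest value representable in e4m7 (kept in float32)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -MAX_NORMAL, MAX_NORMAL)
    _, exp = np.frexp(x)                      # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, 2 - BIAS)           # below the smallest normal, use the subnormal spacing
    ulp = 2.0 ** (exp - (M_BITS + 1))         # spacing of representable e4m7 values at this magnitude
    return (np.round(x / ulp) * ulp).astype(np.float32)

# Example: quantize a local shard before an All-Gather or Reduce-Scatter to shrink the payload.
shard = np.random.randn(8).astype(np.float32)
print(quantize_e4m7(shard))

In a communication pipeline, the sender would additionally pack the quantized values into 12-bit fields before issuing the collective, and the receiver would expand them back to the compute datatype; the sketch above only models the numerical rounding, not the bit packing.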
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A, "Optimized Collective Operations for Distributed Machine Learning", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9557