Abstract
This publication describes a method for accelerating All-Gather and Reduce-Scatter collective operations in large language models (LLMs) by introducing a new 12-bit floating-point (fp12) datatype, specifically e4m7. The approach reduces network communication overhead by quantizing the tensors exchanged in these collectives, shrinking the network payload while maintaining model quality. The document details the problem of communication bandwidth limitations in distributed LLM training and serving, reviews related work, and presents experimental results demonstrating the efficacy of the fp12 (e4m7) datatype across various model parallelism configurations.
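As context for the datatype name, e4m7 denotes one sign bit, four exponent bits, and seven mantissa bits (12 bits total). A minimal sketch of how such a quantizer might be simulated is shown below; the exponent bias of 7, the round-to-nearest rounding, and the function name quantize_e4m7 are illustrative assumptions rather than details taken from the disclosure.

import numpy as np

# Sketch of e4m7 quantization: 1 sign bit, 4 exponent bits, 7 mantissa bits.
# The exponent bias of 7 and round-to-nearest rounding are assumptions.
E_BITS, M_BITS, BIAS = 4, 7, 7
MAX_NORMAL = (2.0 - 2.0 ** -M_BITS) * 2.0 ** ((2 ** E_BITS - 2) - BIAS)  # largest finite e4m7 value

def quantize_e4m7(x):
    """Round a float32 array to the nearest value representable in e4m7 (kept in float32)."""
    x = np.clip(np.asarray(x, dtype=np.float32), -MAX_NORMAL, MAX_NORMAL)
    _, exp = np.frexp(x)                      # x = m * 2**exp with 0.5 <= |m| < 1
    exp = np.maximum(exp, 2 - BIAS)           # below the smallest normal, use the subnormal spacing
    ulp = 2.0 ** (exp - (M_BITS + 1))         # spacing of representable e4m7 values at this magnitude
    return (np.round(x / ulp) * ulp).astype(np.float32)

# Example: quantize a local shard before an All-Gather or Reduce-Scatter to shrink the payload.
shard = np.random.randn(8).astype(np.float32)
print(quantize_e4m7(shard))

In a communication pipeline, the sender would additionally pack the quantized values into 12-bit fields before issuing the collective, and the receiver would expand them back to the compute datatype; the sketch above only models the numerical rounding, not the bit packing.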
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A, "Optimized Collective Operations for Distributed Machine Learning", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9557