Abstract
The technology described in this paper relates to parallel processing in machine learning models, particularly to speeding up expert parallelism in mixture-of-experts (MoE) architectures. A technique is proposed for replacing traditional all-to-all network collective communications with all-gather and reduce-scatter operations. For expert dispatch, accelerators can perform routing and top-k calculations locally by using an all-gather operation and discarding unassigned tokens. For expert collect, a reduce-scatter operation can combine expert output activations over the network. These collective communication strategies can decrease network payload sizes and improve bandwidth utilization, particularly as the number of experts activated per token increases. Training and inference for large language models (LLMs) can thus be optimized by improving the speed of expert parallelism (EP). By streamlining how data is exchanged across multiple chips, this technique significantly increases the processing speed of artificial intelligence (AI) products. In addition to the improvement in speed, the change to all-gather and reduce-scatter operations addresses the high computational costs that occur when MoE models activate multiple experts per token.
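The dispatch/collect pattern summarized above can be illustrated with a minimal single-host sketch. This is not the disclosure's implementation: it emulates the collectives with NumPy array operations on a list of per-device shards, assumes one expert per device, and uses a stand-in `expert_fn` and random router logits purely for illustration. The key ideas shown are that an all-gather lets every device route all tokens locally and discard those not assigned to its expert, and that a reduce-scatter sums the per-expert partial outputs while returning each device only its own token shard.

```python
import numpy as np

# Hypothetical single-host emulation of the dispatch/collect pattern.
# Assumptions: one expert per device, random router logits, a toy expert_fn.
num_devices = 4
tokens_per_device = 2
d_model = 8
top_k = 2
rng = np.random.default_rng(0)

# Each "device" starts with its own shard of tokens.
shards = [rng.normal(size=(tokens_per_device, d_model)) for _ in range(num_devices)]

# --- Expert dispatch via all-gather ---
# Every device receives ALL tokens (the all-gather), then locally runs
# routing and top-k selection, keeping only tokens assigned to its expert.
all_tokens = np.concatenate(shards, axis=0)           # all-gather result
router_logits = rng.normal(size=(all_tokens.shape[0], num_devices))
topk = np.argsort(router_logits, axis=1)[:, -top_k:]  # top-k experts per token

def expert_fn(expert_id, x):
    # Stand-in expert: a fixed per-expert scale (assumption for illustration).
    return x * (expert_id + 1)

# Each device computes its expert's output for its assigned tokens only;
# unassigned tokens are discarded (contribute zeros).
partial_outputs = []
for e in range(num_devices):
    mask = (topk == e).any(axis=1)[:, None]           # tokens routed to expert e
    partial_outputs.append(np.where(mask, expert_fn(e, all_tokens), 0.0))

# --- Expert collect via reduce-scatter ---
# Sum the per-expert partial outputs across devices (reduce), then hand each
# device back only its own slice of tokens (scatter).
reduced = np.sum(partial_outputs, axis=0)
collected = np.split(reduced, num_devices, axis=0)    # reduce-scatter result
```

In a real multi-chip deployment the `np.concatenate` and `np.sum`/`np.split` steps would be the framework's all-gather and reduce-scatter collectives; the payload advantage the disclosure describes comes from avoiding the per-expert token shuffling of an all-to-all, which grows with the number of experts activated per token.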
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A, "Speeding Up Expert Parallelism By Replacing All-To-All By All-Gather And Reduce-Scatter Collectives", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9706