Inventor(s)

N/A

Abstract

A system and method for software-defined runtime dynamic parallelism in large-scale machine learning systems, such as Large Language Models (LLMs) and Mixture of Experts (MoE) models. The disclosed technique addresses bandwidth bottlenecks in static sharding strategies by dynamically remapping parallelism axes between the attention complex and the Feed Forward Network (FFN) complex. Specifically, axes dedicated to Model Parallelism (MP) in the attention block are switched to Sequence Parallelism (SP) or Expert Parallelism (EP) in the FFN block using existing collective operations, such as reduce-scatter and all-gather. This dynamic reconfiguration optimizes the utilization of Inter-Chip Interconnect (ICI) bandwidth and reduces the number of required collective operations, thereby improving training efficiency for models with large sequence lengths or high sparsity.

Keywords: Large Language Model (LLM), Distributed Training, Dynamic Parallelism, Tensor Sharding, Mixture of Experts (MoE), Sequence Parallelism, Model Parallelism, Inter-Chip Interconnect (ICI), Collective Operations.
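The remapping described in the abstract can be illustrated with a short, hypothetical JAX sketch. It assumes a one-dimensional device mesh with a single axis named "mp", a Megatron-style layer in which attention heads are sharded over that axis, and illustrative shapes and weight names (wq, wo, w1, w2); it is a minimal sketch under those assumptions, not the disclosed implementation. A reduce-scatter (jax.lax.psum_scatter) converts the attention block's MP axis into an SP axis for the FFN, and an all-gather restores the full sequence for the next attention block; in an MoE layer the freed axis could analogously carry EP.

# Hypothetical sketch: remap the attention block's model-parallel (MP) axis into a
# sequence-parallel (SP) axis for the FFN via reduce-scatter, then recover the full
# sequence with all-gather. Names, shapes, and the "mp" mesh axis are assumptions.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

n = jax.device_count()
mesh = Mesh(mesh_utils.create_device_mesh((n,)), axis_names=("mp",))

def layer_body(h_sp, wq, wo, w1, w2):
    # h_sp: sequence-sharded (SP) activations from the previous FFN, local shape [seq/n, d_model].
    # All-gather along "mp" so this chip's local attention heads see the full sequence (SP -> MP).
    h_full = jax.lax.all_gather(h_sp, "mp", axis=0, tiled=True)           # [seq, d_model]
    # Attention proper is elided; the head projection (wq, column-sharded over "mp") and the
    # output projection (wo, row-sharded over "mp") leave each chip with a partial sum.
    ctx = h_full @ wq                                                     # [seq, d_heads/n]
    attn_partial = ctx @ wo                                               # [seq, d_model], partial over "mp"
    # Reduce-scatter along "mp": sum the partials and keep a 1/n slice of the sequence axis,
    # i.e. the MP axis of the attention block becomes an SP axis for the FFN (MP -> SP).
    z = jax.lax.psum_scatter(attn_partial, "mp", scatter_dimension=0, tiled=True)  # [seq/n, d_model]
    # The FFN runs on sequence-sharded activations with replicated weights; in an MoE layer
    # the same freed axis could instead carry expert parallelism (EP).
    return jax.nn.gelu(z @ w1) @ w2                                       # [seq/n, d_model]

seq, d_model, d_heads, d_ff = 8 * n, 16, 4 * n, 32
key = jax.random.PRNGKey(0)
h, wq, wo, w1, w2 = (jax.random.normal(key, s) for s in
                     [(seq, d_model), (d_model, d_heads), (d_heads, d_model),
                      (d_model, d_ff), (d_ff, d_model)])

layer = shard_map(
    layer_body, mesh=mesh,
    in_specs=(P("mp", None), P(None, "mp"), P("mp", None), P(None, None), P(None, None)),
    out_specs=P("mp", None),       # output stays sequence-sharded for the next layer
    check_rep=False)               # skip the replication checker in this simplified sketch
print(layer(h, wq, wo, w1, w2).shape)   # (seq, d_model), laid out sequence-sharded over "mp"

In this sketch the per-token MP collectives of a static layout are replaced by one all-gather and one reduce-scatter per layer, which is the bandwidth saving the disclosure attributes to switching the same physical ICI axis between MP and SP/EP roles.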

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
