Inventor(s)

N/A

Abstract

A system and method for software-defined runtime dynamic parallelism in large-scale machine learning systems, such as Large Language Models (LLMs) and Mixture of Experts (MoE) models. The disclosed technique addresses bandwidth bottlenecks in static sharding strategies by dynamically remapping parallelism axes between the attention complex and the Feed Forward Network (FFN) complex. Specifically, axes dedicated to Model Parallelism (MP) in the attention block are switched to Sequence Parallelism (SP) or Expert Parallelism (EP) in the FFN block using existing collective operations, such as reduce-scatter and all-gather. This dynamic reconfiguration optimizes the utilization of Inter-Chip Interconnect (ICI) bandwidth and reduces the number of required collective operations, thereby improving training efficiency for models with large sequence lengths or high sparsity.

Keywords: Large Language Model (LLM), Distributed Training, Dynamic Parallelism, Tensor Sharding, Mixture of Experts (MoE), Sequence Parallelism, Model Parallelism, Inter-Chip Interconnect (ICI), Collective Operations.
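The remapping described in the abstract can be illustrated with a short, hypothetical JAX sketch. It assumes a one-dimensional device mesh with a single axis named "mp", a Megatron-style layer in which attention heads are sharded over that axis, and illustrative shapes and weight names (wq, wo, w1, w2); it is a minimal sketch under those assumptions, not the disclosed implementation. A reduce-scatter (jax.lax.psum_scatter) converts the attention block's MP axis into an SP axis for the FFN, and an all-gather restores the full sequence for the next attention block; in an MoE layer the freed axis could analogously carry EP.

# Hypothetical sketch: remap the attention block's model-parallel (MP) axis into a
# sequence-parallel (SP) axis for the FFN via reduce-scatter, then recover the full
# sequence with all-gather. Names, shapes, and the "mp" mesh axis are assumptions.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.experimental.shard_map import shard_map
from jax.sharding import Mesh, PartitionSpec as P

n = jax.device_count()
mesh = Mesh(mesh_utils.create_device_mesh((n,)), axis_names=("mp",))

def layer_body(h_sp, wq, wo, w1, w2):
    # h_sp: sequence-sharded (SP) activations from the previous FFN, local shape [seq/n, d_model].
    # All-gather along "mp" so this chip's local attention heads see the full sequence (SP -> MP).
    h_full = jax.lax.all_gather(h_sp, "mp", axis=0, tiled=True)           # [seq, d_model]
    # Attention proper is elided; the head projection (wq, column-sharded over "mp") and the
    # output projection (wo, row-sharded over "mp") leave each chip with a partial sum.
    ctx = h_full @ wq                                                     # [seq, d_heads/n]
    attn_partial = ctx @ wo                                               # [seq, d_model], partial over "mp"
    # Reduce-scatter along "mp": sum the partials and keep a 1/n slice of the sequence axis,
    # i.e. the MP axis of the attention block becomes an SP axis for the FFN (MP -> SP).
    z = jax.lax.psum_scatter(attn_partial, "mp", scatter_dimension=0, tiled=True)  # [seq/n, d_model]
    # The FFN runs on sequence-sharded activations with replicated weights; in an MoE layer
    # the same freed axis could instead carry expert parallelism (EP).
    return jax.nn.gelu(z @ w1) @ w2                                       # [seq/n, d_model]

seq, d_model, d_heads, d_ff = 8 * n, 16, 4 * n, 32
key = jax.random.PRNGKey(0)
h, wq, wo, w1, w2 = (jax.random.normal(key, s) for s in
                     [(seq, d_model), (d_model, d_heads), (d_heads, d_model),
                      (d_model, d_ff), (d_ff, d_model)])

layer = shard_map(
    layer_body, mesh=mesh,
    in_specs=(P("mp", None), P(None, "mp"), P("mp", None), P(None, None), P(None, None)),
    out_specs=P("mp", None),       # output stays sequence-sharded for the next layer
    check_rep=False)               # skip the replication checker in this simplified sketch
print(layer(h, wq, wo, w1, w2).shape)   # (seq, d_model), laid out sequence-sharded over "mp"

In this sketch the per-token MP collectives of a static layout are replaced by one all-gather and one reduce-scatter per layer, which is the bandwidth saving the disclosure attributes to switching the same physical ICI axis between MP and SP/EP roles.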

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
