Abstract
A mixture of experts (MoE) is an approach to building machine learning models (including large language models) that improves efficiency and scalability by activating only a subset of the model parameters during each inference step, where the activated subset acts as an expert in the domain of the input query. This disclosure describes techniques for efficient, dynamic specialization of LLMs based on a macro-MoE architecture. A plurality of foundational, pre-trained LLMs (‘macro-experts’), a library of distinct low-rank adaptation (LoRA) modules, and an adaptive gating network are provided. Upon receiving an input prompt, the gating network analyzes the prompt and selects an optimal pair comprising a foundational macro-expert and a task-specific LoRA module from the library. During inference, a dynamic application engine applies the selected LoRA module to the chosen expert, creating a temporary, highly specialized model to handle the request. Combining generalist models with specialist adapters in this way enables fine-grained customization, such that a single system can exhibit expert-level performance across multiple domains without the computational and storage costs of maintaining numerous fully fine-tuned models.
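The routing-plus-dynamic-adaptation flow can be illustrated with a minimal sketch. All names below (MacroExpert, LoRAModule, gate, apply_lora) are illustrative assumptions rather than the disclosure's implementation: a toy keyword router stands in for the learned gating network, each macro-expert is reduced to a single frozen linear layer, and dynamic specialization is shown as the standard LoRA merge W' = W + B·A applied at request time.

```python
# Minimal sketch of macro-MoE routing with dynamic LoRA application.
# All class/function names are hypothetical, not the disclosure's actual code.
import numpy as np

class MacroExpert:
    """Stand-in for a foundational pre-trained LLM: one frozen linear layer."""
    def __init__(self, name: str, d: int, seed: int):
        rng = np.random.default_rng(seed)
        self.name = name
        self.W = rng.normal(scale=0.02, size=(d, d))  # frozen base weights

class LoRAModule:
    """Low-rank adapter: delta_W = B @ A with rank r << d."""
    def __init__(self, task: str, d: int, r: int, seed: int):
        rng = np.random.default_rng(seed)
        self.task = task
        self.A = rng.normal(scale=0.02, size=(r, d))
        self.B = np.zeros((d, r))  # standard LoRA init: B starts at zero

def gate(prompt: str, experts, adapters):
    """Toy gating network: keyword matching stands in for a learned router
    that scores (macro-expert, LoRA) pairs from the prompt."""
    expert = experts["code"] if "code" in prompt else experts["general"]
    adapter = next((a for a in adapters if a.task in prompt), adapters[0])
    return expert, adapter

def apply_lora(expert: MacroExpert, lora: LoRAModule, alpha: float = 1.0):
    """Dynamic application engine: temporarily specialize the chosen expert
    by merging the low-rank update, W' = W + alpha * (B @ A)."""
    return expert.W + alpha * (lora.B @ lora.A)

d, r = 64, 4
experts = {
    "general": MacroExpert("general", d, seed=0),
    "code": MacroExpert("code", d, seed=1),
}
adapters = [LoRAModule("summarization", d, r, seed=2),
            LoRAModule("code", d, r, seed=3)]

prompt = "Please review this code snippet"
expert, adapter = gate(prompt, experts, adapters)
W_specialized = apply_lora(expert, adapter)   # temporary specialized weights
x = np.random.default_rng(4).normal(size=d)
y = W_specialized @ x                         # inference with the specialized model
print(f"Routed to expert '{expert.name}' with LoRA adapter '{adapter.task}'")
```

Because only the frozen base weights and the small B·A factors are stored, the sketch reflects the disclosure's premise that many specializations can be served without keeping a fully fine-tuned copy of each model.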
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
NA, "On-the-Fly Adaptation of MacroMoE LLMs Using Task-Specific Low-Rank Adapters", Technical Disclosure Commons, (August 18, 2025)
https://www.tdcommons.org/dpubs_series/8469