

Scaling language models for inference can be difficult due to their computational resource requirements. LLM deployments typically include computational resources for the LLM itself as well as additional resources that run separate hand-tuned or learned models for resource allocation. The overall complexity of resource configuration for server-based LLM deployments thus spans both the LLM and these separate models. This disclosure describes the use of a language model to perform its own computational resource management. Per the techniques, resource management metadata is provided to the language model as an additional input along with the incoming query. By combining the user query and resource-availability metadata into a single prompt, the described techniques leverage the predictive power of the LLM to perform computational resource allocation, instead of doing resource allocation via a separate model. The LLM can effectively manage its own resources, and there is no need to train or maintain separate models.
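The combining of the user query with resource-availability metadata into a single prompt can be sketched as follows. This is a minimal illustrative sketch, not an implementation from the disclosure: the metadata fields (`free_gpu_mem_gb`, `max_batch_size`, `queue_depth`), the prompt wording, and the convention that the model emits its allocation decision as a JSON line before its answer are all hypothetical assumptions, and the model output is simulated rather than produced by a real LLM.

```python
import json


def build_prompt(user_query, resource_metadata):
    """Combine the incoming user query with resource-availability
    metadata into a single prompt (hypothetical format)."""
    return (
        "Available resources:\n"
        + json.dumps(resource_metadata, indent=2)
        + "\n\nBefore answering, emit one JSON line choosing "
        "batch_size and max_tokens within the limits above.\n\n"
        "User query: " + user_query
    )


def parse_allocation(model_output):
    """Split the model's output into its allocation decision
    (first line, per the hypothetical convention) and its answer."""
    first_line, _, answer = model_output.partition("\n")
    return json.loads(first_line), answer.strip()


prompt = build_prompt(
    "Summarize this document.",
    {"free_gpu_mem_gb": 12, "max_batch_size": 8, "queue_depth": 3},
)

# Simulated model output for illustration only; in the described
# techniques this text would come from the LLM itself, which both
# allocates resources and answers the query in one pass.
simulated = '{"batch_size": 4, "max_tokens": 256}\nHere is the summary...'
allocation, answer = parse_allocation(simulated)
```

The serving layer would then apply the parsed allocation (e.g., setting the batch size for the request) before or while the answer is streamed back, with no separate allocation model in the loop.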

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.