D Shin


Server-based large language model (LLM) interfaces that support a large user base require efficient allocation of computational resources to deliver responses in a timely manner. First-in-first-out (FIFO) allocation of queries to the LLM can result in unpredictable and/or long median wait times, depending on the query arrival rate and query processing time. This disclosure describes techniques that improve resource allocation for a large language model (LLM) by detecting a contextual pause and enabling next-query processing during the pause. A transformer design and allocation scheme is presented in which the transformer decoder is retrained with a contextual pause token that can be emitted at the output layer autoregressively. The contextual pause token marks and splits a large paragraph into chunks that have contextual consistency. The token is used to dynamically adjust inference prioritization, favoring users who have not yet received any response over users who have reached an early contextual pause token and can take time to digest the response information received so far. The described techniques enable shorter average wait times without degrading the user experience.
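The prioritization aspect of the scheme can be sketched as a priority queue in which pending queries from users who have received no output yet outrank continuation requests from users whose last chunk ended at a contextual pause token. This is a minimal illustrative sketch, not the disclosure's implementation; the class, priority constants, and method names are hypothetical.

```python
import heapq
import itertools

# Hypothetical priority levels (lower value = served first): users still
# awaiting any response outrank users paused at a contextual pause token.
PRIO_NO_RESPONSE = 0   # user has not received any output yet
PRIO_AT_PAUSE = 1      # user's last chunk ended at a pause token

class PauseAwareScheduler:
    """Sketch of pause-token-aware inference scheduling."""

    def __init__(self):
        self._heap = []
        # Monotonic counter preserves FIFO order within a priority level.
        self._counter = itertools.count()

    def enqueue(self, query, at_pause):
        """Queue a query; at_pause=True if the user just reached a pause token."""
        prio = PRIO_AT_PAUSE if at_pause else PRIO_NO_RESPONSE
        heapq.heappush(self._heap, (prio, next(self._counter), query))

    def next_query(self):
        """Return the highest-priority pending query, or None if idle."""
        if not self._heap:
            return None
        _, _, query = heapq.heappop(self._heap)
        return query

sched = PauseAwareScheduler()
sched.enqueue("user A: continue after pause", at_pause=True)
sched.enqueue("user B: first query", at_pause=False)
print(sched.next_query())  # user B is served first despite arriving later
```

In this sketch, a user who reaches a pause token yields the LLM's compute to users still waiting for their first response, which is the mechanism the disclosure uses to reduce average wait time.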

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.