Abstract
In high-traffic scenarios, the total query-to-response latency for a conversational agent (chatbot) can reach several seconds. The total latency includes the time to send a query to a server and the time for the server to generate and return a response. High latency makes the conversational experience suboptimal, since the user perceives it as a half-duplex conversation: the user issues a query and must wait for a response before providing a follow-up. This disclosure describes techniques that enable low-latency conversations with a conversational agent backed by a large language model (LLM) by including a multilayer autoregressive completion network within the LLM. The multiple layers of the completion network are leveraged to split the attention layers into sections that can be processed on the device, on low-bandwidth servers, on high-bandwidth servers, etc., providing rapid intermediate responses while a more complete response is computed in the background.
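The staged split described above can be sketched in code. The sketch below is illustrative only: the split points, layer counts, model dimensions, and per-stage early-exit heads are assumptions for exposition, not specifics from the disclosure. It shows a decoder stack divided into contiguous sections, each of which could be hosted on a different tier (on-device, low-bandwidth server, high-bandwidth server), with a language-model head at each section boundary so that an early section can emit a rapid intermediate response while deeper sections continue refining it.

```python
# Minimal sketch of a multistage autoregressive completion network.
# Assumptions (not from the disclosure): split points (2, 6, 12),
# d_model=512, 8 attention heads, vocab=32000, linear LM head per stage.
import torch
import torch.nn as nn

class StagedCompletionNetwork(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_heads=8,
                 split_points=(2, 6, 12)):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.sections = nn.ModuleList()
        prev = 0
        for end in split_points:
            # A contiguous slice of attention layers; each slice could be
            # placed on a different tier (device / low-bandwidth server /
            # high-bandwidth server).
            self.sections.append(nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(end - prev)
            ]))
            prev = end
        # One early-exit LM head per section boundary: the first head
        # yields a fast intermediate response, the last a complete one.
        self.exit_heads = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in split_points)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        causal = torch.triu(  # standard causal attention mask
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.embed(tokens)
        stage_logits = []
        for section, head in zip(self.sections, self.exit_heads):
            for layer in section:
                h = layer(h, src_mask=causal)
            stage_logits.append(head(h))  # prediction at this stage
        return stage_logits

model = StagedCompletionNetwork()
query = torch.randint(0, 32000, (1, 8))     # toy token ids
logits = model(query)
draft_token = logits[0][:, -1].argmax(-1)   # fast on-device draft
final_token = logits[-1][:, -1].argmax(-1)  # refined background result
```

In an actual deployment, the first section's head would serve the draft reply from the device while the remaining sections run remotely and stream a refined response; here all sections run locally purely to illustrate the per-stage early exits.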
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Low Latency Conversational Agent via Multistage Autoregressive Completion Networks", Technical Disclosure Commons (August 22, 2024).
https://www.tdcommons.org/dpubs_series/7301