Abstract
In high-traffic scenarios, the total query-to-response latency for a conversational agent (chatbot) can reach several seconds. The total latency includes the time to send a query to a server and the time for the server to generate and return a response. High latency makes the conversational experience suboptimal, since the user perceives it as a half-duplex conversation: the user issues a query and must wait for a response before providing a follow-up. This disclosure describes techniques that enable low-latency conversations with a conversational agent backed by a large language model (LLM) by including a multilayer autoregressive completion network within the LLM. The multiple layers of the completion network are leveraged to split the attention layers into sections that can be processed on the device, on low-bandwidth servers, on high-bandwidth servers, etc., providing rapid intermediate responses while a more complete response is computed in the background.
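The staged split described above can be sketched in code. The sketch below is illustrative only: the split points, layer counts, model dimensions, and per-stage early-exit heads are assumptions for exposition, not specifics from the disclosure. It shows a decoder stack divided into contiguous sections, each of which could be hosted on a different tier (on-device, low-bandwidth server, high-bandwidth server), with a language-model head at each section boundary so that an early section can emit a rapid intermediate response while deeper sections continue refining it.

```python
# Minimal sketch of a multistage autoregressive completion network.
# Assumptions (not from the disclosure): split points (2, 6, 12),
# d_model=512, 8 attention heads, vocab=32000, linear LM head per stage.
import torch
import torch.nn as nn

class StagedCompletionNetwork(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_heads=8,
                 split_points=(2, 6, 12)):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.sections = nn.ModuleList()
        prev = 0
        for end in split_points:
            # A contiguous slice of attention layers; each slice could be
            # placed on a different tier (device / low-bandwidth server /
            # high-bandwidth server).
            self.sections.append(nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(end - prev)
            ]))
            prev = end
        # One early-exit LM head per section boundary: the first head
        # yields a fast intermediate response, the last a complete one.
        self.exit_heads = nn.ModuleList(
            nn.Linear(d_model, vocab) for _ in split_points)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        causal = torch.triu(  # standard causal attention mask
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.embed(tokens)
        stage_logits = []
        for section, head in zip(self.sections, self.exit_heads):
            for layer in section:
                h = layer(h, src_mask=causal)
            stage_logits.append(head(h))  # prediction at this stage
        return stage_logits

model = StagedCompletionNetwork()
query = torch.randint(0, 32000, (1, 8))     # toy token ids
logits = model(query)
draft_token = logits[0][:, -1].argmax(-1)   # fast on-device draft
final_token = logits[-1][:, -1].argmax(-1)  # refined background result
```

In an actual deployment, the first section's head would serve the draft reply from the device while the remaining sections run remotely and stream a refined response; here all sections run locally purely to illustrate the per-stage early exits.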
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Low Latency Conversational Agent via Multistage Autoregressive Completion Networks", Technical Disclosure Commons (August 22, 2024).
https://www.tdcommons.org/dpubs_series/7301