Abstract

Conversational interfaces enable users to interact with a virtual assistant, chatbot, or other software via spoken audio. In a cascaded conversational system architecture, an automatic speech recognition (ASR) model transcribes a user's spoken query to text, a large language model (LLM) generates the text of a response, and a text-to-speech (TTS) model generates response audio from the response text. This configuration suffers from high latency because the LLM must wait for the ASR model to produce a transcription before the response can be generated, and producing a high-quality, full-context transcription can take the ASR model considerable time. Per techniques of this disclosure, in addition to generating the response based on an initial transcription obtained from a first pass of the ASR model, the LLM is tasked with dynamically determining whether to use that initial transcription or to wait for a more accurate subsequent pass of the model. If the LLM determines that the initial transcription is irrelevant to the ongoing conversation or contains misrecognitions, no response is generated based on the initial transcription; instead, the LLM waits for the more accurate second-pass transcription. Conversely, if the first-pass transcription is accurate and relevant, the second pass is skipped (or stopped, if already initiated).
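
The decision flow can be illustrated with a minimal sketch. The function names below (fast_asr_pass, full_asr_pass, llm_judge_transcript, llm_generate_response) are hypothetical placeholders standing in for the ASR passes and LLM calls described above, not APIs of any particular library or the disclosed implementation.

```python
# Sketch of the two-pass transcription decision flow (illustrative only).
# All helpers are hypothetical stubs; a real system would call actual
# ASR, LLM, and TTS services.
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=1)


def fast_asr_pass(audio: bytes) -> str:
    """Low-latency first ASR pass (stubbed)."""
    return "whats the weather in paris"


def full_asr_pass(audio: bytes) -> str:
    """Slower, full-context second ASR pass with higher accuracy (stubbed)."""
    time.sleep(0.5)  # stands in for the extra decoding time
    return "what's the weather in Paris today"


def llm_judge_transcript(transcript: str, history: list[str]) -> bool:
    """Ask the LLM whether the first-pass transcript is accurate and relevant
    enough to respond to immediately (stubbed heuristic for illustration)."""
    return len(transcript.split()) > 2


def llm_generate_response(transcript: str, history: list[str]) -> str:
    """Generate the assistant's reply from the chosen transcript (stubbed)."""
    return f"Responding to: {transcript}"


def handle_user_turn(audio: bytes, history: list[str]) -> str:
    # Start the slower second pass in the background so that, if it is
    # needed, no additional latency is incurred by launching it late.
    second_pass = pool.submit(full_asr_pass, audio)

    # Respond from the fast first pass when the LLM deems it usable.
    first_transcript = fast_asr_pass(audio)
    if llm_judge_transcript(first_transcript, history):
        # Attempt to skip the second pass; if it has already started,
        # its result is simply never awaited.
        second_pass.cancel()
        return llm_generate_response(first_transcript, history)

    # Otherwise wait for the more accurate second-pass transcription.
    return llm_generate_response(second_pass.result(), history)


if __name__ == "__main__":
    print(handle_user_turn(b"...", history=[]))
```

In this sketch the second pass runs concurrently with the first-pass judgment, so the fast path adds no waiting and the slow path pays only the cost of the full-context transcription.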

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
