Dongeek Shin


A framework is proposed for reducing response latency in real-time speech applications, such as virtual assistants. The framework uses a hybrid scheme that pairs a large language model (LLM) with a small sound model (SSM). The SSM timestamps the end of a user's verbal query and notifies the transcription engine that the query has ended. The transcription engine then aborts decoding and sends the transcriptions it has accumulated so far to the LLM for processing, without waiting for its buffer to reach a fixed size or for a timeout to elapse. Cutting this wait in the transcription engine reduces the net response time of real-time, speech-based models.
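The control flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `Transcriber`, `detect_end_of_query`, and `respond` are hypothetical, audio is represented as text tokens for simplicity, and the SSM is stubbed as a silence detector.

```python
# Sketch of the hybrid SSM + LLM endpointing scheme. All names are
# hypothetical stand-ins; "audio" chunks are plain strings for illustration.

from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Transcriber:
    """Streaming transcription engine with a partial-result buffer."""
    buffer: List[str] = field(default_factory=list)

    def feed(self, audio_chunk: str) -> None:
        # A real engine would run ASR on raw audio; here the chunk is
        # already text for illustration.
        self.buffer.append(audio_chunk)

    def abort_and_flush(self) -> str:
        # Abort decoding early and return whatever has been transcribed
        # so far, instead of waiting for a fixed buffer size or timeout.
        text = " ".join(self.buffer)
        self.buffer.clear()
        return text


def detect_end_of_query(audio_chunk: str) -> bool:
    """Stand-in for the small sound model (SSM): flags the chunk that
    marks the end of the user's verbal query (here, trailing silence)."""
    return audio_chunk == "<silence>"


def respond(audio_stream: Iterable[str], llm: Callable[[str], str]) -> str:
    """Hybrid pipeline: the SSM timestamps end-of-query and triggers an
    early flush of the transcription buffer to the LLM."""
    transcriber = Transcriber()
    for chunk in audio_stream:
        if detect_end_of_query(chunk):
            # Query has ended: abort transcription and query the LLM now.
            return llm(transcriber.abort_and_flush())
        transcriber.feed(chunk)
    # Fallback: the stream ended without an explicit endpoint.
    return llm(transcriber.abort_and_flush())


if __name__ == "__main__":
    echo_llm = lambda prompt: f"LLM response to: {prompt}"
    stream = ["what's", "the", "weather", "<silence>", "trailing", "noise"]
    print(respond(stream, echo_llm))
```

Note that once the SSM fires, chunks after the endpoint are never transcribed, which is where the latency saving comes from.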

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.