Inventor(s)

Adhwaith Menon

Abstract

End-of-turn (EoT) detection is used in voice artificial intelligence (AI) solutions to determine when a user has finished speaking. Currently available EoT detectors face issues such as detecting only long pauses and incurring the added overhead of converting audio to text for analysis. To overcome these issues, the system proposed herein uses a pre-trained Audio Spectrogram Transformer (AST) as a prosodic feature extractor, combined with a high-level transformer encoder that analyzes global prosodic structure. With this architecture, the system reliably predicts when a speaker has finished their turn and outperforms traditional silence-based methods.
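To illustrate the limitation the abstract attributes to silence-based methods, the following is a minimal sketch of such a baseline, not of the proposed AST-based system. It assumes per-frame energy values as input; the function name `silence_eot_frame`, the 10 ms frame length, and the default thresholds are hypothetical choices made for illustration only.

```python
from typing import List, Optional


def silence_eot_frame(
    energies: List[float],
    energy_threshold: float = 0.01,   # assumed: frames below this count as silent
    min_silent_frames: int = 50,      # assumed: 50 frames of 10 ms = 500 ms of silence
) -> Optional[int]:
    """Return the frame index at which end of turn is declared, or None.

    End of turn is declared only after `min_silent_frames` consecutive
    frames fall below `energy_threshold`.
    """
    silent_run = 0
    for i, energy in enumerate(energies):
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i
        else:
            # Any non-silent frame resets the count of consecutive silent frames.
            silent_run = 0
    return None


# Speech, a short mid-sentence pause, more speech, then a long final silence.
frames = [0.5] * 20 + [0.0] * 30 + [0.5] * 20 + [0.0] * 60
detected_at = silence_eot_frame(frames)
```

In this sketch the decision always lags the end of speech by the full silence window, and a final pause shorter than that window is never detected. Lowering `min_silent_frames` reduces the lag but causes mid-sentence pauses to be mistaken for the end of a turn, which is the trade-off a prosody-based detector is meant to avoid.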

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
