Abstract

End-of-turn (EoT) detection is used in voice artificial intelligence (AI) solutions to determine when a user has finished talking. Currently available EoT detection systems face issues such as detecting only long pauses and incurring added overhead from converting audio to text for analysis. To overcome these issues, a system is proposed herein that uses a pre-trained Audio Spectrogram Transformer (AST) as a prosodic feature extractor, combined with a high-level transformer encoder that analyzes global prosodic structure. With this architecture, the system reliably predicts when a speaker has finished their turn, outperforming traditional silence-based methods.
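To make the limitation of silence-based detection concrete, the following is a minimal sketch of such a baseline. It is not taken from the disclosure. The function names, the 16 kHz sample rate, the 10 ms frame length, the energy threshold, and the synthetic signal are all assumptions chosen for illustration.

```python
import math
from typing import List, Optional, Sequence

SAMPLE_RATE = 16000  # assumed sample rate in Hz
FRAME_LEN = 160      # assumed frame length: 10 ms at 16 kHz


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """Split the signal into non-overlapping frames and return each frame's RMS energy."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def silence_eot_frame(
    rms: Sequence[float], threshold: float, min_silent_frames: int
) -> Optional[int]:
    """Return the frame index at which end of turn is declared, or None.

    End of turn is declared once speech has been observed and then
    `min_silent_frames` consecutive frames fall below `threshold`.
    """
    heard_speech = False
    silent_run = 0
    for idx, energy in enumerate(rms):
        if energy >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return idx
    return None


def tone(n_frames: int) -> List[float]:
    """Synthetic stand-in for speech: a 200 Hz sine wave with amplitude 0.5."""
    return [
        0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
        for t in range(n_frames * FRAME_LEN)
    ]


def silence(n_frames: int) -> List[float]:
    """Digital silence lasting n_frames frames."""
    return [0.0] * (n_frames * FRAME_LEN)


# Speech (frames 0-9), short mid-turn pause (10-12), speech (13-22), final pause (23-32).
signal = tone(10) + silence(3) + tone(10) + silence(10)
rms = frame_rms(signal, FRAME_LEN)

# A long silence requirement waits 8 silent frames after the speech really ends.
late_detection = silence_eot_frame(rms, threshold=0.05, min_silent_frames=8)

# A short silence requirement fires during the mid-turn pause instead.
early_detection = silence_eot_frame(rms, threshold=0.05, min_silent_frames=3)

print(late_detection, early_detection)
```

With these assumed settings, requiring 8 silent frames declares end of turn at frame 30, well after the speech stops at frame 22, while requiring only 3 silent frames fires at frame 12, in the middle of the turn. A silence-only detector must trade latency against premature cut-offs in this way, which is the gap that prosodic cues are meant to close.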

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Menon, Adhwaith, "PROSODIC BASED TURN DETECTION USING THE AUDIO SPECTROGRAM TRANSFORMER", Technical Disclosure Commons, (October 27, 2025)
https://www.tdcommons.org/dpubs_series/8792


Technical Disclosure Commons

Defensive Publications Series

PROSODIC BASED TURN DETECTION USING THE AUDIO SPECTROGRAM TRANSFORMER
