Abstract
This disclosure describes lightweight techniques to identify segments of an audio or audiovisual stream that include conversations. Per the techniques, a voice activity detector (VAD) isolates super-segments, e.g., segments of relatively long duration. Super-segments are split into smaller segments, which are mapped into an embedding space, and the resulting embeddings are clustered. A chunk of the stream is determined to include a conversation if all of the following hold: (i) its length exceeds a threshold T and the number of segments it includes exceeds a threshold S; (ii) it has at least N major clusters; and (iii) at least M of the major clusters re-occur at least K times.
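The decision rule in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name is_conversation, the default threshold values, and the min_major parameter are hypothetical. The abstract does not define "major cluster" or "re-occur", so the sketch assumes that a cluster is major when it holds at least min_major segments, and that a cluster re-occurs once for each separate contiguous run of its label in time order. The cluster labels are assumed to come from an upstream VAD, embedding, and clustering stage that is not shown.

```python
from collections import Counter
from itertools import groupby


def is_conversation(labels, duration_s, T=30.0, S=6, N=2, M=2, K=2, min_major=2):
    """Apply the T/S/N/M/K rule to one chunk.

    labels: cluster id of each small segment, in time order.
    duration_s: length of the chunk in seconds.
    All default values are illustrative assumptions.
    """
    # Length must exceed T and the segment count must exceed S.
    if duration_s <= T or len(labels) <= S:
        return False

    # Assumption: a cluster is "major" if it holds at least min_major segments.
    counts = Counter(labels)
    major = {c for c, n in counts.items() if n >= min_major}
    if len(major) < N:
        return False

    # Assumption: each contiguous run of a label counts as one occurrence.
    runs = Counter(label for label, _ in groupby(labels))
    recurring = sum(1 for c in major if runs[c] >= K)
    return recurring >= M


# Two speakers alternating for 60 seconds.
print(is_conversation([0, 1, 0, 1, 0, 1, 0, 1], 60.0))
# A single speaker for 60 seconds.
print(is_conversation([0] * 8, 60.0))
```

Under these assumptions, alternating labels satisfy all three conditions, whereas a single-speaker chunk fails the major-cluster count, and two speakers who each talk only once fail the re-occurrence condition.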
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Chen, Tongzhou and Audhkhasi, Kartik, "Identifying Conversation Segments of an Audiovisual Stream", Technical Disclosure Commons, (August 25, 2025)
https://www.tdcommons.org/dpubs_series/8502