Abstract

This disclosure describes lightweight techniques to determine which segments of an audio or audiovisual stream include conversations. Per the techniques, a voice activity detector (VAD) isolates super-segments, that is, segments of relatively long duration. Each super-segment is split into smaller segments, which are mapped into an embedding space, and the resulting embeddings are clustered. A chunk of the stream is determined to include a conversation if all of the following hold: its length exceeds a threshold T and the number of segments it includes exceeds a threshold S; it has at least N major clusters; and at least M of the major clusters re-occur at least K times.
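
The decision rule at the end of the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosure's implementation: the function name `is_conversation`, the default threshold values, and the `major_share` parameter are hypothetical. The abstract does not define "major cluster" or "re-occur", so the sketch assumes a cluster is major when it holds at least a given share of the segments, and that a cluster re-occurs once for each separate run of consecutive segments assigned to it.

```python
from collections import Counter
from itertools import groupby


def is_conversation(labels, duration_s, *, T=30.0, S=6, N=2, M=2, K=2,
                    major_share=0.2):
    """Apply the T/S/N/M/K rule to one chunk.

    labels      -- cluster label of each small segment, in temporal order
    duration_s  -- chunk length in seconds
    major_share -- ASSUMPTION: minimum fraction of segments for a "major" cluster
    """
    # Length must exceed T and the segment count must exceed S.
    if duration_s <= T or len(labels) <= S:
        return False

    # ASSUMPTION: a cluster is "major" if it covers enough of the segments.
    counts = Counter(labels)
    major = {c for c, n in counts.items() if n / len(labels) >= major_share}
    if len(major) < N:
        return False

    # ASSUMPTION: count one occurrence per run of consecutive equal labels.
    runs = Counter(label for label, _ in groupby(labels))
    recurring = sum(1 for c in major if runs[c] >= K)
    return recurring >= M


# Two alternating speakers over 45 seconds pass the rule.
print(is_conversation([0, 0, 1, 1, 0, 1, 0, 0, 1], 45.0))
# A single speaker (one cluster) does not.
print(is_conversation([0] * 9, 45.0))
```

Counting runs rather than raw segment counts is what separates back-and-forth speech from two speakers who each talk once; under a different reading of "re-occur", only the `runs` line would change.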

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
