Abstract

A technique is proposed for audio-synchronized personal media playback in a streaming interface. An audio signal associated with a media item is obtained. Voice activity data reflecting one or more voice characteristics of at least one speaker in the media item is extracted from the audio signal. A voice embedding representing those voice characteristics is generated and associated with at least one voice embedding cluster for the media item. One or more speaker-based playback operations are then performed on the media item based on the at least one voice embedding cluster.
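The clustering and playback steps described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the abstract does not say how embeddings are compared or clustered, so the cosine-similarity measure, the `threshold` value, the running-mean centroid update, and the function names `assign_clusters` and `next_segment_for_speaker` are all assumptions made for illustration. The embeddings are assumed to have been produced already by some voice model, one per voiced segment.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_clusters(segments, threshold=0.8):
    """Assign each voiced segment to a voice embedding cluster.

    segments: list of (start_s, end_s, embedding) tuples (hypothetical layout).
    threshold: assumed minimum similarity for joining an existing cluster.
    Returns one cluster label per segment, in input order.
    """
    centroids = []  # each entry is (centroid_vector, member_count)
    labels = []
    for _start, _end, emb in segments:
        best, best_sim = None, threshold
        for i, (centroid, _count) in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            # No cluster is similar enough: treat this as a new speaker.
            centroids.append((list(emb), 1))
            labels.append(len(centroids) - 1)
        else:
            # Fold the embedding into the running mean of the matched cluster.
            centroid, count = centroids[best]
            updated = [(c * count + e) / (count + 1) for c, e in zip(centroid, emb)]
            centroids[best] = (updated, count + 1)
            labels.append(best)
    return labels


def next_segment_for_speaker(segments, labels, speaker, position_s):
    """Return the start time of the speaker's next segment after position_s, or None."""
    for (start, _end, _emb), label in zip(segments, labels):
        if label == speaker and start > position_s:
            return start
    return None


# Toy two-dimensional embeddings; real voice embeddings have many more dimensions.
segments = [
    (0, 5, [1.0, 0.0]),
    (5, 9, [0.0, 1.0]),
    (9, 14, [0.9, 0.1]),
    (14, 20, [0.1, 0.95]),
]
labels = assign_clusters(segments)
```

In this sketch, a "skip to this speaker's next line" operation would seek the player to the value returned by `next_segment_for_speaker`. Other speaker-based operations, such as muting or filtering by speaker, could reuse the same labels.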

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
