Abstract

Traditional subtitles lack the ability to convey the spatial origin of sounds, making it difficult to distinguish between speakers or follow complex audiovisual scenes. This limitation impacts comprehension, immersion, and accessibility, particularly in multi-speaker or action-heavy contexts. This disclosure describes techniques to enhance video content via expressive captions that integrate directional sound localization and speaker differentiation. Using microphone arrays or inferred localization data, the approach identifies the spatial origin of sound sources in real time. Speaker diarization distinguishes speakers based on direction. Subtitles are extended with spatial metadata to dynamically link text to positions on the screen. Expressive captions incorporate visual elements such as directional arrows, color codes, and augmented reality overlays to effectively convey sound direction and speaker identity. This approach improves comprehension, enhances immersion, and increases accessibility, providing a richer and more intuitive viewing experience across video streaming platforms, conferencing tools, and accessibility services.
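The sketch below illustrates, in Python, one way the spatial metadata described above might be attached to caption cues. All names (SpatialCue, screen_position, render_hint), the azimuth range, and the azimuth-to-position mapping are illustrative assumptions rather than part of the disclosure; the sketch simply pairs a diarized speaker label and an estimated sound direction with caption text, then derives a screen anchor, a per-speaker color key, and a directional arrow for rendering.

    # Minimal sketch (hypothetical structures, not from the disclosure): a subtitle
    # cue extended with spatial metadata, mapped to rendering hints.

    from dataclasses import dataclass

    @dataclass
    class SpatialCue:
        start: float        # cue start time, seconds
        end: float          # cue end time, seconds
        text: str           # transcribed caption text
        speaker_id: str     # label from speaker diarization
        azimuth_deg: float  # estimated sound direction, -90 (far left) to +90 (far right)

    def screen_position(azimuth_deg: float) -> float:
        """Map an azimuth estimate to a normalized horizontal position (0 = left, 1 = right)."""
        clamped = max(-90.0, min(90.0, azimuth_deg))
        return (clamped + 90.0) / 180.0

    def render_hint(cue: SpatialCue) -> dict:
        """Produce rendering hints: screen anchor, per-speaker color key, and direction arrow."""
        arrow = "<" if cue.azimuth_deg < -15 else ">" if cue.azimuth_deg > 15 else ""
        return {
            "text": f"{arrow} [{cue.speaker_id}] {cue.text}".strip(),
            "x": screen_position(cue.azimuth_deg),  # horizontal anchor, fraction of screen width
            "color_key": cue.speaker_id,            # consistent color per diarized speaker
        }

    # Example: a speaker localized to the left of the scene.
    cue = SpatialCue(start=12.0, end=14.5, text="Over here!", speaker_id="Speaker 2", azimuth_deg=-60.0)
    print(render_hint(cue))
    # {'text': '< [Speaker 2] Over here!', 'x': 0.1666..., 'color_key': 'Speaker 2'}

In a real pipeline, the azimuth estimate would come from microphone-array localization or inferred localization data, and the rendering layer (player, conferencing client, or AR overlay) would consume hints like these alongside the caption timing.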

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
