Abstract
Three-dimensional (3D) virtual environments synthesized from two-dimensional (2D) images currently lack an audiovisual accompaniment appropriate to the scenes in the images, which detracts from the immersive experience. This disclosure describes techniques to generate a spatial audio track appropriate to a 3D virtual scene and to insert the audio track into the 3D visual scene. 2D images of a given location are analyzed by a visual language model to obtain a text description of the location. The text description is provided as input to a music language model to generate audio that is appropriate to the text description (and hence to the underlying image set). The generated audio is subjected to spatial encoding, source separation, and/or remixing to spatially associate sounds with the locations of their corresponding sources in the virtual space. The techniques can provide an immersive audiovisual experience within augmented/extended/virtual reality environments.
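The pipeline described above can be sketched as follows. This is a minimal illustrative sketch only: the function names and the stub model calls below are hypothetical placeholders, not the disclosed implementation or any real model API.

```python
# Illustrative sketch of the disclosed pipeline: 2D images -> text description
# (visual language model) -> audio stems (music language model) -> spatialized
# sources. All functions below are hypothetical stubs for exposition.

def describe_scene(images):
    """Placeholder for a visual language model that converts 2D images of a
    location into a text description of the scene."""
    return "a quiet forest clearing with a stream to the left and birds overhead"

def generate_audio_stems(description):
    """Placeholder for a music language model plus source separation: returns
    one audio stem (raw bytes here) per identified sound source."""
    return {"stream": b"<pcm-audio>", "birds": b"<pcm-audio>"}

def spatialize(stems, source_positions):
    """Placeholder spatial encoding/remixing step: associates each separated
    sound source with its 3D position in the virtual scene."""
    return [{"source": name, "audio": audio, "position": source_positions[name]}
            for name, audio in stems.items()]

# Hypothetical inputs: a few 2D views of the location and assumed 3D
# coordinates (x, y, z) for each sound source in the virtual scene.
images = ["view_front.jpg", "view_left.jpg"]
description = describe_scene(images)
stems = generate_audio_stems(description)
positions = {"stream": (-2.0, 0.0, 1.0), "birds": (0.0, 3.0, 0.0)}
spatial_track = spatialize(stems, positions)
```

In a real system, `describe_scene` and `generate_audio_stems` would be calls into the respective models, and `spatialize` would render each stem with a spatial audio encoder so that the sounds appear to emanate from the corresponding locations in the scene.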
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D. and Nongpiur, Rajeev, "Artificial Intelligence Generated Spatial Audio for Immersive Virtual Reality Scenes", Technical Disclosure Commons, (November 05, 2024)
https://www.tdcommons.org/dpubs_series/7499