Abstract
Natural Language Understanding systems can struggle to resolve lexical ambiguity in code-switched speech, particularly with inter-language homonyms: words that are phonetically similar or identical across languages but differ in meaning. These challenges can be compounded by a reliance on audio-only input, which may miss environmental context that could aid disambiguation. The disclosed method addresses this by concurrently processing an audio stream containing a user's utterance and a video stream capturing the user's visual context. The system can be configured to identify potential homonyms in the transcribed speech and to extract semantic tags, such as object and scene labels, from the visual data. A fusion module can then align the visual tags with candidate word senses to select a contextually coherent interpretation. This approach may yield a more accurate understanding of user intent, potentially improving the reliability of command execution in real-world, multimodal interactions.
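The fusion step described above can be illustrated with a minimal sketch. The Python code below is a hypothetical illustration, not the disclosed implementation: the sense inventory, the tag sets, and the disambiguate function are all assumed for demonstration. It scores each candidate sense of a detected homonym by its overlap with the semantic tags extracted from the video stream, using the German/English homonym "Gift" (poison vs. present) as a toy example.

# Hypothetical sketch of the tag-to-sense fusion step; names and data
# are illustrative only, not the published implementation.
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set

@dataclass
class WordSense:
    gloss: str
    language: str
    related_tags: FrozenSet[str]  # visual tags that support this sense

# Toy sense inventory for one inter-language homonym: "Gift" means
# "present" in English but "poison" in German.
SENSE_INVENTORY = {
    "gift": [
        WordSense("a present given to someone", "en",
                  frozenset({"wrapped_box", "ribbon", "birthday_cake"})),
        WordSense("a toxic substance (German 'Gift')", "de",
                  frozenset({"warning_label", "chemical_bottle", "skull_symbol"})),
    ],
}

def disambiguate(token: str, visual_tags: Set[str]) -> Optional[WordSense]:
    """Select the candidate sense whose associated tags best overlap
    the semantic tags extracted from the video stream."""
    candidates = SENSE_INVENTORY.get(token.lower())
    if not candidates:
        return None
    # Score each sense by tag overlap; ties fall back to the first sense.
    return max(candidates, key=lambda s: len(s.related_tags & visual_tags))

if __name__ == "__main__":
    tags_from_video = {"wrapped_box", "ribbon", "table"}
    sense = disambiguate("Gift", tags_from_video)
    print(sense.gloss)  # -> "a present given to someone"

A production system would presumably replace the static inventory with a learned sense representation and score alignment with a multimodal model; the simple tag-overlap count stands in for that scoring function here.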
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Kumar S R, Mithun, "A System for Disambiguating Code-Switched Speech Using Visual Context", Technical Disclosure Commons, (September 04, 2025)
https://www.tdcommons.org/dpubs_series/8549