Abstract
Natural Language Understanding systems can struggle to resolve lexical ambiguity in code-switched speech, particularly with inter-language homonyms: words that are phonetically similar or identical across languages but differ in meaning. These challenges can be compounded by a reliance on audio-only input, which may miss environmental context that could aid disambiguation. The disclosed method addresses this by concurrently processing an audio stream containing a user's utterance and a video stream capturing the user's visual context. The system can be configured to identify potential homonyms in the transcribed speech and to extract semantic tags, such as object and scene labels, from the visual data. A fusion module can then align the visual tags with candidate word senses to select a contextually coherent interpretation. This approach may yield a more accurate understanding of user intent, potentially improving the reliability of command execution in real-world, multimodal interactions.
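The fusion step described above can be illustrated with a minimal sketch. The Python code below is a hypothetical illustration, not the disclosed implementation: the sense inventory, the tag sets, and the disambiguate function are all assumed for demonstration. It scores each candidate sense of a detected homonym by its overlap with the semantic tags extracted from the video stream, using the German/English homonym "Gift" (poison vs. present) as a toy example.

# Hypothetical sketch of the tag-to-sense fusion step; names and data
# are illustrative only, not the published implementation.
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set

@dataclass
class WordSense:
    gloss: str
    language: str
    related_tags: FrozenSet[str]  # visual tags that support this sense

# Toy sense inventory for one inter-language homonym: "Gift" means
# "present" in English but "poison" in German.
SENSE_INVENTORY = {
    "gift": [
        WordSense("a present given to someone", "en",
                  frozenset({"wrapped_box", "ribbon", "birthday_cake"})),
        WordSense("a toxic substance (German 'Gift')", "de",
                  frozenset({"warning_label", "chemical_bottle", "skull_symbol"})),
    ],
}

def disambiguate(token: str, visual_tags: Set[str]) -> Optional[WordSense]:
    """Select the candidate sense whose associated tags best overlap
    the semantic tags extracted from the video stream."""
    candidates = SENSE_INVENTORY.get(token.lower())
    if not candidates:
        return None
    # Score each sense by tag overlap; ties fall back to the first sense.
    return max(candidates, key=lambda s: len(s.related_tags & visual_tags))

if __name__ == "__main__":
    tags_from_video = {"wrapped_box", "ribbon", "table"}
    sense = disambiguate("Gift", tags_from_video)
    print(sense.gloss)  # -> "a present given to someone"

A production system would presumably replace the static inventory with a learned sense representation and score alignment with a multimodal model; the simple tag-overlap count stands in for that scoring function here.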
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Kumar S R, Mithun, "A System for Disambiguating Code-Switched Speech Using Visual Context", Technical Disclosure Commons, (September 04, 2025)
https://www.tdcommons.org/dpubs_series/8549