Abstract

Voice interaction is often the primary input modality for wearable electronic devices, such as augmented reality (AR) glasses. However, effective interaction via a voice interface can be difficult in common scenarios such as noisy environments or quiet environments where the wearer cannot provide audible spoken input. This disclosure describes techniques that enable adaptive, multi-modal speech-to-text transcription on smart glasses using sensor fusion. The smart glasses include microphones (e.g., a microphone array), a vibration sensor (e.g., in the nose bridge), and low-power cameras that have the wearer’s lips in the field of view. Data streams from these sensors are combined to produce text output. Power efficiency is achieved through a staged architecture. In the first stage, low-power components are used to detect when the user is likely speaking. If the user is detected to be speaking, the device application processor is activated and an adaptive multimodal fusion model assigns respective weights to the microphone, vibration sensor, and camera inputs based on detected noise levels to enable high-quality speech transcription.
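The sketch below is a minimal illustration of the staged, noise-adaptive fusion described above, assuming a simple energy threshold for the low-power speaking check and hand-picked modality weights. All names, thresholds, and weight values are hypothetical placeholders, not values from the disclosure; the per-modality encoders and the speech-to-text decoder are omitted.

```python
import numpy as np

# Illustrative thresholds; real values would be tuned per device (assumptions).
VIBRATION_THRESHOLD = 0.05   # nose-bridge vibration energy indicating speech
HIGH_NOISE = 0.6             # ambient noise level above which mics are unreliable
LOW_AUDIBILITY = 0.1         # below this, assume the wearer is whispering/near-silent


def likely_speaking(vibration_frame: np.ndarray) -> bool:
    """Stage 1 (always-on, low power): treat sustained vibration energy at the
    nose bridge as a proxy for the wearer speaking, without waking the
    application processor."""
    return float(np.mean(np.abs(vibration_frame))) > VIBRATION_THRESHOLD


def fusion_weights(noise_level: float) -> dict[str, float]:
    """Stage 2: pick per-modality weights from the estimated ambient noise.
    In loud scenes, de-emphasize the microphone array and lean on the vibration
    sensor and lip camera; for near-silent input, lip video dominates."""
    if noise_level > HIGH_NOISE:
        return {"mic": 0.2, "vibration": 0.4, "camera": 0.4}
    if noise_level < LOW_AUDIBILITY:
        return {"mic": 0.1, "vibration": 0.3, "camera": 0.6}
    return {"mic": 0.6, "vibration": 0.2, "camera": 0.2}


def fuse_embeddings(mic_emb, vib_emb, cam_emb, noise_level):
    """Weighted sum of per-modality embeddings; the result would be fed to an
    on-device speech-to-text decoder (not shown)."""
    w = fusion_weights(noise_level)
    return w["mic"] * mic_emb + w["vibration"] * vib_emb + w["camera"] * cam_emb


if __name__ == "__main__":
    # Usage sketch with synthetic sensor data.
    rng = np.random.default_rng(0)
    vibration_frame = 0.1 * rng.standard_normal(256)  # simulated bone-conduction signal
    if likely_speaking(vibration_frame):
        # The application processor wakes up; per-modality encoders (not shown)
        # would produce these embeddings from the raw microphone, vibration,
        # and lip-camera streams.
        mic_emb, vib_emb, cam_emb = rng.standard_normal((3, 128))
        fused = fuse_embeddings(mic_emb, vib_emb, cam_emb, noise_level=0.7)
        print("fused embedding shape:", fused.shape)
```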

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
