Abstract

Voice interaction is often the primary input modality for wearable electronic devices, such as augmented reality (AR) glasses. However, effective interaction via a voice interface can be difficult in common scenarios such as noisy environments or quiet environments where the wearer cannot provide audible spoken input. This disclosure describes techniques that enable adaptive, multi-modal speech-to-text transcription on smart glasses using sensor fusion. The smart glasses include microphones (e.g., a microphone array), a vibration sensor (e.g., in the nose bridge), and low-power cameras that have the wearer’s lips in the field of view. Data streams from these sensors are combined to produce text output. Power efficiency is achieved through a staged architecture. In the first stage, low-power components are used to detect when the user is likely speaking. If the user is detected to be speaking, the device application processor is activated and an adaptive multimodal fusion model assigns respective weights to the microphone, vibration sensor, and camera inputs based on detected noise levels to enable high-quality speech transcription.
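The sketch below is a minimal illustration of the staged, noise-adaptive fusion described above, assuming a simple energy threshold for the low-power speaking check and hand-picked modality weights. All names, thresholds, and weight values are hypothetical placeholders, not values from the disclosure; the per-modality encoders and the speech-to-text decoder are omitted.

```python
import numpy as np

# Illustrative thresholds; real values would be tuned per device (assumptions).
VIBRATION_THRESHOLD = 0.05   # nose-bridge vibration energy indicating speech
HIGH_NOISE = 0.6             # ambient noise level above which mics are unreliable
LOW_AUDIBILITY = 0.1         # below this, assume the wearer is whispering/near-silent


def likely_speaking(vibration_frame: np.ndarray) -> bool:
    """Stage 1 (always-on, low power): treat sustained vibration energy at the
    nose bridge as a proxy for the wearer speaking, without waking the
    application processor."""
    return float(np.mean(np.abs(vibration_frame))) > VIBRATION_THRESHOLD


def fusion_weights(noise_level: float) -> dict[str, float]:
    """Stage 2: pick per-modality weights from the estimated ambient noise.
    In loud scenes, de-emphasize the microphone array and lean on the vibration
    sensor and lip camera; for near-silent input, lip video dominates."""
    if noise_level > HIGH_NOISE:
        return {"mic": 0.2, "vibration": 0.4, "camera": 0.4}
    if noise_level < LOW_AUDIBILITY:
        return {"mic": 0.1, "vibration": 0.3, "camera": 0.6}
    return {"mic": 0.6, "vibration": 0.2, "camera": 0.2}


def fuse_embeddings(mic_emb, vib_emb, cam_emb, noise_level):
    """Weighted sum of per-modality embeddings; the result would be fed to an
    on-device speech-to-text decoder (not shown)."""
    w = fusion_weights(noise_level)
    return w["mic"] * mic_emb + w["vibration"] * vib_emb + w["camera"] * cam_emb


if __name__ == "__main__":
    # Usage sketch with synthetic sensor data.
    rng = np.random.default_rng(0)
    vibration_frame = 0.1 * rng.standard_normal(256)  # simulated bone-conduction signal
    if likely_speaking(vibration_frame):
        # The application processor wakes up; per-modality encoders (not shown)
        # would produce these embeddings from the raw microphone, vibration,
        # and lip-camera streams.
        mic_emb, vib_emb, cam_emb = rng.standard_normal((3, 128))
        fused = fuse_embeddings(mic_emb, vib_emb, cam_emb, noise_level=0.7)
        print("fused embedding shape:", fused.shape)
```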

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
