D Shin


While augmented reality (AR) devices enable users to provide queries via different modalities such as touch, gesture, voice, and camera, providing a multimodal query currently requires the user to perform multiple manual steps. This disclosure describes techniques, implemented with user permission, for automatic intelligent multimodal input capture by an AR device: the device understands a user query by automatically capturing input from additional modalities beyond the one the user employed to provide the query. The initial query (e.g., a spoken query) is decoded, e.g., with a transformer decoder; if the query indicates that input from an additional modality is needed, decoding is paused and capture via that modality is triggered. For example, in response to a spoken query that references a visual input, the device camera is automatically activated, and the image(s) captured by the camera are interpreted jointly with the spoken query to generate a response to the user. The triggering event for the additional modality can be a designated valid token emitted by the transformer decoder.
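The decode-pause-capture-resume flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the trigger token name, the `capture_image` callback, and the token stream are all hypothetical placeholders for the decoder's actual output and the device's camera API.

```python
# Hypothetical trigger token that the transformer decoder emits when the
# query indicates that visual input is needed (name is illustrative).
CAMERA_TOKEN = "<capture_image>"


def interpret_query(token_stream, capture_image):
    """Consume decoder output tokens one at a time.

    When the trigger token appears, pause text decoding, invoke the
    additional modality (here, a camera capture callback), and fold the
    captured input into the context so both modalities are interpreted
    jointly. Returns the combined multimodal context.
    """
    context = []
    for token in token_stream:
        if token == CAMERA_TOKEN:
            # Pause decoding and activate the camera (with user permission).
            image = capture_image()
            context.append(("image", image))
        else:
            context.append(("text", token))
    return context


# Example: a spoken query "what plant is this" whose decoding emits the
# camera trigger token mid-stream.
tokens = ["what", "plant", "is", CAMERA_TOKEN, "this"]
result = interpret_query(tokens, capture_image=lambda: "raw_image_bytes")
```

In a real system the joint interpretation would be performed by the multimodal model itself; the list-of-pairs context here simply makes the interleaving of text tokens and captured images explicit.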

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.