Abstract

Augmented reality (AR) devices let users issue queries through several modalities, such as touch, gesture, voice, and camera, but providing a multimodal query currently requires the user to perform multiple manual steps. This disclosure describes techniques, implemented with user permission, by which an AR device automatically captures input from modalities other than the one the user used to issue the query, to help understand that query. The initial query (e.g., a spoken query) is decoded, e.g., with a transformer decoder. If the query indicates that input from an additional modality is needed, decoding is paused and that modality is triggered. For example, in response to a spoken query that indicates a visual input, the device camera is automatically activated, and the image(s) captured by the camera are interpreted jointly with the spoken query to generate a response to the user. The triggering event for the additional modality can be a valid token in the transformer decoder.
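
The pause-and-trigger decoding flow described above can be illustrated with a minimal sketch. It is not the disclosure's implementation: the token names `<capture_image>` and `<eos>`, the function `decode_with_modality_trigger`, and the stubs `toy_next_token` and `toy_camera` are all hypothetical, and the "model" is a hand-written rule rather than a trained transformer decoder. The sketch also assumes the user has already granted camera permission.

```python
from typing import Callable, List

# Hypothetical special tokens; the disclosure only says the trigger
# "can be a valid token in the transformer decoder".
CAPTURE_TOKEN = "<capture_image>"
EOS_TOKEN = "<eos>"


def decode_with_modality_trigger(
    query_tokens: List[str],
    next_token: Callable[[List[str]], str],
    capture_image_tokens: Callable[[], List[str]],
    max_steps: int = 32,
) -> List[str]:
    """Decode a response, pausing to capture image tokens when triggered."""
    context = list(query_tokens)
    response: List[str] = []
    for _ in range(max_steps):
        token = next_token(context)
        if token == EOS_TOKEN:
            break
        if token == CAPTURE_TOKEN:
            # Pause decoding: activate the extra modality and splice the
            # tokenized snapshot into the context before resuming.
            context.append(token)
            context.extend(capture_image_tokens())
            continue
        context.append(token)
        response.append(token)
    return response


def toy_next_token(context: List[str]) -> str:
    """Stand-in for a decoder step (assumption: rule-based, not learned)."""
    if "this" in context and CAPTURE_TOKEN not in context:
        return CAPTURE_TOKEN  # the query refers to something visual
    if "<img:mug>" in context and "mug" not in context:
        return "mug"  # answer grounded in the image token
    return EOS_TOKEN


camera_calls = []


def toy_camera() -> List[str]:
    """Stand-in for camera capture plus image tokenization."""
    camera_calls.append("snapshot")
    return ["<img:mug>"]


visual_answer = decode_with_modality_trigger(
    ["what", "is", "this"], toy_next_token, toy_camera
)
plain_answer = decode_with_modality_trigger(
    ["what", "time", "is", "it"], toy_next_token, toy_camera
)
```

In this sketch the camera stub runs only for the query that the toy decoder treats as visual; the second query finishes without triggering any capture. A real system would replace the two stubs with a trained decoder and an image tokenizer.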

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Shin, D, "Transformer Decoder with Tokenized Image Snapshot for Interactive Augmented Reality", Technical Disclosure Commons, (January 11, 2024)
https://www.tdcommons.org/dpubs_series/6594

Technical Disclosure Commons

Defensive Publications Series

Transformer Decoder with Tokenized Image Snapshot for Interactive Augmented Reality
