Abstract
Visually impaired individuals face challenges in understanding the spatial location of objects around them, particularly objects that are at a distance. This lack of spatial awareness can lead to difficulties in navigation and interaction with the environment. This disclosure describes the use of an augmented reality (AR) headset and a multimodal large language model (LLM) to automatically provide useful scene descriptions to users. The headset includes world-side cameras that capture visual information in a user’s surroundings. The headset also includes eye-side cameras that, with user permission, detect the direction and focus of the user’s gaze. The detected eye movements are used to select points of interest (regions of the image captured by the world-side cameras), which are provided as input to the LLM (or other suitable scene summarization technique). The LLM performs recognition and generates a description of the surroundings, which is provided to the user as spoken output. The speed of delivery and amount of content in the spoken output can be tailored according to the context and the user’s preferences. Further, the longer the duration of the gaze, the more detailed the description of the point of interest that is provided.
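As one illustrative sketch (not part of the disclosure), the snippet below shows how the described pipeline might fit together: a gaze sample selects a point-of-interest crop from a world-camera frame, the requested level of detail is scaled with gaze duration, and the region is passed to a multimodal LLM whose output is spoken to the user. All names here (GazeSample, llm_describe, speak, the dwell-time thresholds and crop size) are hypothetical placeholders introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GazeSample:
    x: float              # normalized gaze coordinates in the world-camera frame (0..1)
    y: float
    dwell_seconds: float  # how long the gaze has rested near this point

def detail_level(dwell_seconds: float) -> str:
    """Map gaze duration to the amount of detail requested (assumed thresholds)."""
    if dwell_seconds < 1.0:
        return "one short sentence"
    if dwell_seconds < 3.0:
        return "two or three sentences"
    return "a detailed paragraph, including spatial relationships and approximate distances"

def crop_point_of_interest(frame, gaze: GazeSample, size_frac: float = 0.25):
    """Crop a square region of the world-camera frame centered on the gaze point.

    `frame` is assumed to be an H x W x 3 array (e.g., a numpy image); only the
    crop arithmetic matters for this sketch.
    """
    h, w = frame.shape[:2]
    half = int(min(h, w) * size_frac / 2)
    cx, cy = int(gaze.x * w), int(gaze.y * h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    return frame[y0:y1, x0:x1]

def describe_point_of_interest(
    frame,
    gaze: GazeSample,
    llm_describe: Callable[[object, str], str],  # multimodal LLM: (image, prompt) -> text
    speak: Callable[[str], None],                # text-to-speech output
    verbosity: str = "normal",                   # user preference for amount of content
) -> str:
    """Select the gazed-at region, ask the LLM to describe it, and speak the result."""
    roi = crop_point_of_interest(frame, gaze)
    prompt = (
        f"Describe this region of the user's surroundings in "
        f"{detail_level(gaze.dwell_seconds)} for a visually impaired listener. "
        f"Verbosity preference: {verbosity}."
    )
    description = llm_describe(roi, prompt)
    speak(description)
    return description
```

In this sketch the camera, LLM, and speech synthesis are injected as callables, so the same gaze-to-description logic could sit on top of whatever on-device or cloud components a particular headset uses.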
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Boiarshinov, Dmitrii; Guajardo, Jaime; and Lanning, Gabi, "Providing LLM-generated Point of Interest Description Based on Gaze Tracking", Technical Disclosure Commons, (June 25, 2024)
https://www.tdcommons.org/dpubs_series/7132