Abstract
Large language models (LLMs) are capable of answering questions with reference to images or videos. However, LLMs sometimes get spatial relationships wrong. A key challenge in training LLMs for three-dimensional (3D) scene understanding is the lack of suitably large image and video datasets. This disclosure describes the use of a simulated 3D environment to obtain two-dimensional (2D) images or videos of the environment from different perspectives. A virtual camera viewpoint is moved through the simulated 3D space, and a stream of 2D images is captured from the different positions. The simulated 3D environment can be configured to model scenarios where the LLM has relatively low accuracy. Using the gathered images and the ground truth available from the simulation, supervised fine-tuning (SFT) of the LLM can be performed. Since data gathering is automated, large amounts of training data can be collected at low cost. Further, the simulated 3D environment can be structured to mimic real-world use cases where an LLM answers spatial questions, e.g., in image/video understanding tasks.
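The disclosure does not specify an implementation, but the data-generation pipeline can be sketched. The following Python snippet is a minimal, illustrative example (not the author's method): a toy scene, the look_at(), render(), and spatial_relation() helpers, and the object names are all hypothetical, and render() is a stub standing in for a real 3D engine such as Blender or Unity. The key idea it shows is that because object and camera positions are known in the simulation, ground-truth answers to spatial questions can be computed automatically for each captured frame.

```python
# Illustrative sketch: generate (image, question, ground-truth answer) triples
# from a simulated 3D scene for supervised fine-tuning. All names here are
# assumptions for illustration; render() is a placeholder for a 3D engine.

import json
import numpy as np

# A toy scene: object names mapped to 3D world positions (x, y, z).
SCENE = {
    "red cube": np.array([1.0, 0.0, 3.0]),
    "blue ball": np.array([-1.5, 0.0, 4.0]),
}

def look_at(camera_pos, target):
    """World-to-camera rotation for a camera at camera_pos facing target."""
    forward = target - camera_pos
    forward /= np.linalg.norm(forward)
    up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    return np.stack([right, true_up, -forward])  # rows: camera x, y, z axes

def render(scene, camera_pos, rotation, out_path):
    """Placeholder: a real implementation would rasterize the scene with a
    3D engine and save a 2D image at out_path."""
    return out_path

def spatial_relation(a, b, camera_pos, rotation):
    """Ground truth from simulation state: is object a left or right of
    object b as seen from this camera pose?"""
    a_cam = rotation @ (a - camera_pos)
    b_cam = rotation @ (b - camera_pos)
    return "left of" if a_cam[0] < b_cam[0] else "right of"

examples = []
# Move the camera viewpoint through the scene; emit one SFT example per pose.
for i, camera_pos in enumerate([np.array([0.0, 1.0, 0.0]),
                                np.array([4.0, 1.0, 3.0])]):
    rot = look_at(camera_pos, np.array([0.0, 0.0, 3.5]))
    image = render(SCENE, camera_pos, rot, f"frame_{i:04d}.png")
    rel = spatial_relation(SCENE["red cube"], SCENE["blue ball"],
                           camera_pos, rot)
    examples.append({
        "image": image,
        "question": "Is the red cube to the left or right of the blue ball?",
        "answer": f"The red cube is {rel} the blue ball.",
    })

print(json.dumps(examples, indent=2))
```

Because every example is produced programmatically from simulation state, the loop can be scaled to arbitrarily many camera poses, scenes, and relation types at negligible marginal cost, which is the property the disclosure relies on for low-cost data collection.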
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Weisz, Ágoston, "Using Images from a Simulated 3D Environment for LLM Fine-tuning for Image and Video Understanding Tasks", Technical Disclosure Commons, (July 17, 2025).
https://www.tdcommons.org/dpubs_series/8374