Abstract
Obtaining training data for video understanding models traditionally requires human experts to write text based on a video. This is expensive, prone to human error, and not easily scalable. This disclosure describes techniques that leverage machine learning (ML) models to automatically generate training data for video understanding models. Video and image captioning models and speech-to-text techniques are used to generate text captions for training videos. A large language model (LLM) is prompted to generate a diverse set of questions based on the captions, as well as questions unrelated to the video. An LLM grounded on the captions is used to generate answers to the questions. A subset of the generated question-answer pairs is scored by human raters. These scores are used to train a reward model that scores the entire set of question-answer pairs. High-scoring pairs are used to fine-tune a video understanding model. The described approach addresses the challenges of time, cost, and human attention span by breaking the training data collection problem into separate tasks and performing each task with a corresponding ML model.
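The following is a minimal sketch of the pipeline described above, not the authors' implementation. All model interfaces (`caption_video`, `transcribe_audio`, `llm`, `score_pair`) are hypothetical placeholders standing in for whatever captioning, speech-to-text, LLM, and reward models a real system would use.

```python
# Hedged sketch of the described synthetic-data pipeline.
# The stub functions below are hypothetical placeholders, not real APIs.

def caption_video(video_path: str) -> str:
    """Placeholder: run a video/image captioning model over the video."""
    return "A person assembles a wooden bookshelf in a living room."

def transcribe_audio(video_path: str) -> str:
    """Placeholder: run speech-to-text on the video's audio track."""
    return "First, attach the side panels using the long screws."

def llm(prompt: str) -> str:
    """Placeholder: call a large language model with the given prompt."""
    return "What is the person building?\nA wooden bookshelf."

def generate_qa_pairs(video_path: str, num_questions: int = 10) -> list[tuple[str, str]]:
    # Step 1: derive a text context from the video via captioning and STT.
    context = caption_video(video_path) + "\n" + transcribe_audio(video_path)

    # Step 2: prompt an LLM for diverse questions based on the captions,
    # plus some deliberately unrelated questions.
    questions = llm(
        f"Given this video description:\n{context}\n"
        f"Write {num_questions} diverse questions about the video, "
        "and 2 questions unrelated to it."
    ).splitlines()

    # Step 3: answer each question with an LLM grounded on the captions.
    return [
        (q, llm(f"Context:\n{context}\nAnswer concisely: {q}"))
        for q in questions
        if q.strip()
    ]

def filter_pairs(pairs: list[tuple[str, str]], score_pair, threshold: float = 0.8):
    # Step 4: keep only pairs the reward model scores highly. The reward
    # model itself is trained on the human-rated subset of pairs.
    return [(q, a) for q, a in pairs if score_pair(q, a) >= threshold]
```

The pairs that survive `filter_pairs` would then form the fine-tuning set for the video understanding model; the `threshold` value is an illustrative assumption.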
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Rendulic, Ivor; Tragut, Manuel; and Weisz, Ágoston, "Synthetic Data Generation for Training Video Understanding Models", Technical Disclosure Commons, (June 06, 2025)
https://www.tdcommons.org/dpubs_series/8203