Abstract
Video understanding using multimodal large language models (LLMs) involves recognizing objects, actions, and scenes, and extracting meaningful insights from video streams. To reduce the computational burden, a subset of frames sampled from the video is fed to the LLM. However, this can reduce the accuracy of LLM inference when the video's salient information is concentrated in a small set of frames, and it is wasteful when the video contains slow-moving scenes. This disclosure describes a dynamic subsampling technique that increases the likelihood of selecting the most salient frames. Specifically, attention-guided frame selection, 3D convolutional feature extraction, and entropy-based subspace projection are utilized to ensure that the most important information from the video is fed to the LLM. The techniques reduce the number of frames the LLM must process compared to fixed-frame-rate sampling while also improving inference accuracy.
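As a rough illustration (not the disclosure's implementation), the Python sketch below shows the core subsampling idea in its simplest form: score frames with an entropy-based saliency proxy and spend a fixed frame budget on the highest-scoring ones. The `frame_entropy` and `dynamic_sample` names and the histogram-entropy proxy are illustrative assumptions; the attention-guided selection and 3D convolutional feature extraction named in the disclosure are omitted here.

```python
import numpy as np

def frame_entropy(frame: np.ndarray, bins: int = 32) -> float:
    # Shannon entropy of the intensity histogram: a cheap proxy for how
    # much visual information a single frame carries.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def dynamic_sample(frames: np.ndarray, budget: int) -> list[int]:
    # Score each frame by how sharply its entropy departs from the previous
    # frame's: large jumps suggest new visual content, so fast-changing
    # segments receive more of the fixed frame budget than slow-moving ones.
    entropies = np.array([frame_entropy(f) for f in frames])
    change = np.abs(np.diff(entropies, prepend=entropies[0]))
    top = np.argsort(change)[-budget:]
    return sorted(top.tolist())

# Example: 120 synthetic grayscale frames, keep the 8 highest-scoring ones.
rng = np.random.default_rng(seed=0)
video = rng.integers(0, 256, size=(120, 64, 64), dtype=np.uint8)
print(dynamic_sample(video, budget=8))
```

A fuller variant along the lines of the disclosure would replace the histogram proxy with features from a 3D convolutional network and weight the scores with attention signals, but the budget-allocation pattern would remain the same.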
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Labrador, Beltrán; Stephanov, Georgi; Stuken, Yury; Sulser, Fabio; Akolzin, Ilia; Siegenthaler, Olivier; Weisz, Ágoston; and Tragut, Manuel, "Dynamic Frame Sampling for Multimodal Large Language Model Video Understanding", Technical Disclosure Commons (May 08, 2025).
https://www.tdcommons.org/dpubs_series/8101