Abstract

Video understanding using multimodal large language models (LLMs) involves recognizing objects, actions, and scenes and extracting meaningful insights from video streams. To reduce the computational burden, a subset of frames sampled from the video is fed to the LLM. However, fixed-rate sampling can reduce the accuracy of LLM inference when the salient information is concentrated in a small set of frames, and is wasteful when the video has slow-moving scenes. This disclosure describes a dynamic subsampling technique that increases the likelihood of selecting the most salient frames. Specifically, attention-guided frame selection, 3D convolutional feature extraction, and entropy-based subspace projection are utilized to ensure that the most important information from the video is fed to the LLM. Compared to fixed-frame-rate sampling, the techniques reduce the number of frames the LLM must process while also improving inference accuracy.
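
The abstract does not include an implementation, but the pipeline it names can be illustrated. The following is a minimal PyTorch sketch under stated assumptions: a single Conv3d layer stands in for the 3D convolutional feature extractor, a dot-product attention scorer ranks frames, and an entropy-based frame budget is used as a simplified stand-in for the disclosed entropy-based subspace projection, which the abstract does not specify. The function name select_frames and all shapes and hyperparameters are illustrative assumptions, not the authors' design.

import torch
import torch.nn.functional as F

def select_frames(video: torch.Tensor, max_frames: int = 16) -> torch.Tensor:
    """Dynamically subsample salient frames from a video.

    video: (T, C, H, W) tensor of raw RGB frames. Returns the indices of
    the selected frames in temporal order. Illustrative sketch only; not
    the disclosed implementation.
    """
    T = video.shape[0]

    # 1) 3D convolutional feature extraction. A single untrained Conv3d
    #    stands in for a deeper spatiotemporal backbone.
    conv3d = torch.nn.Conv3d(in_channels=3, out_channels=8,
                             kernel_size=3, padding=1)
    with torch.no_grad():
        # Conv3d expects (N, C, T, H, W).
        feats = conv3d(video.permute(1, 0, 2, 3).unsqueeze(0))
    # Pool spatial dimensions to get one feature vector per frame: (T, 8).
    frame_feats = feats.mean(dim=(3, 4)).squeeze(0).permute(1, 0)

    # 2) Attention-guided frame scoring: scaled dot-product attention of
    #    each frame feature against the mean "query" of the whole clip.
    query = frame_feats.mean(dim=0, keepdim=True)                  # (1, 8)
    scores = (frame_feats @ query.T).squeeze(1) / frame_feats.shape[1] ** 0.5
    attn = F.softmax(scores, dim=0)                                # (T,)

    # 3) Entropy-based budget (assumed heuristic): near-uniform attention
    #    (high entropy, slow-moving scene) -> keep few frames; attention
    #    concentrated on a small set of frames (low entropy) -> keep more
    #    of the top-scored frames, up to max_frames.
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum()
    max_entropy = torch.log(torch.tensor(float(T)))
    budget = int(max_frames * (1.0 - entropy / max_entropy).item() + 1)
    budget = max(1, min(budget, max_frames, T))

    # Keep the highest-attention frames and restore temporal order.
    return attn.topk(budget).indices.sort().values

# Example: a random 64-frame clip is reduced to at most 16 frames.
clip = torch.rand(64, 3, 32, 32)
print(select_frames(clip, max_frames=16))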

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
