Abstract
This disclosure describes a system for automated video chapterization and summarization that can use a large multimodal transformer architecture. The system can process a video's visual frames, audio track, and speech transcript, and may employ a hierarchical attention mechanism to capture both local and global dependencies across modalities. The model may be pre-trained on a dataset of chapterized videos using techniques such as masked modeling and a chapter prediction task. A classification head can predict chapter boundaries, and a decoder network can generate summaries. By producing chapters and summaries jointly, the approach may assist viewers in navigating and comprehending long-form videos.
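As a concrete illustration of how such a system might be organized, the following is a minimal PyTorch sketch. It is not the disclosed implementation: the class name, all dimensions, the fusion-by-addition step, the attention window size, and the layer counts are assumptions made for illustration, and per-modality feature extraction (vision, audio, and transcript encoders) is assumed to happen upstream, with pre-computed embeddings passed in as inputs.

```python
# Minimal sketch, assuming PyTorch. Dimensions, window size, and layer
# counts are illustrative assumptions, not details from the disclosure.
import torch
import torch.nn as nn

class MultimodalChapterizer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_local_layers=2,
                 n_global_layers=4, window=32, vocab_size=32000):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.frame_proj = nn.Linear(1024, d_model)  # visual frame features
        self.audio_proj = nn.Linear(128, d_model)   # audio track features
        self.text_proj = nn.Linear(768, d_model)    # speech transcript features
        self.window = window
        # Hierarchical attention: local layers attend within short windows,
        # global layers attend across the entire fused sequence.
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_local_layers)
        self.global_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_global_layers)
        # Classification head: per-timestep boundary / non-boundary logits.
        self.boundary_head = nn.Linear(d_model, 2)
        # Decoder network for generating summary text.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), 2)
        self.summary_embed = nn.Embedding(vocab_size, d_model)
        self.summary_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames, audio, text, summary_tokens):
        # Fuse modalities by summing projections (one plausible fusion choice).
        x = self.frame_proj(frames) + self.audio_proj(audio) + self.text_proj(text)
        b, t, d = x.shape
        # Local stage: pad to a multiple of the window, attend within windows.
        pad = (-t) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))
        x = self.local_encoder(x.reshape(-1, self.window, d))
        x = x.reshape(b, t + pad, d)[:, :t]
        # Global stage: attend across the whole sequence.
        x = self.global_encoder(x)
        boundary_logits = self.boundary_head(x)  # (b, t, 2)
        # Summary decoding, conditioned on the fused video representation.
        y = self.summary_embed(summary_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            summary_tokens.size(1)).to(y.device)
        y = self.decoder(y, memory=x, tgt_mask=mask)
        return boundary_logits, self.summary_head(y)
```

Under these assumptions, a cross-entropy loss on the boundary logits and a token-level loss on the summary logits could be optimized together, mirroring the joint chapterization-and-summarization process described above; the masked-modeling and chapter-prediction pre-training tasks would reuse the same encoder stack.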
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A, "A Multimodal Transformer Approach to Joint Video Chapterization and Summarization", Technical Disclosure Commons, (November 17, 2025)
https://www.tdcommons.org/dpubs_series/8885