Abstract
This disclosure describes a system for automated video chapterization and summarization that can use a large multimodal transformer architecture. The system can process a video's visual frames, audio track, and speech transcript, and may employ a hierarchical attention mechanism to capture both local and global dependencies across modalities. The model may be pre-trained on a dataset of chapterized videos using techniques such as masked modeling and a chapter prediction task. A classification head can predict chapter boundaries, and a decoder network can generate summaries. By producing chapters and summaries jointly, the approach may assist viewers in navigating and comprehending long-form videos.
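As a concrete illustration of how such a system might be organized, the following is a minimal PyTorch sketch. It is not the disclosed implementation: the class name, all dimensions, the fusion-by-addition step, the attention window size, and the layer counts are assumptions made for illustration, and per-modality feature extraction (vision, audio, and transcript encoders) is assumed to happen upstream, with pre-computed embeddings passed in as inputs.

```python
# Minimal sketch, assuming PyTorch. Dimensions, window size, and layer
# counts are illustrative assumptions, not details from the disclosure.
import torch
import torch.nn as nn

class MultimodalChapterizer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_local_layers=2,
                 n_global_layers=4, window=32, vocab_size=32000):
        super().__init__()
        # Project each modality's pre-extracted features into a shared space.
        self.frame_proj = nn.Linear(1024, d_model)  # visual frame features
        self.audio_proj = nn.Linear(128, d_model)   # audio track features
        self.text_proj = nn.Linear(768, d_model)    # speech transcript features
        self.window = window
        # Hierarchical attention: local layers attend within short windows,
        # global layers attend across the entire fused sequence.
        self.local_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_local_layers)
        self.global_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            n_global_layers)
        # Classification head: per-timestep boundary / non-boundary logits.
        self.boundary_head = nn.Linear(d_model, 2)
        # Decoder network for generating summary text.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), 2)
        self.summary_embed = nn.Embedding(vocab_size, d_model)
        self.summary_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames, audio, text, summary_tokens):
        # Fuse modalities by summing projections (one plausible fusion choice).
        x = self.frame_proj(frames) + self.audio_proj(audio) + self.text_proj(text)
        b, t, d = x.shape
        # Local stage: pad to a multiple of the window, attend within windows.
        pad = (-t) % self.window
        x = nn.functional.pad(x, (0, 0, 0, pad))
        x = self.local_encoder(x.reshape(-1, self.window, d))
        x = x.reshape(b, t + pad, d)[:, :t]
        # Global stage: attend across the whole sequence.
        x = self.global_encoder(x)
        boundary_logits = self.boundary_head(x)  # (b, t, 2)
        # Summary decoding, conditioned on the fused video representation.
        y = self.summary_embed(summary_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            summary_tokens.size(1)).to(y.device)
        y = self.decoder(y, memory=x, tgt_mask=mask)
        return boundary_logits, self.summary_head(y)
```

Under these assumptions, a cross-entropy loss on the boundary logits and a token-level loss on the summary logits could be optimized together, mirroring the joint chapterization-and-summarization process described above; the masked-modeling and chapter-prediction pre-training tasks would reuse the same encoder stack.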
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A, "A Multimodal Transformer Approach to Joint Video Chapterization and Summarization", Technical Disclosure Commons, (November 17, 2025)
https://www.tdcommons.org/dpubs_series/8885