Inventor(s)

N/A

Abstract

This disclosure describes a system for automated video chapterization and summarization that can use a large multimodal transformer architecture. The system can process a video's visual frames, audio track, and speech transcript, and may employ a hierarchical attention mechanism to capture both local and global dependencies across modalities. The model may be pre-trained on a dataset of chapterized videos using techniques such as masked modeling and a chapter prediction task. A classification head can be used to predict chapter boundaries, and a decoder network can generate a summary for each chapter. Because chapters and summaries are generated in a joint process, the approach may assist viewers in navigating and comprehending the content of long-form videos.
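
As a rough illustration of how the output of the classification head might be turned into chapters, the sketch below converts per-segment boundary probabilities into chapter time spans. It is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `boundaries_to_chapters`, the fixed segment length, the probability threshold, and the minimum chapter length are all hypothetical and do not come from the disclosure.

```python
from typing import List, Tuple


def boundaries_to_chapters(
    boundary_probs: List[float],
    segment_seconds: float = 10.0,   # assumed fixed segment length
    threshold: float = 0.5,          # assumed decision threshold
    min_chapter_segments: int = 2,   # assumed minimum gap between chapter starts
) -> List[Tuple[float, float]]:
    """Convert per-segment boundary probabilities into (start, end) spans in seconds.

    boundary_probs[i] is taken to be the predicted probability that a new
    chapter begins at segment i (a hypothetical output format).
    """
    n = len(boundary_probs)
    if n == 0:
        return []

    starts = [0]  # the first chapter always begins at segment 0
    for i in range(1, n):
        if boundary_probs[i] < threshold:
            continue
        # Greedily skip boundaries that fall too close to the previous chapter start.
        if i - starts[-1] < min_chapter_segments:
            continue
        starts.append(i)

    # Each chapter ends where the next begins; the last one ends with the video.
    ends = starts[1:] + [n]
    return [(s * segment_seconds, e * segment_seconds) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.7, 0.1]
    for start, end in boundaries_to_chapters(probs):
        print(f"chapter: {start:.0f}s to {end:.0f}s")
```

In a joint system of the kind described, each resulting span could then be passed to the decoder network to produce a per-chapter summary; that step is omitted here because the disclosure does not specify its interface.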

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
