In educational videos, the speaker often presents a set of slides that serve as logical cues and content markers. Video platforms currently do not make use of these cues, even though they are already present in the video. This disclosure describes techniques that use slide content to help viewers navigate such videos more naturally. With user permission, computer vision techniques are applied to video content to detect whether it includes a presentation, to track individual slide changes, and to recognize the content displayed on each slide. The recognized slide content is then used to improve speech recognition and captioning, to enhance video search, and to provide improved user interfaces, e.g., a table of contents for the video.
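
As a rough illustration of the slide-tracking and table-of-contents steps, the following minimal sketch flags a slide change whenever consecutive frames differ by more than a threshold, then pairs each slide's start time with a title. It is a sketch under stated assumptions, not the method of this disclosure: plain frame differencing stands in for whatever computer vision technique is actually used, frames are assumed to be flattened grayscale pixel lists, the threshold value is arbitrary, and the slide titles are assumed to come from a separate text recognition step that is not shown. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[int]  # assumption: flattened grayscale pixels, values 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-sized frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_slide_changes(
    frames: Sequence[Frame], fps: float, threshold: float = 20.0
) -> List[float]:
    """Return the timestamps (in seconds) at which a new slide appears to start.

    The first frame always starts a slide. The threshold is an arbitrary
    illustrative value, not one taken from the disclosure.
    """
    if not frames:
        return []
    starts = [0.0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            starts.append(i / fps)
    return starts


def format_timestamp(seconds: float) -> str:
    """Format a time offset as MM:SS, truncating fractional seconds."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_table_of_contents(
    starts: Sequence[float], titles: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair each slide start time with its title (titles assumed to come from OCR)."""
    return [(format_timestamp(t), title) for t, title in zip(starts, titles)]


# Usage: five tiny synthetic frames sampled at one frame per second.
frames = [[0] * 4, [0] * 4, [200] * 4, [200] * 4, [50] * 4]
starts = detect_slide_changes(frames, fps=1.0)
toc = build_table_of_contents(starts, ["Intro", "Methods", "Results"])
```

A real system would likely need something more robust than a global pixel difference, for example to ignore the speaker moving in front of the slide or gradual slide animations.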

