Pre-recorded videos provide a non-interactive viewing experience that is suboptimal for learning and engagement. This disclosure describes techniques that leverage large language models (LLMs) and generative video models to enable viewers of an informational or educational video to pause the video and ask questions of the presenter (the person in the video). The question is analyzed, and an answer is generated using an LLM. The LLM may be fine-tuned to the persona of the presenter such that the generated answer matches their style. The answer is delivered with the face and voice of the presenter in the original video, while clearly indicating that it is synthesized. For example, a voice model renders the LLM-generated answer in the presenter’s voice, and a generative video model produces video of the AI-simulated presenter that is lip-synchronized to the generated audio. When the Q&A interaction is complete, the viewer can resume playback of the source video. In this manner, a non-interactive video is transformed into a conversational, audiovisual experience.
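
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: `Player`, `SynthesizedAnswer`, `answer_question`, and the three model callables (`generate_text`, `synthesize_voice`, `render_lip_sync`) are hypothetical names, and the stubs in the demo stand in for a real LLM, voice model, and lip-sync video model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SynthesizedAnswer:
    """Answer clip shown to the viewer (hypothetical structure)."""
    text: str
    audio: bytes
    video: bytes
    is_synthesized: bool = True  # the UI should always disclose this


@dataclass
class Player:
    """Stand-in for the video player's pause/resume state."""
    position_s: float = 0.0
    playing: bool = True

    def pause(self) -> float:
        self.playing = False
        return self.position_s

    def resume(self) -> None:
        self.playing = True


def answer_question(
    player: Player,
    question: str,
    transcript: str,
    generate_text: Callable[[str], str],       # assumed: persona-tuned LLM
    synthesize_voice: Callable[[str], bytes],  # assumed: presenter voice model
    render_lip_sync: Callable[[bytes], bytes], # assumed: generative video model
) -> SynthesizedAnswer:
    # 1. Pause the source video and note where the viewer stopped.
    paused_at = player.pause()
    # 2. Ask the LLM for an answer grounded in the video's transcript.
    prompt = (
        f"Video transcript:\n{transcript}\n\n"
        f"Viewer paused at {paused_at:.0f}s and asked: {question}\n"
        "Answer in the presenter's style."
    )
    text = generate_text(prompt)
    # 3. Render the answer in the presenter's voice.
    audio = synthesize_voice(text)
    # 4. Generate presenter video lip-synchronized to that audio.
    video = render_lip_sync(audio)
    return SynthesizedAnswer(text=text, audio=audio, video=video)


if __name__ == "__main__":
    # Stub models for demonstration only.
    player = Player(position_s=42.0)
    answer = answer_question(
        player,
        question="Can you explain that last step again?",
        transcript="(transcript of the source video)",
        generate_text=lambda prompt: "Sure, here is a recap.",
        synthesize_voice=lambda text: text.encode("utf-8"),
        render_lip_sync=lambda audio: b"video:" + audio,
    )
    print(answer.text, "| synthesized:", answer.is_synthesized)
    player.resume()  # viewer returns to the source video where they left off
```

Passing the models in as callables keeps the orchestration independent of any particular LLM, voice, or video backend, and the `is_synthesized` flag gives the player a single place to drive the "AI-generated" indication.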

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.