Abstract
Systems and methods for text-to-video conversion are provided. The system may receive text representing a natural language description for generating a video and may provide the text as input to a neural network. The neural network may include a first component comprising a text-to-image model trained to generate one or more images based on an input text. The neural network may further include a second component comprising one or more spatiotemporal layers trained to generate a video based on the one or more images generated by the first component. The neural network may further include a third component comprising a frame interpolation network trained to increase the number of frames in the video generated by the second component. The neural network may further include a fourth component configured to perform super-resolution across spatial and temporal dimensions. The system may execute the neural network to generate the video representing the text.
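The data flow through the four components described above can be illustrated with a shape-level sketch. This is a minimal illustration only, not the disclosed implementation: every function name, array size, and default below is hypothetical, and each trained network is replaced by a simple NumPy stand-in (random images, neighbor-frame averaging, midpoint interpolation, and nearest-neighbor upsampling). In particular, the super-resolution stand-in upsamples only the spatial dimensions, whereas the disclosed fourth component operates across both spatial and temporal dimensions.

```python
import numpy as np


def text_to_images(text, num_frames=4, size=8, seed=0):
    """Stand-in for the text-to-image model (hypothetical).

    Ignores the text and returns random images shaped
    (num_frames, height, width, channels).
    """
    rng = np.random.default_rng(seed)
    return rng.random((num_frames, size, size, 3))


def spatiotemporal_layers(images):
    """Stand-in for the spatiotemporal layers (hypothetical).

    Averages each frame with its temporal neighbors so that the
    independent images become a temporally smoother sequence.
    """
    # Repeat the first and last frames so every frame has two neighbors.
    padded = np.concatenate([images[:1], images, images[-1:]], axis=0)
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


def interpolate_frames(video):
    """Stand-in for the frame interpolation network (hypothetical).

    Inserts the midpoint between consecutive frames: T frames -> 2T - 1.
    """
    mids = (video[:-1] + video[1:]) / 2.0
    out = np.empty((2 * video.shape[0] - 1,) + video.shape[1:], dtype=video.dtype)
    out[0::2] = video  # original frames at even indices
    out[1::2] = mids   # interpolated frames at odd indices
    return out


def super_resolve(video, scale=2):
    """Stand-in for super-resolution (hypothetical, spatial only).

    Nearest-neighbor upsampling of height and width by `scale`.
    """
    return video.repeat(scale, axis=1).repeat(scale, axis=2)


def generate_video(text):
    """Chain the four stand-in components in the order the abstract lists them."""
    images = text_to_images(text)          # component 1
    video = spatiotemporal_layers(images)  # component 2
    video = interpolate_frames(video)      # component 3
    return super_resolve(video)            # component 4


if __name__ == "__main__":
    print(generate_video("a dog running on a beach").shape)
```

With the hypothetical defaults, four 8x8 images become a 7-frame sequence after interpolation and a 16x16 frame size after upsampling, which shows how each stage changes the tensor shape without claiming anything about the trained models themselves.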
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
"TEXT-TO-VIDEO GENERATION USING NEURAL NETWORKS", Technical Disclosure Commons, (February 11, 2024)
https://www.tdcommons.org/dpubs_series/6679