Systems and methods for text-to-video conversion are provided. The system may receive text representing a natural language description of a video to be generated, and may provide the text as input to a neural network. The neural network may include a first component comprising a text-to-image model trained to generate one or more images based on an input text. The neural network may further include a second component comprising one or more spatiotemporal layers trained to generate a video based on the one or more images generated by the first component. The neural network may further include a third component comprising a frame interpolation network trained to increase the number of video frames of the video generated by the second component. The neural network may further include a fourth component configured to perform super-resolution across spatial and temporal dimensions. The system may then execute the neural network to generate a video representing the text.
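
The order in which the four components may be chained can be illustrated with a minimal, shape-only sketch. It is not the disclosed implementation: every function name, default value, and the `Clip` type below is a hypothetical placeholder, no model is actually run, and only the frame count and resolution are tracked to show how each stage could change the output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Shape-only stand-in for an image batch or video (hypothetical type)."""
    prompt: str
    frames: int
    height: int
    width: int


def text_to_image(prompt: str, n_images: int = 1, size: int = 64) -> Clip:
    # First component: placeholder for a trained text-to-image model.
    # The image count and 64x64 size are assumed values.
    return Clip(prompt, n_images, size, size)


def spatiotemporal_layers(images: Clip, n_frames: int = 16) -> Clip:
    # Second component: placeholder for the spatiotemporal layers that
    # turn the generated image(s) into a short video. 16 frames is assumed.
    return Clip(images.prompt, n_frames, images.height, images.width)


def frame_interpolation(video: Clip, inserted_per_gap: int = 4) -> Clip:
    # Third component: placeholder for the frame interpolation network.
    # Inserting k frames into each of the (frames - 1) gaps between
    # neighbouring frames raises the frame count accordingly.
    frames = video.frames + (video.frames - 1) * inserted_per_gap
    return Clip(video.prompt, frames, video.height, video.width)


def super_resolution(video: Clip, spatial_scale: int = 4) -> Clip:
    # Fourth component: placeholder for super-resolution applied across
    # spatial and temporal dimensions. This sketch only records the
    # spatial upscaling and leaves the frame count unchanged (assumption).
    return Clip(
        video.prompt,
        video.frames,
        video.height * spatial_scale,
        video.width * spatial_scale,
    )


def generate_video(prompt: str) -> Clip:
    # Chain the four components in the order described above.
    images = text_to_image(prompt)
    video = spatiotemporal_layers(images)
    video = frame_interpolation(video)
    return super_resolution(video)


if __name__ == "__main__":
    print(generate_video("a dog running on a beach"))
```

With the assumed defaults, one 64x64 image becomes a 16-frame clip, interpolation raises that to 76 frames, and super-resolution yields a 256x256 output.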

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.