Image generation models enable users to generate images from user-provided instructions. However, such models cannot be invoked with voice commands and cannot update a previously generated image based on a follow-up instruction. This disclosure describes techniques that enable users to obtain and refine images by iteratively interacting with an image generation model in real time, e.g., via voice commands to a virtual assistant. The techniques enable users to use their voice and imagination for artistic visual expression, and can be provided via a virtual assistant on a smart speaker, smartphone, or other device. The techniques incorporate appropriateness checks for the input query and/or the output image, which help keep the interactive experience safe and trustworthy.
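
The iterative interaction described above can be illustrated with a minimal sketch. It assumes that each voice command has already been transcribed to text, and every name in it (`Image`, `BLOCKED_TERMS`, `is_query_appropriate`, `is_image_appropriate`, `generate_image`, `refine_session`) is a hypothetical placeholder rather than part of any actual assistant or model API. The image generation model is replaced by a stub that only records which instructions have been applied, so the sketch shows the control flow (check the query, generate or update, check the output) and not real image synthesis.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Image:
    """Stand-in for a generated image: records the instructions applied so far."""
    prompts: List[str]


# Hypothetical blocklist; a real system would use a far richer policy check.
BLOCKED_TERMS = {"gore"}


def is_query_appropriate(query: str) -> bool:
    """Input-side check (assumed): reject queries containing a blocked term."""
    lowered = query.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def is_image_appropriate(image: Image) -> bool:
    """Output-side check (assumed): placeholder that accepts every image."""
    return True


def generate_image(query: str, prior: Optional[Image]) -> Image:
    """Stub for the image generation model.

    With no prior image it starts a new one; otherwise it "updates" the prior
    image by appending the new instruction to its history.
    """
    history = list(prior.prompts) if prior is not None else []
    return Image(prompts=history + [query])


def refine_session(queries: Iterable[str]) -> Optional[Image]:
    """Apply transcribed commands in order, keeping the latest accepted image."""
    current: Optional[Image] = None
    for query in queries:
        if not is_query_appropriate(query):
            continue  # skip the command and keep the current image unchanged
        candidate = generate_image(query, current)
        if is_image_appropriate(candidate):
            current = candidate
    return current


if __name__ == "__main__":
    final = refine_session(
        ["a red barn at sunset", "add gore", "make the sky purple"]
    )
    print(final)
```

In this sketch a rejected command leaves the current image untouched, so the user can continue refining from the last accepted result; whether a real assistant would instead prompt the user to rephrase is a design choice the sketch leaves open.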

This work is licensed under a Creative Commons Attribution 4.0 License.