A method for providing a text-to-speech framework that generates speech mimicking a user’s voice is disclosed. The method receives sample speech from the user and uses a neural network to generate speaker embeddings specific to the user. The speaker embeddings are used to fine-tune a generative vocoder. The fine-tuned generative vocoder can then generate speech that mimics the speech patterns and vocal characteristics of the user, so that text entered by the user is converted to audio that sounds like the user’s speech. The generated audio is then transmitted to other participants in a virtual meeting.
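
The data flow described above (enroll from sample speech, fine-tune a vocoder on the resulting embedding, then convert text to audio and transmit it) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than part of the disclosed method: `VoicePipeline`, `embed_speaker`, and the three injected callables (`encoder`, `fine_tune`, `send`) are assumptions made for illustration. In particular, averaging per-utterance embeddings and L2-normalizing the result is one common way to form a single speaker embedding, and is assumed here, not stated in the disclosure.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Waveform = Sequence[float]   # raw audio samples (hypothetical representation)
Embedding = List[float]      # fixed-length speaker embedding
Vocoder = Callable[[str], List[float]]  # text in, audio samples out


def embed_speaker(samples: Sequence[Waveform],
                  encoder: Callable[[Waveform], Embedding]) -> Embedding:
    """Combine per-utterance embeddings into one speaker embedding.

    Assumption: the mean of the per-utterance embeddings, L2-normalized.
    """
    if not samples:
        raise ValueError("at least one speech sample is required")
    vectors = [encoder(sample) for sample in samples]
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0  # avoid division by zero
    return [x / norm for x in mean]


@dataclass
class VoicePipeline:
    encoder: Callable[[Waveform], Embedding]   # stand-in for the neural network
    fine_tune: Callable[[Embedding], Vocoder]  # stand-in for vocoder fine-tuning
    send: Callable[[List[float]], None]        # stand-in for meeting transport
    embedding: Optional[Embedding] = None
    vocoder: Optional[Vocoder] = None

    def enroll(self, samples: Sequence[Waveform]) -> None:
        """Generate the user's embedding and fine-tune the vocoder on it."""
        self.embedding = embed_speaker(samples, self.encoder)
        self.vocoder = self.fine_tune(self.embedding)

    def speak(self, text: str) -> List[float]:
        """Convert text to audio with the fine-tuned vocoder and transmit it."""
        if self.vocoder is None:
            raise RuntimeError("enroll() must be called before speak()")
        audio = self.vocoder(text)
        self.send(audio)
        return audio
```

In this sketch the speaker encoder, the fine-tuning routine, and the transport are passed in as callables so that the ordering of the steps is visible without committing to any particular model architecture or meeting platform.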

