Chatbots and other conversational agent interfaces use large language models (LLMs) to provide text responses to user queries. However, such chatbots cannot receive audio input or provide generated audio as a response. This disclosure describes techniques to augment an LLM with an interface to an audio generation model. The LLM is fine-tuned to call an API that accesses the audio generation model when an input query requests a response in audio form. The fine-tuned LLM performs reasoning tasks and generates prompts for the audio generation model. The user-provided audio input and the LLM-generated prompts are fed to the audio generation model, which generates audio. The output audio is then analyzed to produce a textual description of its attributes. The LLM can perform multiple rounds of reasoning, prompt generation, and calls to the audio generation model, with each round informed by previously generated audio and the associated textual descriptions. The final audio output generated by the audio generation model is provided as the response to the user query.
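The multi-round loop described above can be illustrated with a minimal Python sketch. It is only one possible reading of the flow, not the disclosed implementation: the disclosure does not specify an API, so every name here (`run_audio_agent`, `Round`, `llm_step`, `generate_audio`, `describe_audio`, `max_rounds`) is hypothetical, and the three `toy_*` functions are trivial stand-ins for the fine-tuned LLM, the audio generation model, and the audio analysis step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Round:
    """One round of the loop: the prompt sent, the audio returned, its description."""
    prompt: str
    audio: bytes
    description: str


def run_audio_agent(
    query: str,
    input_audio: Optional[bytes],
    llm_step: Callable[[str, List[Round]], Optional[str]],
    generate_audio: Callable[[str, Optional[bytes]], bytes],
    describe_audio: Callable[[bytes], str],
    max_rounds: int = 3,  # assumed cap; the disclosure does not state one
) -> Tuple[Optional[bytes], List[Round]]:
    """Alternate LLM reasoning and audio generation until the LLM stops or the cap is hit."""
    history: List[Round] = []
    audio = input_audio  # the first round conditions on the user-provided audio
    for _ in range(max_rounds):
        # The LLM reasons over the query and prior descriptions, then emits a prompt
        # for the audio model, or None when no further call is needed.
        prompt = llm_step(query, history)
        if prompt is None:
            break
        audio = generate_audio(prompt, audio)
        # The generated audio is analyzed into a textual description for the next round.
        history.append(Round(prompt, audio, describe_audio(audio)))
    final_audio = history[-1].audio if history else None
    return final_audio, history


# Hypothetical stand-ins so the sketch runs without any real model.
def toy_llm_step(query: str, history: List[Round]) -> Optional[str]:
    if not history:
        return f"generate: {query}"
    if "refined" not in history[-1].description:
        return f"refine: {history[-1].description}"
    return None


def toy_generate_audio(prompt: str, audio: Optional[bytes]) -> bytes:
    # Fake "audio": append the prompt's action tag to the prior bytes.
    return (audio or b"") + b"|" + prompt.split(":")[0].encode()


def toy_describe_audio(audio: bytes) -> str:
    return "refined clip" if b"refine" in audio else "rough clip"


if __name__ == "__main__":
    final, rounds = run_audio_agent(
        "add rain sounds", b"voice",
        toy_llm_step, toy_generate_audio, toy_describe_audio,
    )
    for r in rounds:
        print(r.prompt, "->", r.description)
```

Passing the three components in as callables keeps the control flow independent of any particular model or API; in a real system the LLM itself would decide when to call the audio generation API, which `llm_step` returning `None` only approximates.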

