Defensive Publications Series

Augmenting Large Language Models with Audio Generation Capabilities

Ágoston Weisz
Timo Denk
Mauricio Zuluaga
Alessandro Agostini
Christian Frank
Olivier Siegenthaler

Abstract

Chatbots and other conversational agent interfaces use large language models (LLMs) to provide text responses to user queries. However, such chatbots cannot receive audio input or provide generated audio as a response. This disclosure describes techniques to augment an LLM with an interface to an audio generation model. The LLM is fine-tuned to call the audio generation model through an API when an input query requests a response in audio form. The fine-tuned LLM performs reasoning tasks and generates prompts for the audio generation model. The user-provided audio input and the LLM-generated prompts are fed to the audio generation model, which generates audio. The output audio is analyzed to produce a textual description of its attributes. The LLM can perform multiple rounds of reasoning, prompt generation, and calls to the audio generation model, based on previously generated audio and the associated textual descriptions. The final audio output generated by the audio generation model is provided as the response to the user query.
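
The loop described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the disclosure: the `AudioClip` and `Step` types, the three callables (`llm_next_prompt`, `generate_audio`, `describe_audio`), the `max_rounds` cap, and the convention that the LLM returns `None` once it judges the latest audio to answer the query. The sketch also assumes that later rounds condition on the previously generated audio rather than on the original user input. The toy stand-ins at the bottom only exercise the control flow; they do not model a real LLM or audio generation model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class AudioClip:
    """Hypothetical placeholder for audio data (a real system would hold samples)."""
    label: str


@dataclass
class Step:
    """One round: the prompt sent, the audio produced, and its textual description."""
    prompt: str
    audio: AudioClip
    description: str


def answer_with_audio(
    query: str,
    user_audio: Optional[AudioClip],
    llm_next_prompt: Callable[[str, List[Step]], Optional[str]],
    generate_audio: Callable[[str, Optional[AudioClip]], AudioClip],
    describe_audio: Callable[[AudioClip], str],
    max_rounds: int = 3,
) -> Tuple[Optional[AudioClip], List[Step]]:
    """Run up to max_rounds of reasoning, prompting, generation, and analysis."""
    history: List[Step] = []
    conditioning = user_audio  # the first round uses the user-provided audio
    for _ in range(max_rounds):
        # The LLM reasons over the query and earlier descriptions.
        prompt = llm_next_prompt(query, history)
        if prompt is None:  # assumed stop signal: the LLM is satisfied
            break
        audio = generate_audio(prompt, conditioning)
        description = describe_audio(audio)  # attributes as text for the LLM
        history.append(Step(prompt, audio, description))
        conditioning = audio  # assumption: later rounds refine the last output
    final = history[-1].audio if history else None
    return final, history


# Toy stand-ins, for illustration only.
def toy_llm(query: str, history: List[Step]) -> Optional[str]:
    if not history:
        return f"Generate: {query}"
    if "quiet" in history[-1].description:
        return "Same piece, louder"
    return None


def toy_audio_model(prompt: str, input_audio: Optional[AudioClip]) -> AudioClip:
    return AudioClip(label=prompt)


def toy_describer(audio: AudioClip) -> str:
    return "loud" if "louder" in audio.label else "quiet"


if __name__ == "__main__":
    final, history = answer_with_audio(
        "calm piano", None, toy_llm, toy_audio_model, toy_describer
    )
    for step in history:
        print(step.prompt, "->", step.description)
```

Passing the three components as callables keeps the loop independent of any particular model or API, so a fine-tuned LLM, an audio generation service, and an audio analysis model could each be substituted without changing the control flow.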

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Weisz, Ágoston; Denk, Timo; Zuluaga, Mauricio; Agostini, Alessandro; Frank, Christian; and Siegenthaler, Olivier, "Augmenting Large Language Models with Audio Generation Capabilities", Technical Disclosure Commons, (February 08, 2024)
https://www.tdcommons.org/dpubs_series/6668
