Inventor(s)

N/A

Abstract

Current techniques for evaluating multimodal large language models (LLMs) rely on text-centric benchmarks, which can be inappropriate for evaluating conversational agents that operate with an audio-in interface. This creates a modality gap, as such benchmarks may not capture the nuances of spoken, multi-turn dialog, leading to inaccurate performance assessments. This disclosure describes a systematic pipeline designed to bridge the modality gap. The pipeline enables objective and scalable evaluation of an audio-in conversational agent, with a specific focus on instruction-following and function-calling capabilities. Per the techniques of this disclosure, prompts (e.g., from the evaluation set) are filtered to remove criteria irrelevant to audio-only scenarios. Modality-specific criteria are then generated for each filtered prompt. Parallel evaluation is conducted using both text and audio inputs to isolate performance degradation that is specific to the audio modality. The described techniques enable calculation of an accurate, modality-corrected performance score for the LLM under evaluation.
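The following is a minimal Python sketch of the pipeline outlined in the abstract: filtering rubric criteria that do not apply to an audio-only scenario, running the same prompts through a text interface and an audio interface, and normalizing the audio score by the text score to obtain a modality-corrected result. All names (Prompt, filter_audio_relevant, evaluate, and the run_text_turn / run_audio_turn / judge callables) are hypothetical placeholders introduced here for illustration, not an API defined in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Prompt:
    text: str
    criteria: list[str]  # rubric items an evaluator (e.g., an LLM judge) checks against

def filter_audio_relevant(prompts: list[Prompt]) -> list[Prompt]:
    """Drop rubric criteria that cannot apply in an audio-only scenario,
    e.g., formatting or visual-layout requirements (assumed keyword filter)."""
    visual_markers = ("format", "markdown", "table", "bullet")
    filtered = []
    for p in prompts:
        kept = [c for c in p.criteria
                if not any(m in c.lower() for m in visual_markers)]
        if kept:
            filtered.append(Prompt(p.text, kept))
    return filtered

def evaluate(prompts: list[Prompt],
             run_text_turn: Callable[[str], str],
             run_audio_turn: Callable[[str], str],
             judge: Callable[[str, list[str]], float]) -> dict:
    """Run each prompt through both modalities and compute a
    modality-corrected score: audio accuracy normalized by text accuracy,
    isolating degradation specific to the audio path."""
    text_scores, audio_scores = [], []
    for p in prompts:
        text_scores.append(judge(run_text_turn(p.text), p.criteria))
        audio_scores.append(judge(run_audio_turn(p.text), p.criteria))
    text_acc = sum(text_scores) / len(text_scores)
    audio_acc = sum(audio_scores) / len(audio_scores)
    return {
        "text_accuracy": text_acc,
        "audio_accuracy": audio_acc,
        "modality_corrected": audio_acc / text_acc if text_acc else 0.0,
    }
```

In this sketch, the callables stand in for the agent's text endpoint, a text-to-speech plus audio endpoint path, and a rubric-based judge; the normalized ratio is one plausible way to express a modality-corrected score, under the assumption that text performance serves as the reference.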

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
