Inventor(s)

N/A

Abstract

Current techniques for evaluating multimodal large language models (LLMs) rely on text-centric benchmarks, which can be inappropriate for evaluating conversational agents that operate with an audio-in interface. This creates a modality gap, as such benchmarks may not capture the nuances of spoken, multi-turn dialog, leading to inaccurate performance assessments. This disclosure describes a systematic pipeline designed to bridge the modality gap. The pipeline enables objective and scalable evaluation of an audio-in conversational agent, with a specific focus on instruction-following and function-calling capabilities. Per the techniques of this disclosure, prompts (e.g., from the evaluation set) are filtered to remove criteria irrelevant to audio-only scenarios. Modality-specific criteria are then generated for each filtered prompt. Parallel evaluation is conducted using both text and audio inputs to isolate performance degradation that is specific to the audio modality. The described techniques enable calculation of an accurate, modality-corrected performance score for the LLM under evaluation.
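The following is a minimal Python sketch of the pipeline outlined in the abstract: filtering rubric criteria that do not apply to an audio-only scenario, running the same prompts through a text interface and an audio interface, and normalizing the audio score by the text score to obtain a modality-corrected result. All names (Prompt, filter_audio_relevant, evaluate, and the run_text_turn / run_audio_turn / judge callables) are hypothetical placeholders introduced here for illustration, not an API defined in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Prompt:
    text: str
    criteria: list[str]  # rubric items an evaluator (e.g., an LLM judge) checks against

def filter_audio_relevant(prompts: list[Prompt]) -> list[Prompt]:
    """Drop rubric criteria that cannot apply in an audio-only scenario,
    e.g., formatting or visual-layout requirements (assumed keyword filter)."""
    visual_markers = ("format", "markdown", "table", "bullet")
    filtered = []
    for p in prompts:
        kept = [c for c in p.criteria
                if not any(m in c.lower() for m in visual_markers)]
        if kept:
            filtered.append(Prompt(p.text, kept))
    return filtered

def evaluate(prompts: list[Prompt],
             run_text_turn: Callable[[str], str],
             run_audio_turn: Callable[[str], str],
             judge: Callable[[str, list[str]], float]) -> dict:
    """Run each prompt through both modalities and compute a
    modality-corrected score: audio accuracy normalized by text accuracy,
    isolating degradation specific to the audio path."""
    text_scores, audio_scores = [], []
    for p in prompts:
        text_scores.append(judge(run_text_turn(p.text), p.criteria))
        audio_scores.append(judge(run_audio_turn(p.text), p.criteria))
    text_acc = sum(text_scores) / len(text_scores)
    audio_acc = sum(audio_scores) / len(audio_scores)
    return {
        "text_accuracy": text_acc,
        "audio_accuracy": audio_acc,
        "modality_corrected": audio_acc / text_acc if text_acc else 0.0,
    }
```

In this sketch, the callables stand in for the agent's text endpoint, a text-to-speech plus audio endpoint path, and a rubric-based judge; the normalized ratio is one plausible way to express a modality-corrected score, under the assumption that text performance serves as the reference.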

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
