Abstract

Evaluating foundation models with a single aggregated metric, such as a weighted sum, has significant limitations. Such aggregates can mask catastrophic regressions in individual capabilities, impose arbitrary trade-offs between metrics, and fail to distinguish true (Pareto) improvements, in which every metric in the set changes by ≥0, from changes in which some metrics decline. This disclosure details a method for robust metric aggregation. The technique computes a composite score using a more Pareto-sensitive base aggregator, such as the harmonic mean or geometric mean, which inherently penalizes low sub-metric values. A trade-off penalty is calculated from negative deltas relative to a baseline, and a catastrophe detector flags sub-metrics that fall below predefined thresholds. The aim is to provide a more reliable single quality indicator that penalizes trade-offs and clearly signals catastrophic performance drops, thereby mitigating the risk of metric gaming during model optimization.
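The three components named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the base aggregate, the trade-off penalty, and the catastrophe flags are combined, so everything specific here is an assumption made for illustration: the function name robust_aggregate, the penalty_weight parameter, the choice of the harmonic mean as the base, the use of summed regressions as the penalty, and the subtraction of the penalty from the base score. Sub-metrics are assumed to be non-negative scores where higher is better.

```python
from dataclasses import dataclass
from statistics import harmonic_mean
from typing import Dict, List


@dataclass
class AggregateResult:
    score: float             # base aggregate minus the trade-off penalty
    base: float              # harmonic mean of the current sub-metrics
    penalty: float           # weighted sum of regressions against the baseline
    catastrophes: List[str]  # sub-metrics that fell below their threshold


def robust_aggregate(
    current: Dict[str, float],
    baseline: Dict[str, float],
    thresholds: Dict[str, float],
    penalty_weight: float = 1.0,  # hypothetical knob, not from the disclosure
) -> AggregateResult:
    names = sorted(current)

    # Base aggregator: the harmonic mean is dominated by the lowest values,
    # so one weak sub-metric pulls the composite down more than in a plain mean.
    base = harmonic_mean([current[n] for n in names])

    # Trade-off penalty: only negative deltas (regressions) contribute.
    regressions = [max(0.0, baseline[n] - current[n]) for n in names]
    penalty = penalty_weight * sum(regressions)

    # Catastrophe detector: flag sub-metrics below a predefined floor.
    # Sub-metrics without a threshold are never flagged.
    catastrophes = [
        n for n in names if current[n] < thresholds.get(n, float("-inf"))
    ]

    return AggregateResult(base - penalty, base, penalty, catastrophes)


# Illustrative, made-up scores: two sub-metrics improve, one collapses.
baseline = {"reasoning": 0.80, "coding": 0.70, "safety": 0.90}
current = {"reasoning": 0.85, "coding": 0.75, "safety": 0.40}
thresholds = {"safety": 0.50}

result = robust_aggregate(current, baseline, thresholds)
print(result)
```

In this made-up example the drop in the safety score lowers the harmonic-mean base below the arithmetic mean, adds a penalty, and is also reported by name, so the regression cannot be hidden by the gains elsewhere. A change in which no sub-metric declines incurs no penalty under this sketch.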

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Wang, Jingtao, "Robust Metric Aggregation for Foundation Models", Technical Disclosure Commons, (November 20, 2025)
https://www.tdcommons.org/dpubs_series/8912

Technical Disclosure Commons

Defensive Publications Series

Robust Metric Aggregation for Foundation Models
