Abstract
Evaluating foundation models with a single aggregated metric, such as a weighted sum, has significant limitations. Such aggregates can mask catastrophic regressions in individual capabilities, impose arbitrary trade-offs between metrics, and fail to distinguish Pareto improvements, in which every metric's change from the baseline is ≥ 0, from changes in which some metrics decline. This disclosure details a method for robust metric aggregation. The technique computes a composite score using a Pareto-sensitive function, such as the harmonic or geometric mean, as the base aggregator; these functions inherently penalize low sub-metric values. A trade-off penalty is calculated from negative deltas relative to a baseline, and a catastrophe detector flags sub-metrics that fall below predefined thresholds. The goal is a more reliable single quality indicator that penalizes trade-offs and clearly signals catastrophic performance drops, thereby mitigating the risk of metric gaming during model optimization.
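The three components described above can be illustrated with a minimal Python sketch. It is not the disclosure's implementation: the function names, the `penalty_weight` parameter, the choice of the harmonic mean, the assumption that sub-metrics lie in [0, 1], and the decision to subtract the penalty from the base score are all assumptions made for illustration, since the abstract does not specify how the pieces are combined.

```python
def harmonic_mean(values):
    """Base aggregator: dominated by the lowest sub-metric values."""
    # Assumption: any non-positive sub-metric collapses the score to zero.
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def robust_score(current, baseline, floors, penalty_weight=1.0):
    """Aggregate sub-metrics (dicts keyed by metric name, values in [0, 1]).

    Hypothetical combination rule: base score minus a weighted sum of
    regressions, clipped at zero.
    """
    base = harmonic_mean(list(current.values()))

    # Trade-off penalty: only negative deltas from the baseline contribute.
    regressions = {
        name: baseline[name] - value
        for name, value in current.items()
        if value < baseline[name]
    }
    penalty = penalty_weight * sum(regressions.values())

    # Catastrophe detector: sub-metrics below their predefined thresholds.
    catastrophes = sorted(
        name for name, value in current.items() if value < floors[name]
    )

    return {
        "score": max(0.0, base - penalty),
        "base": base,
        "penalty": penalty,
        "is_pareto_improvement": not regressions,
        "catastrophes": catastrophes,
    }


# Illustrative values only: two capabilities improve while one collapses.
baseline = {"reasoning": 0.8, "coding": 0.8, "safety": 0.8}
candidate = {"reasoning": 0.9, "coding": 0.9, "safety": 0.4}
floors = {"reasoning": 0.5, "coding": 0.5, "safety": 0.5}

result = robust_score(candidate, baseline, floors)
print(result)
```

With these made-up numbers, an unweighted arithmetic mean would report about 0.73, a modest drop from the baseline's 0.80. The harmonic mean gives roughly 0.64, the penalty removes a further 0.4, and the `safety` sub-metric is flagged as a catastrophe, so the regression is visible in both the score and the flag.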
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Wang, Jingtao, "Robust Metric Aggregation for Foundation Models", Technical Disclosure Commons, (November 20, 2025)
https://www.tdcommons.org/dpubs_series/8912