Inventor(s)

Jingtao Wang

Abstract

Evaluating foundation models with a single aggregated metric, such as a weighted sum, has significant limitations. Such aggregates can mask catastrophic regressions in individual capabilities, impose arbitrary trade-offs between metrics, and fail to distinguish Pareto improvements, where no sub-metric declines (every delta is ≥0), from changes where some metrics decline. This disclosure details a method for robust metric aggregation. The technique computes a composite score using a more Pareto-sensitive base aggregator (for example, the harmonic or geometric mean), which inherently penalizes low sub-metric values. A trade-off penalty is calculated from negative deltas relative to a baseline, and a catastrophe detector flags sub-metrics that fall below predefined thresholds. The aim is a more reliable single quality indicator that penalizes trade-offs and clearly signals catastrophic performance drops, thereby mitigating the risk of metric gaming during model optimization.
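
The three components described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several choices here are assumptions: the function name `robust_score`, the use of the harmonic mean as the base aggregator, a penalty equal to a hypothetical `penalty_weight` times the summed magnitude of regressions, and subtraction as the way the penalty enters the composite score. Sub-metrics are assumed to be non-negative and on a comparable scale.

```python
from statistics import harmonic_mean


def robust_score(current, baseline, thresholds, penalty_weight=1.0):
    """Illustrative composite score; not the disclosure's exact formula.

    current, baseline: dicts mapping sub-metric name -> non-negative value.
    thresholds: dict mapping sub-metric name -> minimum acceptable value.
    penalty_weight: assumed scaling factor for the trade-off penalty.
    """
    names = sorted(current)

    # Base aggregator: the harmonic mean is dominated by the lowest values,
    # so one weak sub-metric drags the score down.
    base = harmonic_mean([current[n] for n in names])

    # Trade-off penalty: only negative deltas from the baseline contribute.
    regressions = {
        n: baseline[n] - current[n] for n in names if current[n] < baseline[n]
    }
    penalty = penalty_weight * sum(regressions.values())

    # Catastrophe detector: flag sub-metrics below their threshold.
    catastrophes = [
        n for n in names if n in thresholds and current[n] < thresholds[n]
    ]

    return {
        "score": base - penalty,  # assumed combination rule
        "base": base,
        "penalty": penalty,
        "catastrophes": catastrophes,
        "no_regressions": not regressions,
    }


if __name__ == "__main__":
    baseline = {"coding": 0.5, "reasoning": 0.5}
    candidate = {"coding": 0.9, "reasoning": 0.1}
    print(robust_score(candidate, baseline, thresholds={"reasoning": 0.2}))
```

In this example the baseline and the candidate have the same arithmetic mean (0.5), so an unweighted average would report no change. The harmonic mean of the candidate is 0.18, the regression on `reasoning` adds a penalty, and the catastrophe detector flags `reasoning` for falling below its threshold.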

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
