Abstract

Systems, methods, and computer program products are provided for evaluating generative machine learning models. An example system includes a processor configured to determine queries and ground-truth answers associated with the queries. The processor is also configured to produce generated answers from the queries using the model under evaluation. The processor is further configured to input the queries, the ground-truth answers, and the generated answers to an evaluator model trained to evaluate the generated answers against the ground-truth answers according to a grading scale associated with the accuracy, honesty, and completeness of a generated answer. The processor is further configured to determine scores associated with the generated answers, reject a first subset of the generated answers based on a first subset of the scores, and provide a second subset of the generated answers based on a second subset of the scores.
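
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names `ScoredAnswer`, `evaluate_answers`, `toy_generate`, and `toy_grade` are hypothetical, the evaluator model is replaced by a trivial exact-match grader, and the use of a single score threshold to split rejected from provided answers is an assumption, since the abstract does not say how the two subsets are chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ScoredAnswer:
    """One query with its ground truth, the generated answer, and its score."""
    query: str
    ground_truth: str
    generated: str
    score: float


def evaluate_answers(
    pairs: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
    grade: Callable[[str, str, str], float],
    threshold: float,
) -> Tuple[List[ScoredAnswer], List[ScoredAnswer]]:
    """Score each generated answer and split the results into (provided, rejected).

    `generate` stands in for the model under evaluation and `grade` for the
    evaluator model. The threshold rule is an assumption made for illustration.
    """
    provided: List[ScoredAnswer] = []
    rejected: List[ScoredAnswer] = []
    for query, ground_truth in pairs:
        generated = generate(query)
        score = grade(query, ground_truth, generated)
        item = ScoredAnswer(query, ground_truth, generated, score)
        if score >= threshold:
            provided.append(item)
        else:
            rejected.append(item)
    return provided, rejected


# Hypothetical stand-in for the model under evaluation: canned replies only.
def toy_generate(query: str) -> str:
    canned = {"2+2?": "4", "Capital of France?": "Lyon"}
    return canned.get(query, "")


# Hypothetical stand-in for the evaluator model: exact match scores 1.0, else 0.0.
# A real evaluator would grade accuracy, honesty, and completeness on a scale.
def toy_grade(query: str, ground_truth: str, generated: str) -> float:
    same = generated.strip().lower() == ground_truth.strip().lower()
    return 1.0 if same else 0.0


pairs = [("2+2?", "4"), ("Capital of France?", "Paris")]
provided, rejected = evaluate_answers(pairs, toy_generate, toy_grade, threshold=0.5)
```

Passing the generator and grader as plain callables keeps the sketch independent of any particular model API; either one could be swapped for a call to a real model without changing the splitting logic.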

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
