Abstract

Evaluating the quality of search queries and subqueries can benefit information retrieval systems, but evaluation methods that rely on human raters may be time-consuming, costly, and difficult to scale. This disclosure describes a method and system for automating query quality evaluation that may use a prompted, off-the-shelf large language model (LLM). The process can begin by creating a small, curated "golden dataset" of positive and negative examples, for instance, from historical human ratings. This dataset can then be used to engineer a detailed prompt that guides an LLM to perform classification tasks on new query-subquery pairs, such as identifying ambiguity or relevance. The system can be validated against unseen historical data and synthetically generated examples to assess performance metrics such as precision and recall. This approach may provide a scalable alternative to manual rating for generating quality labels and can operate without model fine-tuning.
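The sketch below illustrates the kind of flow the abstract describes: few-shot examples from a golden dataset are folded into a classification prompt, an LLM labels new query-subquery pairs, and the labels are validated against held-out human ratings with precision and recall. The example data, the `call_llm` function, and the prompt wording are illustrative assumptions, not part of the disclosure; a real system would substitute its own golden dataset, prompt, and LLM client.

```python
# Minimal sketch of prompted-LLM query/subquery classification with validation.
# All example data and the call_llm() stub below are hypothetical placeholders.

GOLDEN_EXAMPLES = [
    # (query, subquery, label) tuples drawn from historical human ratings (illustrative).
    ("best hiking boots", "waterproof hiking boots", "relevant"),
    ("best hiking boots", "map of the boot of italy", "not_relevant"),
]

PROMPT_TEMPLATE = """You are rating whether a subquery preserves the intent of the original query.
Answer with exactly one word: relevant or not_relevant.

Examples:
{examples}

Query: {query}
Subquery: {subquery}
Answer:"""


def build_prompt(query: str, subquery: str) -> str:
    """Fold the golden-dataset examples into a few-shot classification prompt."""
    examples = "\n\n".join(
        f"Query: {q}\nSubquery: {s}\nAnswer: {label}" for q, s, label in GOLDEN_EXAMPLES
    )
    return PROMPT_TEMPLATE.format(examples=examples, query=query, subquery=subquery)


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an off-the-shelf LLM API."""
    raise NotImplementedError("Wire this to the chosen LLM client.")


def classify(query: str, subquery: str) -> str:
    """Ask the LLM for a label and normalize its free-text answer."""
    answer = call_llm(build_prompt(query, subquery)).strip().lower()
    return "relevant" if answer.startswith("relevant") else "not_relevant"


def precision_recall(pairs, true_labels):
    """Validate on held-out human-rated pairs, scoring the 'relevant' class."""
    predictions = [classify(q, s) for q, s in pairs]
    tp = sum(p == t == "relevant" for p, t in zip(predictions, true_labels))
    fp = sum(p == "relevant" and t != "relevant" for p, t in zip(predictions, true_labels))
    fn = sum(p != "relevant" and t == "relevant" for p, t in zip(predictions, true_labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```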

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
