Abstract

Some Large Language Model (LLM) evaluation frameworks provide performance metrics and diagnostic information but offer no mechanism to translate these insights into specific, actionable improvements. Developers are therefore left to manually interpret evaluation results and devise solutions, such as refining system prompts. A disclosed method relates to a closed-loop system that can automate aspects of this process. In some configurations, the system identifies low-performing outputs from an evaluation. An LLM-based component, referred to as an insight generator, then analyzes these failures to produce structured action items. A prompt tuner module uses these action items to iteratively modify the original system prompt. After each modification, new outputs are generated and re-evaluated, and the cycle repeats until a desired performance goal is met. This process provides a method for improving LLM performance by refining prompts based on empirical feedback.
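The following Python sketch illustrates one possible shape of the closed-loop cycle described above. It is an assumption-laden outline, not the disclosed implementation: the function names (generate_outputs, evaluate, generate_insights, tune_prompt), the EvalResult fields, and the scoring thresholds are all illustrative placeholders supplied here for clarity.

```python
# Minimal sketch of the evaluate -> analyze failures -> tune prompt loop.
# All names and thresholds are hypothetical, not part of the disclosure.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalResult:
    output: str
    score: float    # per-output quality score from the evaluator
    feedback: str   # diagnostic information explaining the score


def tune_until_goal(
    system_prompt: str,
    generate_outputs: Callable[[str], List[str]],                 # run the LLM with the current prompt
    evaluate: Callable[[List[str]], List[EvalResult]],            # score outputs and attach diagnostics
    generate_insights: Callable[[List[EvalResult]], List[str]],   # LLM-based insight generator
    tune_prompt: Callable[[str, List[str]], str],                 # prompt tuner applies action items
    target_score: float = 0.9,
    failure_threshold: float = 0.5,
    max_iterations: int = 5,
) -> str:
    """Iteratively refine the system prompt until the average score meets the goal."""
    for _ in range(max_iterations):
        results = evaluate(generate_outputs(system_prompt))
        avg_score = sum(r.score for r in results) / max(len(results), 1)
        if avg_score >= target_score:
            break  # desired performance goal reached
        failures = [r for r in results if r.score < failure_threshold]  # low-performing outputs
        action_items = generate_insights(failures)                      # structured action items
        system_prompt = tune_prompt(system_prompt, action_items)        # modified system prompt
    return system_prompt
```

In this sketch, the loop terminates either when the aggregate evaluation score reaches the target or when the iteration budget is exhausted, mirroring the re-evaluation cycle described in the abstract.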

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
