Abstract

Methods, systems, and computer program products are provided for difficulty-driven, temporal adaptation of direct preference optimization of a large language machine learning model. An example method includes: receiving pairs of responses associated with text prompts; determining a probability margin score for a pair of responses; computing a measure of semantic similarity associated with the pair of responses; computing a difficulty score for the pair of responses based on the probability margin score; computing a time-dependent temperature parameter for the pair of responses based on a minimum time-dependent temperature parameter, a maximum time-dependent temperature parameter, a time parameter associated with training the large language machine learning model, and the difficulty score for the pair of responses; calculating a measure of loss for the pair of responses based on the time-dependent temperature parameter for the pair of responses; and updating the large language machine learning model based on the measure of loss.
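The abstract names the steps but not the formulas, so the following Python is only an illustrative sketch of how they might fit together. The function names, the tanh-based difficulty score, the way semantic similarity enters that score, and the linear temperature schedule are all assumptions made for illustration; only the final loss follows the standard direct preference optimization form, -log sigmoid(beta * (policy log-ratio minus reference log-ratio)).

```python
import math


def difficulty_score(margin: float, similarity: float) -> float:
    """Hypothetical difficulty in [0, 1] (assumed formula, not from the source).

    A pair is treated as harder when the probability margin is small and
    the two responses are semantically close (similarity in [0, 1]).
    """
    closeness = 1.0 - math.tanh(abs(margin))
    return 0.5 * closeness + 0.5 * similarity


def temperature(beta_min: float, beta_max: float,
                progress: float, difficulty: float) -> float:
    """Hypothetical time-dependent temperature (assumed linear schedule).

    progress is the fraction of training completed, in [0, 1]. Hard pairs
    start near beta_min; every pair reaches beta_max when progress is 1.
    """
    scale = 1.0 - difficulty * (1.0 - progress)
    return beta_min + (beta_max - beta_min) * scale


def dpo_loss(beta: float, policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float) -> float:
    """Standard DPO loss for one pair, given summed log-probabilities."""
    x = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(x)) equals softplus(-x); branch for numerical stability.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Example pair with made-up log-probabilities and similarity.
margin = -12.0 - (-12.5)  # policy log-prob of chosen minus rejected
d = difficulty_score(margin, similarity=0.9)
beta = temperature(beta_min=0.05, beta_max=0.5, progress=0.2, difficulty=d)
loss = dpo_loss(beta, -12.0, -12.5, -12.2, -12.4)
```

In an actual training loop the loss would be averaged over a batch and backpropagated through the policy model's log-probabilities; the scalar version above only shows how the per-pair temperature feeds into the loss.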

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
