Abstract
Methods, systems, and computer program products are provided for difficulty-driven temporal adaptation of direct preference optimization of a large language machine learning model. An example method includes: receiving pairs of responses associated with text prompts; determining a probability margin score for a pair of responses; computing a measure of semantic similarity associated with the pair of responses; computing a difficulty score for the pair of responses based on the probability margin score; computing a time-dependent temperature parameter for the pair of responses based on a minimum time-dependent temperature parameter, a maximum time-dependent temperature parameter, a time parameter associated with training the large language machine learning model, and the difficulty score for the pair of responses; calculating a measure of loss for the pair of responses based on the time-dependent temperature parameter for the pair of responses; and updating the large language machine learning model based on the measure of loss.
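The abstract does not give formulas, so the following is only a minimal sketch of how the listed steps might fit together for a single response pair. Everything specific in it is an assumption made for illustration: the function names, the use of the standard DPO log-ratio margin as the probability margin score, the sigmoid mapping from margin to difficulty, the normalized training progress `t` in [0, 1], and the interpolation rule for the temperature. The abstract also does not say how the semantic similarity measure is used, so it is left out here.

```python
import math


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probability_margin(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    # Assumption: the margin is the usual DPO difference of policy/reference log-ratios.
    return (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)


def difficulty_score(margin: float) -> float:
    # Assumption: a small or negative margin means a hard pair; result lies in (0, 1).
    return _sigmoid(-margin)


def time_dependent_beta(beta_min: float, beta_max: float, t: float, difficulty: float) -> float:
    # Assumption: t is training progress in [0, 1]. Early in training (t near 0)
    # easy pairs receive the larger temperature; late in training (t near 1)
    # hard pairs do. The weight w is a convex combination, so it stays in [0, 1].
    w = (1.0 - t) * (1.0 - difficulty) + t * difficulty
    return beta_min + (beta_max - beta_min) * w


def dpo_loss(beta: float, margin: float) -> float:
    # Standard DPO pair loss, -log(sigmoid(beta * margin)), in a stable softplus form.
    x = beta * margin
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical log-probabilities for one (chosen, rejected) pair.
margin = probability_margin(-1.0, -2.0, -1.5, -1.5)
difficulty = difficulty_score(margin)
beta = time_dependent_beta(beta_min=0.1, beta_max=0.5, t=0.0, difficulty=difficulty)
loss = dpo_loss(beta, margin)
```

In a real training loop the loss would be averaged over a batch and backpropagated to update the model; the scalar version above only shows how the per-pair quantities depend on one another.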
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Wu, Ziwei; Islam, Rashidul; and Cai, Yiwei, "METHOD, SYSTEM, AND COMPUTER PROGRAM PRODUCT FOR DIFFICULTY-DRIVEN TEMPORAL ADAPTATION OF DIRECT PREFERENCE OPTIMIZATION", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10277