Abstract

While large language models (LLMs) can generate code, training of such models has not made use of the data generated during the collaborative code review process that is a standard part of software development. This disclosure describes techniques that use historical code review data (including reviewer comments and the corresponding code edits) available within an organization's internal code repositories to train LLMs to generate code. The historical code review data can be used for model tuning, for training an LLM via reinforcement learning from human feedback (RLHF), and/or for prompt engineering. The trained model can then generate code from a code description provided through a prompt template. The prompt template can incorporate organization-specific factors such as developer guidelines and developer or team style. Code generated by the LLM can be iteratively refined through human review as well as with analytical tools that check style compliance, code coverage, test success rate, comment conventions, etc.
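
As a rough illustration of how review history might be turned into tuning data, the sketch below converts one review round (code before, reviewer comment, code after) into a prompt/completion pair. The record schema, the template wording, and the function names are all hypothetical; the disclosure does not specify a data format, so this is one possible arrangement under those assumptions rather than the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReviewRound:
    """One round of review on a changelist (hypothetical schema)."""
    code_before: str
    reviewer_comment: str
    code_after: str


# Hypothetical prompt template; the guidelines text stands in for
# organization-specific factors such as a developer style guide.
PROMPT_TEMPLATE = (
    "Guidelines:\n{guidelines}\n\n"
    "Code under review:\n{code}\n\n"
    "Reviewer comment:\n{comment}\n\n"
    "Revised code:\n"
)


def to_training_example(review: ReviewRound, guidelines: str) -> Dict[str, str]:
    """Turn one review round into a prompt/completion pair for tuning."""
    prompt = PROMPT_TEMPLATE.format(
        guidelines=guidelines,
        code=review.code_before,
        comment=review.reviewer_comment,
    )
    return {"prompt": prompt, "completion": review.code_after}


def build_dataset(rounds: List[ReviewRound], guidelines: str) -> List[Dict[str, str]]:
    """Keep only rounds in which the comment led to an actual code edit."""
    return [
        to_training_example(r, guidelines)
        for r in rounds
        if r.code_before != r.code_after
    ]
```

Rounds in which the code did not change are dropped here on the assumption that they carry no edit for the model to learn from; a real pipeline might instead keep them as approval signals, for example when building preference data for RLHF.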

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Johnson Jr, Joseph; Hasan, Shiblee; and Koukoumidis, Emmanouil, "Using Code Review Repositories and Changelists to Train Large Language Models for Code Generation", Technical Disclosure Commons, (July 03, 2023)
https://www.tdcommons.org/dpubs_series/6027


Technical Disclosure Commons

Defensive Publications Series

Using Code Review Repositories and Changelists to Train Large Language Models for Code Generation
