While large language models (LLMs) can generate code, the training of such models has not made use of data generated during the collaborative code review process that is a standard part of software development. This disclosure describes techniques that utilize historical code review data (including reviewer comments and the corresponding code edits) available within an organization's internal code repositories to train LLMs to generate code. The historical code review data can be used for model tuning, for training an LLM via reinforcement learning from human feedback (RLHF), and/or for prompt engineering. The trained model can be utilized to generate code starting from a code description provided using a prompt template. The prompt template can incorporate organization-specific factors such as developer guidelines, developer or team style, etc. Code generated by the LLM can be iteratively refined via human review as well as by analytical tools that check style compliance, code coverage, test success rate, comment conventions, etc.
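As a rough illustration of the two data-shaping steps summarized above, the following minimal Python sketch shows how one historical review exchange might be turned into a tuning example, and how a prompt template might combine a code description with organization-specific factors. Everything named here (`ReviewRecord`, `to_tuning_example`, `PROMPT_TEMPLATE`, `build_generation_prompt`, and the field names) is a hypothetical assumption for illustration; the disclosure does not specify a schema or a template format.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class ReviewRecord:
    """One historical review exchange (hypothetical schema, not from the disclosure)."""
    code_before: str       # code as submitted for review
    reviewer_comment: str  # comment left by the reviewer
    code_after: str        # code after the author addressed the comment


def to_tuning_example(record: ReviewRecord) -> dict:
    """Turn a review exchange into an (input, target) pair for model tuning."""
    prompt = (
        "Original code:\n" + record.code_before + "\n\n"
        "Reviewer comment:\n" + record.reviewer_comment + "\n\n"
        "Revised code:\n"
    )
    return {"input": prompt, "target": record.code_after}


# Assumed template layout: guidelines and team style precede the description.
PROMPT_TEMPLATE = Template(
    "Follow these developer guidelines:\n$guidelines\n\n"
    "Match this team style:\n$team_style\n\n"
    "Write code for the following description:\n$description\n"
)


def build_generation_prompt(description: str, guidelines: str, team_style: str) -> str:
    """Fill the template with a code description and organization-specific factors."""
    return PROMPT_TEMPLATE.substitute(
        guidelines=guidelines,
        team_style=team_style,
        description=description,
    )


record = ReviewRecord(
    code_before="def add(a,b): return a+b",
    reviewer_comment="Add type hints.",
    code_after="def add(a: int, b: int) -> int:\n    return a + b",
)
example = to_tuning_example(record)
generation_prompt = build_generation_prompt(
    description="Parse a CSV file and return its rows as dictionaries.",
    guidelines="Use type hints; keep functions under 40 lines.",
    team_style="snake_case names, docstrings on public functions",
)
```

In this sketch the reviewer comment is part of the model input and the revised code is the target, which is one plausible way to use review data for tuning; the same records could instead serve as preference signals for RLHF or as in-context examples in a prompt.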
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Johnson Jr, Joseph; Hasan, Shiblee; and Koukoumidis, Emmanouil, "Using Code Review Repositories and Changelists to Train Large Language Models for Code Generation", Technical Disclosure Commons, (July 03, 2023)