Abstract
An LLM can be given an arbitrary image (e.g., a PDF, JPG, or any other format containing handwritten or printed text in arbitrary orientations and languages) and instructed to extract the text from it. While an LLM can perform this task, its accuracy can be unsatisfactory and inconsistent. This disclosure describes the use of reinforcement learning with machine feedback (RLMF) to improve the accuracy of an LLM on image-to-text conversion tasks. Per the techniques, known documents (those for which the ground-truth text content is available) and/or generated documents with text in a variety of fonts (and other parameters, such as script, orientation, and size) are rendered as images. An LLM is tasked with extracting text from the images. The extracted text is compared with the ground truth to determine the number of mistakes. A machine-based reward model is created that scores each extraction by its number of mistakes, and that score serves as the training signal for the LLM.
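The comparison step can be illustrated with a minimal sketch. The abstract only says that the number of mistakes is determined, so the choices below are assumptions made for illustration: mistakes are counted as character-level Levenshtein edit distance, and the reward is that count normalized by the ground-truth length and clamped to the range 0 to 1. The function names `edit_distance` and `ocr_reward` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_reward(extracted: str, ground_truth: str) -> float:
    """Assumed reward: 1.0 for a perfect extraction, decreasing with the
    number of mistakes, never below 0.0."""
    mistakes = edit_distance(extracted, ground_truth)
    denom = max(len(ground_truth), 1)  # avoid dividing by zero
    return max(0.0, 1.0 - mistakes / denom)
```

Under these assumptions, a reinforcement learning loop would call `ocr_reward` on each LLM output for a rendered image and use the returned value as the scalar reward. Other mistake counts (for example, word-level errors) would fit the same interface.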
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Weisz, Ágoston and Salama, Khalid, "RLMF Training of LLM for Optical Character Recognition Tasks", Technical Disclosure Commons, (June 11, 2025)
https://www.tdcommons.org/dpubs_series/8224