Abstract

It is possible to provide an arbitrary image (e.g., a PDF, JPG, or other file containing handwritten or printed text in arbitrary orientations and languages) to a large language model (LLM) and instruct it to extract the text. While an LLM can perform this task, its accuracy can be unsatisfactory and inconsistent. This disclosure leverages reinforcement learning with machine feedback (RLMF) to improve the accuracy of an LLM on image-to-text conversion tasks. Per the techniques, known documents (whose ground-truth text content is known) and/or generated documents with text rendered in a variety of fonts (and other varied parameters, such as script, orientation, and size) are converted into images. An LLM is tasked with extracting text from the images. The extracted text is compared with the ground truth to determine the number of mistakes. A machine-based reward model is created that uses the number of mistakes as the feedback signal for training the LLM.
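
As an illustration of how a mistake count could be turned into a machine-generated reward, the following minimal Python sketch scores an extraction against its ground truth. It rests on assumptions that are not stated in the disclosure: mistakes are counted as character-level edit distance (insertions, deletions, and substitutions), and the reward is one minus the error rate, clipped to the range [0, 1]. The function names edit_distance and ocr_reward are hypothetical.

```python
def edit_distance(predicted: str, reference: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning predicted into reference."""
    # previous[j] holds the distance between the processed prefix of
    # `predicted` and the first j characters of `reference`.
    previous = list(range(len(reference) + 1))
    for i, p_char in enumerate(predicted, start=1):
        current = [i]
        for j, r_char in enumerate(reference, start=1):
            substitution_cost = 0 if p_char == r_char else 1
            current.append(
                min(
                    previous[j] + 1,  # drop p_char from the prediction
                    current[j - 1] + 1,  # add the missing r_char
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def ocr_reward(predicted: str, reference: str) -> float:
    """Assumed reward shape: 1.0 for a perfect extraction, decreasing
    linearly with the character error rate, never below 0.0."""
    mistakes = edit_distance(predicted, reference)
    # Guard against division by zero when the ground truth is empty.
    error_rate = mistakes / max(len(reference), 1)
    return max(0.0, 1.0 - error_rate)
```

In a training loop, a score such as ocr_reward(extracted_text, ground_truth_text) would be computed for each rendered image and passed to the reinforcement learning update as the reward. Other mistake measures, such as word-level errors, could be substituted without changing the overall scheme.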

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.