Photographs of restaurant menus can be used to automatically obtain structured data such as dish names and prices. However, raw optical character recognition (OCR) output is often of low quality, and OCR techniques adapt poorly to the diversity of languages and designs found in restaurant menus. A language model can be combined with OCR to identify dish names and other content through named entity recognition (NER), but this approach does not scale because it requires a large labeled dataset spanning languages and countries. This disclosure describes the use of a multimodal large language model (LLM) to automatically generate structured digital menus from restaurant menu photographs. The resulting menus include price, description, ingredients, etc., and are created without a large amount of labeled data; the approach can also overcome difficulties associated with low-quality photographs. The capabilities of multimodal LLMs are leveraged by formulating menu understanding from user-provided photos as a multimodal information extraction or visual question answering task, which fits naturally within the framework of multimodal pretrained large models.
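
As an illustration of the information-extraction formulation, the following is a minimal sketch and not the disclosed implementation. It assumes a hypothetical callable `ask_model(image_bytes, prompt)` that wraps whichever multimodal LLM is used and returns its text reply; the prompt wording, the `MenuItem` fields, and the JSON output convention are likewise assumptions made for this example.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MenuItem:
    """One entry of the structured digital menu (field set is illustrative)."""
    name: str
    price: Optional[float] = None
    description: str = ""
    ingredients: List[str] = field(default_factory=list)


# Hypothetical prompt: asks the model to answer with a JSON array only.
PROMPT = (
    "List every dish visible in this menu photo as a JSON array. "
    "Each element must have the keys: name, price, description, ingredients."
)


def extract_menu(image_bytes: bytes,
                 ask_model: Callable[[bytes, str], str]) -> List[MenuItem]:
    """Query the (assumed) multimodal model and parse its reply into MenuItems."""
    raw = ask_model(image_bytes, PROMPT)

    # The reply may contain text around the JSON; keep the outermost array.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        rows = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []

    items: List[MenuItem] = []
    for row in rows:
        # Skip entries that are malformed or have no dish name.
        if not isinstance(row, dict) or not row.get("name"):
            continue
        price = row.get("price")
        is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
        items.append(MenuItem(
            name=str(row["name"]).strip(),
            price=float(price) if is_number else None,
            description=str(row.get("description") or ""),
            ingredients=[str(i) for i in (row.get("ingredients") or [])],
        ))
    return items
```

Passing the model call in as a parameter keeps the sketch independent of any particular LLM service, and the defensive parsing reflects that free-form model replies are not guaranteed to be valid JSON.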

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.