Abstract

Manually transcribing information from unstructured visual documents, such as utility bills or event flyers, into digital transactions can be inefficient and error-prone. A system is described for automating transactional workflows initiated from a visual input. A user may provide an image of a document to a multimodal large language model, which can perform semantic analysis of the image, extract relevant transactional data such as a payee and an amount, and infer the user's intent. This structured information can then be provided to an agentic controller that programmatically orchestrates a sequence of actions. These actions can include navigating a payment website, populating data fields, integrating with a payment service for user confirmation, and performing post-transaction tasks such as saving a digital receipt, thereby potentially reducing manual intervention in completing the transaction.
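The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the JSON field names, the `TransactionIntent` schema, and the controller's step list are assumptions, and the multimodal model call is replaced by a canned JSON response.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TransactionIntent:
    """Structured data extracted from the document image (illustrative schema)."""
    payee: str
    amount: float
    intent: str


def extract_intent(model_response: str) -> TransactionIntent:
    """Parse the multimodal model's (assumed) JSON output into structured fields."""
    fields = json.loads(model_response)
    return TransactionIntent(
        payee=fields["payee"],
        amount=float(fields["amount"]),
        intent=fields.get("intent", "pay_bill"),
    )


@dataclass
class AgenticController:
    """Orchestrates the transaction steps; each step is recorded for auditing."""
    log: list = field(default_factory=list)

    def run(self, tx: TransactionIntent, confirmed_by_user: bool) -> bool:
        # Navigate to the payee's payment site and populate fields.
        self.log.append(f"navigate: payment site for {tx.payee}")
        self.log.append(f"populate: amount={tx.amount:.2f}")
        # Hand off to a payment service for explicit user confirmation.
        if not confirmed_by_user:
            self.log.append("abort: user did not confirm")
            return False
        self.log.append("confirm: payment service authorized")
        # Post-transaction task: save a digital receipt.
        self.log.append("post: digital receipt saved")
        return True
```

For example, a simulated model response of `'{"payee": "City Utilities", "amount": "84.50"}'` parses into a `TransactionIntent`, and `AgenticController.run(tx, confirmed_by_user=True)` walks the navigate/populate/confirm/receipt steps, returning `False` without paying if the user declines.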

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
