Abstract
Manually transcribing information from unstructured visual documents, such as utility bills or event flyers, into digital transactions can be inefficient and error-prone. A system is described for automating transactional workflows initiated from a visual input. A user may provide an image of a document to a multimodal large language model, which can semantically analyze the image, extract relevant transactional data such as a payee and an amount, and infer the user's intent. This structured information can then be provided to an agentic controller that can programmatically orchestrate a sequence of actions. These actions can include navigating a payment website, populating data fields, integrating with a payment service for user confirmation, and performing post-transaction tasks such as saving a digital receipt, thereby potentially reducing the manual intervention needed to complete the transaction.
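The disclosure does not specify an implementation, but the pipeline it describes (model extracts structured fields from an image, a controller turns them into an ordered action plan) can be sketched roughly as follows. All names, the JSON schema, and the action vocabulary here are hypothetical illustrations, not part of the published disclosure; the multimodal model call is stubbed out as a literal JSON string.

```python
import json
from dataclasses import dataclass

@dataclass
class TransactionDetails:
    """Structured data a multimodal model might extract from a bill image."""
    payee: str
    amount: float
    currency: str
    intent: str

def parse_model_output(raw_json: str) -> TransactionDetails:
    # Parse the (hypothetical) JSON a multimodal model could return
    # after semantic analysis of the document image.
    fields = json.loads(raw_json)
    return TransactionDetails(
        payee=fields["payee"],
        amount=float(fields["amount"]),
        currency=fields.get("currency", "USD"),
        intent=fields.get("intent", "pay_bill"),
    )

def plan_workflow(details: TransactionDetails) -> list:
    # Sketch of the agentic controller: map extracted details to the
    # ordered action sequence described in the abstract.
    return [
        ("navigate", f"payment portal for {details.payee}"),
        ("populate", {"payee": details.payee,
                      "amount": f"{details.amount:.2f} {details.currency}"}),
        ("confirm", "request user approval via payment service"),
        ("save_receipt", f"{details.payee} receipt"),
    ]

# Illustrative stand-in for real model output on a utility-bill image.
raw = '{"payee": "City Water Utility", "amount": "84.50", "intent": "pay_bill"}'
actions = plan_workflow(parse_model_output(raw))
```

In a real system the `raw` string would come from a multimodal model prompted to emit a fixed schema, and each action tuple would be dispatched to browser-automation or payment-service integrations, with the confirmation step gating execution on explicit user approval.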
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Dayanand, Dinoop; Tejasvi, Ravi; Kota, Nithya; Chhatbar, Hemen; and Chiu, Adam, "Automated Transactional Workflow Orchestration from Visual Documents", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9731