Abstract
Heuristic-based methods such as regular expressions can struggle with accuracy and scalability when extracting specific data, such as a transaction total, from a web page. The described technique may use a client-server system in which a client application on a computing device (e.g., a smartphone, smart watch, or laptop) generates a compact, text-based representation of a web page’s rendered content. The representation can be transmitted to a remote server, where a large language model, potentially guided by an engineered prompt, analyzes the content to semantically identify and extract the desired data, such as a checkout amount. The system can return the extracted data to the client in a structured format. By relying on contextual understanding rather than rigid, site-specific rules, this approach may improve extraction accuracy and scalability across websites and may reduce the maintenance burden associated with rule-based systems.
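The flow described above can be illustrated with a minimal sketch, which is not the disclosed implementation. It makes several assumptions: static HTML stands in for the rendered content, the function names (compact_representation, build_prompt, parse_response), the prompt wording, and the JSON keys are all hypothetical, and the call to the language model is omitted because the disclosure does not name a model or API.

```python
import json
from html.parser import HTMLParser

# Assumption: text inside these tags is never visible to the user.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class _VisibleText(HTMLParser):
    """Collects text nodes that are not inside skipped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def compact_representation(html):
    """Client side: reduce a page to one visible text fragment per line."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical prompt; the disclosure only says a prompt may be engineered.
PROMPT_TEMPLATE = (
    "You are given the visible text of a web page.\n"
    'Return JSON with keys "checkout_amount" (number or null) and '
    '"currency" (ISO 4217 code or null).\n\n'
    "PAGE TEXT:\n{page_text}"
)


def build_prompt(page_text, max_chars=4000):
    """Server side: wrap the (truncated) page text in the prompt."""
    return PROMPT_TEMPLATE.format(page_text=page_text[:max_chars])


def parse_response(raw):
    """Server side: validate the model's JSON reply before returning it."""
    data = json.loads(raw)
    amount = data.get("checkout_amount")
    if amount is not None and not isinstance(amount, (int, float)):
        raise ValueError("checkout_amount must be a number or null")
    return {"checkout_amount": amount, "currency": data.get("currency")}
```

In this sketch the client would send the output of compact_representation to the server, the server would pass the result of build_prompt to whichever model it uses, and parse_response would check the reply before the structured result is returned to the client.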
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Garg, Shiva; Shah, Timir; Sun, James; Zou, Yifan; and Yin, Longsheng, "Large Language Model-Based Data Extraction from Web Pages Using a Compact Content Representation", Technical Disclosure Commons, (January 02, 2026)
https://www.tdcommons.org/dpubs_series/9102