Abstract
Automated web data extraction faces two challenges: manual methods are time-consuming, and direct data retrieval by generative models may be unreliable because their training data can be outdated. A disclosed technique can address these challenges by using a generative model to analyze the document object model (DOM) of web pages. Instead of extracting data content, the model can identify stable structural patterns that contain the desired data and can output corresponding selectors, such as Cascading Style Sheets (CSS) selectors or XPath expressions. This approach, which may involve clustering structurally similar pages before analysis, can separate structural pattern recognition from final value extraction. The process may accelerate the development of data extraction scripts and improve their resilience to modifications in website structure, thereby potentially enhancing the efficiency and reliability of data acquisition.
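The separation between structural pattern recognition and final value extraction might look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the disclosed implementation: the `SELECTORS` mapping stands in for what a generative model could return after analyzing a page's DOM (no model is called here), the field names, class names, and sample pages are hypothetical, and the standard-library `xml.etree.ElementTree` parser is used only because the sample markup is well-formed; real-world HTML would typically need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical model output: field name -> XPath selector.
# In the described approach a generative model would propose these after
# inspecting the DOM; here they are hardcoded for illustration.
SELECTORS: Dict[str, str] = {
    "title": ".//h1[@class='product-title']",
    "price": ".//span[@class='price']",
}


def extract(page_source: str, selectors: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply structural selectors to a page and read the current values."""
    root = ET.fromstring(page_source)
    record: Dict[str, Optional[str]] = {}
    for field, xpath in selectors.items():
        node = root.find(xpath)
        has_text = node is not None and node.text is not None
        record[field] = node.text.strip() if has_text else None
    return record


# Two hypothetical pages from the same structural cluster. The second has
# different values and an extra element, but the selectors still match.
PAGE_A = (
    '<html><body><div id="main">'
    '<h1 class="product-title">Blue Kettle</h1>'
    '<span class="price">19.99</span>'
    "</div></body></html>"
)
PAGE_B = (
    '<html><body><div id="main">'
    '<h1 class="product-title">Red Toaster</h1>'
    "<p>New</p>"
    '<span class="price">34.50</span>'
    "</div></body></html>"
)

if __name__ == "__main__":
    print(extract(PAGE_A, SELECTORS))
    print(extract(PAGE_B, SELECTORS))
```

Because the values are read from the live page at extraction time, the selectors can be reused across pages in the same cluster, and the result does not depend on whatever data the model saw during training.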
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Janeiro, Jordan, "Generative Model-Assisted Generation of Structural Selectors for Web Data Extraction", Technical Disclosure Commons, (January 18, 2026)
https://www.tdcommons.org/dpubs_series/9195