Abstract
Automated data extraction from diverse web resources may result in inaccuracies, as single-method systems can struggle with varied implementation patterns such as client-side rendering. This paper describes systems and methods that use a trustability-based framework to address these potential inaccuracies. The framework can employ a set of diverse extractors, for example code-based, image-based, and machine learning-based modules, that may operate concurrently. The performance of these extractors can be evaluated against a ground-truth dataset, which may be generated by human reviewers from a sample of web pages for a given domain. Extractors that meet a predefined accuracy threshold can be certified as trusted for that domain. For large-scale extraction, a consensus mechanism may synthesize a final data point from the outputs of the certified extractors. This approach can improve the fidelity of extracted dynamic data, such as product price and availability, at scale, and can adapt to changes in web page design over time.
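The pipeline outlined above (certify extractors per domain against a ground-truth sample, then aggregate only the certified extractors' outputs) could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the names (certify_extractors, consensus_value), the 0.95 threshold, and the majority-vote consensus are hypothetical choices for exposition, not the implementation in the disclosure.

```python
# Illustrative sketch only: names, the threshold value, and the majority-vote
# consensus are assumptions for exposition, not the disclosure's implementation.
from collections import Counter
from typing import Callable, Dict, List

# An extractor maps a page reference (e.g., URL or fetched content) to a value,
# such as a product price or availability string.
Extractor = Callable[[str], str]

def certify_extractors(
    extractors: Dict[str, Extractor],
    ground_truth: Dict[str, str],        # page reference -> human-reviewed value
    accuracy_threshold: float = 0.95,    # assumed predefined accuracy threshold
) -> List[str]:
    """Return the names of extractors whose accuracy on the domain's
    ground-truth sample meets the threshold ("trusted" extractors)."""
    trusted = []
    for name, extract in extractors.items():
        correct = sum(1 for page, truth in ground_truth.items() if extract(page) == truth)
        if correct / len(ground_truth) >= accuracy_threshold:
            trusted.append(name)
    return trusted

def consensus_value(page: str, extractors: Dict[str, Extractor], trusted: List[str]) -> str:
    """Synthesize a final data point from the trusted extractors' outputs,
    here by simple majority vote (one possible consensus mechanism)."""
    outputs = [extractors[name](page) for name in trusted]
    value, _count = Counter(outputs).most_common(1)[0]
    return value
```

In practice the consensus step might weight each certified extractor's output by its measured accuracy rather than voting uniformly, but the overall shape is the same: run the diverse extractors, certify them per domain against the ground truth, and aggregate only the certified outputs for large-scale extraction.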
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Janeiro, Jordan and Novikov, Sergey, "System for Data Extraction via Consensus of Extractors Vetted by Ground Truth", Technical Disclosure Commons, (December 11, 2025)
https://www.tdcommons.org/dpubs_series/9016