Abstract
Sampling-based ground truth discovery for data asset compliance verification is disclosed. Regulated user identifiers are sampled from one or more monitored populations, which may include sampled production users and/or controlled test users. For multiple target data assets across heterogeneous data store types, scannable columns are identified using heuristics, machine-learning predictions, and/or semantic annotations. Join queries are executed between the sampled identifiers and the target assets, including columns with nested structures such as maps, arrays, and JSON, to obtain empirical observations of user data presence. The observations are evaluated against the privacy expectations defined for asset groups, expressed as states including ALLOW, DISALLOW, ALLOW_EXCLUSIVE, and EXIST_DEFINITE, optionally incorporating dynamic carve-out and disallow lists. Verification results are rolled up to commitment-level statuses, and compliance evidence identifying assets, columns, data type indicators, and timestamps is published to queryable tables. Scheduled scenario runs with configurable monitoring durations and monitoring/verification delays provide continuous verification and early detection of out-of-scope data presence.
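
As a minimal illustration only, the evaluation and roll-up steps described above might be sketched as follows. The abstract names the four expectation states but does not define them, so the semantics used here are assumptions: ALLOW always passes, DISALLOW fails when the monitored population is observed, ALLOW_EXCLUSIVE fails when any other population is observed, and EXIST_DEFINITE fails when the monitored population is not observed. The names `Observation`, `evaluate`, `roll_up`, the carve-out handling, and the `PASS`/`FAIL` status strings are likewise hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum


class Expectation(Enum):
    # State names come from the abstract; their semantics below are assumed.
    ALLOW = "ALLOW"
    DISALLOW = "DISALLOW"
    ALLOW_EXCLUSIVE = "ALLOW_EXCLUSIVE"
    EXIST_DEFINITE = "EXIST_DEFINITE"


@dataclass(frozen=True)
class Observation:
    """Hypothetical result of one join between sampled identifiers and a column."""
    asset: str
    column: str
    population: str
    matched_ids: int  # number of sampled identifiers the join query found


def evaluate(expectation, population, observations, carve_outs=frozenset()):
    """Return (passed, violating_observations) for one asset group."""
    # Assumption: assets on the carve-out list are ignored before evaluation.
    seen = [o for o in observations
            if o.matched_ids > 0 and o.asset not in carve_outs]
    own = [o for o in seen if o.population == population]
    others = [o for o in seen if o.population != population]

    if expectation is Expectation.ALLOW:
        return True, []
    if expectation is Expectation.DISALLOW:
        return not own, own
    if expectation is Expectation.ALLOW_EXCLUSIVE:
        return not others, others
    if expectation is Expectation.EXIST_DEFINITE:
        return bool(own), []
    raise ValueError(f"unknown expectation: {expectation!r}")


def roll_up(group_results):
    """Assumed roll-up rule: a commitment passes only if every group passes."""
    return "PASS" if all(group_results) else "FAIL"
```

For example, with one observation of three matched identifiers from a hypothetical `eu_users` population in an `orders` asset, `evaluate(Expectation.DISALLOW, "eu_users", observations)` reports a failure and returns that observation as evidence, while passing `carve_outs={"orders"}` makes the same check pass.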
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Sampling-Based Ground Truth Discovery for Data Asset Compliance Verification", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10720