Defensive Publications Series

Automated Generation of Data Quality Checks by Identifying Key Dimensions and Values

W. Max Lees
Yang Liu
Steven Lee
Mingyang Li
Keyu He
Emmett Cunningham
David Rissato Cruz
Chioma Ezete
Eric Wu

Abstract

Monitoring the quality and integrity of the data stored in a data warehouse is necessary to ensure correct and reliable operation. Such monitoring can include checks that detect anomalous and/or impermissible values for the metrics and dimensions present in the database tables. Determining useful and effective checks and bounds is difficult and tedious, and requires a high level of expertise and familiarity with the data. This disclosure describes techniques for automating the creation of data quality checks by examining the database schema and contents to identify the dimensions and values that matter most. The techniques rely on the observation that, in practice, only a subset of the values of a database field is likely to be of operational importance. These values are identified automatically by assigning importance to metrics, dimensions, and dimension values and calculating importance-adjusted data quality coverage. Data quality checks are then generated automatically for effective coverage of the key dimensions and values. The generation of checks can involve selecting from a repository of historically effective checks written by experts and/or applying time series anomaly detection to metrics, either in their entirety or sliced by key dimension values.
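The idea of importance-adjusted coverage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how importance is assigned, so the sketch assumes row frequency as a stand-in for importance, and the function names `key_values` and `importance_adjusted_coverage` and the `mass` parameter are hypothetical.

```python
from collections import Counter


def key_values(values, mass=0.8):
    """Pick the most frequent values until they cover `mass` of all rows.

    Assumption: row frequency is used as a proxy for operational importance.
    """
    counts = Counter(values)
    total = sum(counts.values())
    selected, covered = [], 0
    for value, n in counts.most_common():
        if covered >= mass * total:
            break
        selected.append(value)
        covered += n
    return selected


def importance_adjusted_coverage(importance, checked):
    """Return the share of total importance covered by at least one check.

    `importance` maps an item (metric, dimension, or dimension value) to a
    non-negative weight; `checked` is the set of items that have a check.
    """
    total = sum(importance.values())
    if total == 0:
        return 0.0
    covered = sum(w for item, w in importance.items() if item in checked)
    return covered / total


# Hypothetical "country" dimension with ten rows.
rows = ["US"] * 6 + ["GB"] * 3 + ["FR"]
importance = dict(Counter(rows))

print(key_values(rows))                                    # ['US', 'GB']
print(importance_adjusted_coverage(importance, {"US"}))    # 0.6
```

Under these assumptions, a check on `US` alone covers 60% of the importance-weighted data, and adding a check on `GB` would raise that to 90%. A generator of checks could use this number to decide which dimension values to slice a metric by before applying anomaly detection.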

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Lees, W. Max; Liu, Yang; Lee, Steven; Li, Mingyang; He, Keyu; Cunningham, Emmett; Cruz, David Rissato; Ezete, Chioma; and Wu, Eric, "Automated Generation of Data Quality Checks by Identifying Key Dimensions and Values", Technical Disclosure Commons, (November 18, 2021)
https://www.tdcommons.org/dpubs_series/4731
