Monitoring the quality and integrity of the data stored in a data warehouse is necessary to ensure correct and reliable operation. Such checks can include detecting anomalous and/or impermissible values for the metrics and dimensions present in the database tables. Determining useful and effective checks and bounds is difficult and tedious, and requires a high level of expertise and familiarity with the data. This disclosure describes techniques for automating the creation of data quality checks by examining the database schema and contents to identify the dimensions and values that matter most. The techniques rely on the observation that, in practice, only a subset of the values of a database field is likely to be of operational importance. These values are identified automatically by assigning importance to metrics, dimensions, and dimension values and then calculating importance-adjusted data quality coverage. Data quality checks are then generated automatically to provide effective coverage of the key dimensions and values. Generating the checks can involve selecting from a repository of historically effective checks created by experts and/or applying time series anomaly detection to metrics in their entirety or sliced by key dimension values.
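One way to picture importance-adjusted coverage and sliced anomaly detection is the minimal Python sketch below. It is an illustration under stated assumptions, not the method of the disclosure. The importance of a dimension value is assumed to be its share of rows. The 0.8 coverage target and the z-score test are stand-ins for whatever importance measure and anomaly detector an actual system would use. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def value_importance(rows, dimension):
    """Importance of each dimension value; assumed here to be its share of rows."""
    counts = Counter(row[dimension] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def key_values(importance, target=0.8):
    """Pick the most important values until their combined weight reaches target."""
    chosen, covered = [], 0.0
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for value, weight in ranked:
        if covered >= target:
            break
        chosen.append(value)
        covered += weight
    return chosen


def importance_adjusted_coverage(importance, checked_values):
    """Sum of importance over the dimension values that have a check."""
    return sum(w for v, w in importance.items() if v in checked_values)


def is_anomalous(history, latest, z_threshold=3.0):
    """Simple z-score test, standing in for a real time series anomaly detector."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > z_threshold
```

In this sketch, a check generator would call `key_values` for each dimension, attach an `is_anomalous` check to the metric sliced by each selected value, and report `importance_adjusted_coverage` to show how much of the operationally important data the generated checks cover.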
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Lees, W. Max; Liu, Yang; Lee, Steven; Li, Mingyang; He, Keyu; Cunningham, Emmett; Cruz, David Rissato; Ezete, Chioma; and Wu, Eric, "Automated Generation of Data Quality Checks by Identifying Key Dimensions and Values", Technical Disclosure Commons, (November 18, 2021)