Abstract

As data volumes grow, accurate data cataloging, lineage, and governance are becoming more important. Traditional automation for data context discovery relies on manual tagging or on scanning the data with a parser to understand its content. This can require knowledge of the content type and can be error-prone, time-consuming, and expensive. Further, traditional data context discovery does not track lineage between disparate data types and cannot index binary data. This disclosure describes efficient techniques to index and map similarity across multiple datasets to determine data lineage. The techniques neither need nor use the context of the underlying data or the concepts therein. A rolling hash is computed over each of the datasets whose lineage is sought. The resulting hash streams, which serve as indexes for their data, are compared using, e.g., a search engine. Similarity between hash streams is used to establish lineage.
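The idea can be illustrated with a minimal Python sketch. It is not the disclosure's implementation: the window size, boundary mask, polynomial rolling hash, SHA-256 chunk fingerprints, and the function names (`chunks`, `hash_stream`, `similarity`) are all assumptions made for illustration, and a simple Jaccard overlap stands in for the search-engine comparison mentioned above.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the disclosure.
WINDOW = 16            # bytes covered by the rolling hash window
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero (roughly 64-byte spacing)
BASE = 257             # polynomial base for the rolling hash
MOD = (1 << 31) - 1    # modulus for the rolling hash


def chunks(data: bytes):
    """Yield variable-size blocks cut at content-defined boundaries."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the oldest byte in the window.
            h = (h - data[i - WINDOW] * top) % MOD
        # Shift the window and add the newest byte.
        h = (h * BASE + b) % MOD
        # Cut when the hash matches the pattern and the block is long enough.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]  # trailing block with no boundary after it


def hash_stream(data: bytes):
    """Return the sequence of block fingerprints that indexes the data."""
    return [hashlib.sha256(c).hexdigest()[:16] for c in chunks(data)]


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard overlap of two hash streams (assumed stand-in for a search engine)."""
    sa, sb = set(hash_stream(a)), set(hash_stream(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Because block boundaries depend only on the bytes inside the rolling window, inserting or deleting bytes in one dataset shifts only nearby boundaries, and the remaining blocks still produce matching fingerprints. A high `similarity` score between two datasets would then be read as a hint of shared lineage, without parsing either dataset or knowing its content type.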

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Silverberg, Sam, "Detecting Data Lineage Using Variable Block Deduplication", Technical Disclosure Commons, (November 18, 2021)
https://www.tdcommons.org/dpubs_series/4730

Technical Disclosure Commons

Defensive Publications Series

Detecting Data Lineage Using Variable Block Deduplication
