With burgeoning data volumes, the need for accurate data cataloging, lineage, and governance is gaining importance. Traditional automation for data context discovery requires manual tagging or scanning of data using a parser to understand the content, which can require knowledge of the content type and can be error-prone, time-consuming, and expensive. Further, traditional data context discovery doesn’t track lineage between disparate data types, and cannot index binary data. This disclosure describes efficient techniques to index and map similarity across multiple datasets to determine data lineage. The techniques do not need or use the context of the underlying data or concepts therein. A rolling hash is created of the multiple datasets whose lineage is sought. The resulting hash streams, serving as indexes for their data, are compared using, e.g., a search engine. Similarity in hash streams is used to establish lineage.
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Silverberg, Sam, "Detecting Data Lineage Using Variable Block Deduplication", Technical Disclosure Commons, (November 18, 2021)