Abstract
Discovering and identifying approximately matching digital content efficiently and at scale is an important and complex problem.
The proposed method for approximately matching digital content and text fragments consists of three elements:
1. A content-defined chunking algorithm that splits content into fragments and selects a subset of these fragments.
2. Fingerprinting of fragments with a locality-sensitive hashing (LSH) function for approximate matching of these fragments. This fingerprint can embed multiple precisions in a single bit string, which makes it possible to tune the search precision and to organize the search in rounds of progressively increasing precision.
3. Indexing-time approximate matching for deduplication, where each new content fragment is added to the index only if it cannot already be matched approximately in the index, which avoids large-scale duplication of content entries.
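The three elements above can be illustrated with a minimal sketch. It is not the method's actual implementation: the token-window chunking rule, the use of SimHash as the LSH function, the MD5-based token hashes, the 16/32/64-bit precision rounds, the Hamming-distance threshold, and all names (chunk, simhash, hamming, Index) are assumptions chosen for illustration only.

```python
import hashlib
import re


def chunk(text, window=4, mask=0x3):
    """Content-defined chunking over tokens (illustrative rule, an assumption):
    cut after a token when the hash of the last `window` tokens has its
    low bits (selected by `mask`) all equal to zero."""
    tokens = re.findall(r"\w+", text.lower())
    chunks, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        win = " ".join(tokens[max(0, i - window + 1): i + 1])
        h = int.from_bytes(hashlib.md5(win.encode()).digest()[:4], "big")
        if h & mask == 0:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


def simhash(tokens, bits=64):
    """SimHash fingerprint, used here as a stand-in LSH function."""
    counts = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b, precision=64, bits=64):
    """Hamming distance restricted to the top `precision` bits."""
    shift = bits - precision
    return bin((a >> shift) ^ (b >> shift)).count("1")


class Index:
    """Hypothetical in-memory index that deduplicates at indexing time."""

    def __init__(self, max_distance=3):
        self.fingerprints = []
        self.max_distance = max_distance

    def find(self, fp):
        # Search in rounds of increasing precision: each round compares a
        # longer prefix of the same bit string and prunes the candidates.
        candidates = self.fingerprints
        for precision in (16, 32, 64):
            candidates = [c for c in candidates
                          if hamming(fp, c, precision) <= self.max_distance]
            if not candidates:
                return None
        return candidates[0]

    def add(self, fp):
        # Add the fingerprint only if no approximate match is indexed yet.
        if self.find(fp) is not None:
            return False
        self.fingerprints.append(fp)
        return True


if __name__ == "__main__":
    index = Index()
    for fragment in chunk("some example content to split, fingerprint and index"):
        index.add(simhash(fragment))
    print(len(index.fingerprints), "fingerprints indexed")
```

In this sketch, a candidate whose prefix already differs by more than the threshold cannot be within the threshold over the full 64 bits, so the coarse rounds only discard entries that the final round would also reject. A production index would use a lookup structure rather than the linear scan shown here.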
This publication describes the general context and the problems this method attempts to solve, then describes the method itself. It concludes with example applications to AI-Generated Code Search and to Software Origin Discovery for Software Supply Chain Security. The methods described here also have applications beyond these, including deduplication of training data for AI and LLMs.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.
Recommended Citation
Ombredanne, Philippe, "Method for Approximate Content Discovery", Technical Disclosure Commons, (November 21, 2024)
https://www.tdcommons.org/dpubs_series/7579