Abstract

Discovering and identifying approximately matching digital content efficiently and at scale is an important and complex problem.

The proposed method for approximately matching digital content and text fragments consists of three elements:

1. A content-defined chunking algorithm that splits content into fragments and selects a subset of these fragments.

2. Fingerprinting fragments with a locality-sensitive hashing (LSH) function for approximate matching of these fragments. This fingerprint can embed multiple precisions in a single bit string, which makes it possible to tune the search precision and to organize the search in rounds of progressively increasing precision.

3. Indexing-time approximate matching for deduplication: each new content fragment is added to the index only if no approximate match for it already exists in the index, which avoids large-scale duplication of content entries.
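The three elements above can be illustrated with a minimal Python sketch. It is not the publication's implementation: the gear-style rolling hash, the SimHash fingerprint, the prefix-based rounds, and every name and parameter (MASK, MIN_SIZE, ROUNDS, select, add_if_new, and so on) are assumptions chosen for illustration. The index is a plain list scanned linearly, which a real system would replace with a proper search structure.

```python
import hashlib

# All parameters below are hypothetical values chosen for illustration.
MASK = 0x3F << 26        # cut when the top 6 bits of the rolling hash are zero
MIN_SIZE = 16            # avoid very small fragments
BITS = 64                # fingerprint width in bits
ROUNDS = ((16, 3), (32, 6), (64, 12))  # (prefix bits, max differing bits)


def _h64(data: bytes) -> int:
    """Stable 64-bit hash of a byte string."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


# Per-byte random-looking values for the gear-style rolling hash.
GEAR = [_h64(bytes([i])) & 0xFFFFFFFF for i in range(256)]


def chunks(data: bytes) -> list[bytes]:
    """Element 1: cut where a rolling hash of recent bytes hits a bit pattern."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if i + 1 - start >= MIN_SIZE and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out


def select(fragments: list[bytes], keep_one_in: int = 2) -> list[bytes]:
    """Element 1, continued: keep a content-determined subset (assumed rule)."""
    return [f for f in fragments if _h64(f) % keep_one_in == 0]


def fingerprint(fragment: bytes, n: int = 4) -> int:
    """Element 2: SimHash over byte n-grams, used here as the LSH function."""
    counts = [0] * BITS
    for i in range(max(1, len(fragment) - n + 1)):
        g = _h64(fragment[i:i + n])
        for bit in range(BITS):
            counts[bit] += 1 if (g >> bit) & 1 else -1
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)


def matches(a: int, b: int) -> bool:
    """Compare in rounds: short prefixes first, then longer, stricter ones."""
    for prefix, max_diff in ROUNDS:
        shift = BITS - prefix
        if bin((a >> shift) ^ (b >> shift)).count("1") > max_diff:
            return False  # rejected early by a coarser round
    return True


def add_if_new(index: list[int], fp: int) -> bool:
    """Element 3: index a fingerprint only if nothing similar is present."""
    if any(matches(fp, known) for known in index):
        return False
    index.append(fp)
    return True


# Usage: index the selected fragments of a small, repetitive document.
index: list[int] = []
doc = b"def add(a, b):\n    return a + b\n" * 20
for frag in select(chunks(doc)):
    add_if_new(index, fingerprint(frag))
```

In this sketch each round reads a longer prefix of the same 64-bit string, so a single stored fingerprint serves every precision level and the cheap 16-bit comparison discards most non-matches before the full comparison runs.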

This publication describes the general context and the problems this method attempts to resolve, then describes the method itself. It is complemented by example applications to AI-generated code search and to software origin discovery for software supply chain security. The methods described here also have other applications, including deduplication of training data for AI and LLMs.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.

Recommended Citation

Ombredanne, Philippe, "Method for Approximate Content Discovery", Technical Disclosure Commons, (November 21, 2024)
https://www.tdcommons.org/dpubs_series/7579


