Abstract

Duplicate entries within structured data needlessly consume storage and create inefficiencies, because knowledge workers must process the same information multiple times. Current approaches to identifying and removing duplicates rely on exact lexical matching, which fails to detect entries that are lexically different but semantically identical. This disclosure describes techniques that use generative artificial intelligence (GenAI) to automate the identification and removal of semantically duplicate entries within structured data at scale. Per the techniques, the named and typed values extracted from a given tuple within the structured data are processed by a GenAI model, which identifies and drops content that is semantically irrelevant for the task while retaining the original order of the values in the tuple. Hashes of the processed tuples can then be compared to identify and remove duplicate entries at low computational cost, i.e., with linear time complexity. The operation can be easily customized to support the identification and removal of semantic matches for any task that involves working with structured data.
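The hash-and-compare step described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosure's implementation. All names here (`toy_normalize`, `row_hash`, `dedupe`) are hypothetical, and `toy_normalize` is a simple rule-based placeholder that only stands in for the GenAI model call, which the abstract does not specify in detail.

```python
import hashlib
import json
import re
from typing import Callable, Dict, Iterable, List

# One tuple of the structured data: named values, kept in their original order.
Row = Dict[str, object]


def toy_normalize(row: Row) -> Row:
    """Hypothetical placeholder for the GenAI step.

    A real system would ask a model to drop content that is semantically
    irrelevant for the task. This stand-in only strips punctuation, case,
    and extra whitespace from string values, preserving field order.
    """
    cleaned: Row = {}
    for name, value in row.items():  # dicts preserve insertion order
        if isinstance(value, str):
            value = re.sub(r"[^\w\s]", "", value).lower()
            value = " ".join(value.split())
        cleaned[name] = value
    return cleaned


def row_hash(row: Row) -> str:
    """Hash the (name, type, value) triples of a processed tuple, in order."""
    payload = [[name, type(value).__name__, value] for name, value in row.items()]
    encoded = json.dumps(payload, ensure_ascii=False, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def dedupe(rows: Iterable[Row],
           normalize: Callable[[Row], Row] = toy_normalize) -> List[Row]:
    """Keep the first row seen for each distinct hash of the processed tuple."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        digest = row_hash(normalize(row))
        if digest not in seen:  # average O(1) set lookup
            seen.add(digest)
            unique.append(row)  # keep the original, unprocessed row
    return unique
```

Because each row is processed and hashed once and then checked against a set, the comparison stage is a single pass over the data, which matches the linear time complexity the abstract claims. Swapping `normalize` for a different function is one way the operation could be customized per task, under the assumption that the model call returns a processed tuple with the same field order.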

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Yim, K. S., "Automated Scalable Identification and Removal of Duplicates Within Structured Data", Technical Disclosure Commons, (April 09, 2025)
https://www.tdcommons.org/dpubs_series/7985

Technical Disclosure Commons

Defensive Publications Series

Automated Scalable Identification and Removal of Duplicates Within Structured Data
