Inventor(s)

K. S. Yim

Abstract

The presence of duplicate entries within structured data needlessly consumes storage resources and causes inefficiencies by requiring knowledge workers to process the same information multiple times. Currently, duplicate entries are identified and removed based on exact lexical matches, an approach that fails to detect duplicates that are lexically different but semantically identical. This disclosure describes techniques to automate the identification and removal of semantically duplicate entries within structured data at scale by employing generative artificial intelligence (GenAI) techniques. Per the techniques, named and typed values extracted from a given tuple within the structured data are processed using a GenAI model to identify and drop content that is semantically irrelevant for the task, while retaining the original order of the values in the tuple. Hashes of the processed tuples can then be compared to identify and remove duplicate entries at low computational cost, i.e., in time linear in the number of entries. The operation can be easily customized to support the identification and removal of semantic matches for any task that involves working with structured data.
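
The hash-based comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The names `Entry`, `stub_normalize`, `entry_hash`, and `deduplicate` are hypothetical, and `stub_normalize` is only an offline stand-in for the GenAI step: it lowercases values and strips punctuation, whereas the disclosure has a model drop content that is semantically irrelevant for the task. SHA-256 is likewise an assumed choice of hash.

```python
import hashlib
import json
from typing import Callable

# One structured-data entry: an ordered list of (name, type, value) triples.
Entry = list[tuple[str, str, str]]


def stub_normalize(entry: Entry) -> Entry:
    """Hypothetical stand-in for the GenAI step (assumption, not the disclosed model).

    Lowercases each value, drops punctuation, and collapses whitespace so the
    sketch runs offline. The order of values in the tuple is preserved.
    """
    cleaned = []
    for name, type_, value in entry:
        kept = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
        cleaned.append((name, type_, " ".join(kept.split())))
    return cleaned


def entry_hash(entry: Entry) -> str:
    """Hash the processed tuple; SHA-256 over a JSON encoding is an assumed choice."""
    payload = json.dumps(entry, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(entries: list[Entry], normalize: Callable[[Entry], Entry]) -> list[Entry]:
    """Keep the first entry seen for each hash of its normalized form."""
    seen: set[str] = set()
    unique: list[Entry] = []
    for entry in entries:
        digest = entry_hash(normalize(entry))
        if digest not in seen:  # average-case constant-time set lookup
            seen.add(digest)
            unique.append(entry)  # the original, unprocessed entry is retained
    return unique


rows = [
    [("title", "string", "Hello, World!"), ("year", "int", "2024")],
    [("title", "string", "hello world"), ("year", "int", "2024")],
    [("title", "string", "Goodbye"), ("year", "int", "2024")],
]
result = deduplicate(rows, stub_normalize)
```

Each entry is normalized and hashed once, and set membership checks take constant time on average, so the pass over the data is linear in the number of entries. Replacing `stub_normalize` with a function that calls a model is where task-specific customization would take place under this sketch.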

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
