Abstract
Companies often need to determine the provenance of binary files that they receive from an upstream source. Several methods exist to fingerprint a binary, including string searches, comparison of function names, and code clone detection. Some of these methods use a scoring mechanism, or chop a file into smaller pieces, compute a hash for each piece, and look the hashes up in a database. This document describes a method that turns a sequence of strings and other identifiers extracted from a file into a locality-sensitive hash using TLSH. That hash can then be searched in a specialized data structure called a Vantage Point Tree (VPT) to quickly find the closest match in a collection of TLSH hashes of known files. A close match (where “close” is defined by a threshold value) means that the file found has an identical or nearly identical set of identifiers and is therefore likely to have been built from the same source code.
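The lookup described above can be illustrated with a minimal sketch of a vantage point tree and a threshold check. This is not the implementation from the disclosure. The names `build_vpt`, `nearest`, `closest_match`, and `hex_distance` are hypothetical, the 8-character hex strings are made-up placeholders rather than real TLSH hashes, and `hex_distance` (a count of differing hex digits) is only a stand-in for the TLSH distance score. With the py-tlsh bindings, `tlsh.diff` would presumably take its place.

```python
import random


def hex_distance(a, b):
    """Stand-in metric (assumption): number of positions where two
    equal-length hex strings differ. A real setup would use the TLSH
    distance score instead."""
    return sum(x != y for x, y in zip(a, b))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance <= radius
        self.outside = outside  # subtree of points with distance > radius


def build_vpt(points, distance, rng=None):
    """Recursively build a vantage point tree from a list of points."""
    if not points:
        return None
    rng = rng or random.Random(0)
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    dists = [distance(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vp, radius,
                  build_vpt(inside, distance, rng),
                  build_vpt(outside, distance, rng))


def nearest(node, query, distance, best=None):
    """Return (distance, point) for the stored point closest to query."""
    if node is None:
        return best
    d = distance(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Visit the side the query falls in first; visit the other side only
    # if the triangle inequality allows it to hold a closer point.
    if d <= node.radius:
        best = nearest(node.inside, query, distance, best)
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, distance, best)
    else:
        best = nearest(node.outside, query, distance, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, distance, best)
    return best


def closest_match(tree, query, distance, threshold):
    """Return the nearest (distance, point) if it is within threshold,
    otherwise None (no known file is close enough)."""
    best = nearest(tree, query, distance)
    if best is not None and best[0] <= threshold:
        return best
    return None


# Placeholder "hashes" of known files (made up for illustration).
known = ["00000000", "0000FFFF", "FFFF0000", "FFFFFFFF", "12345678"]
tree = build_vpt(known, hex_distance)

print(closest_match(tree, "0000FFF0", hex_distance, threshold=2))
print(closest_match(tree, "89ABCDEF", hex_distance, threshold=2))
```

The threshold is what separates a likely match from a merely nearest neighbour: the tree always returns some closest entry, so a query whose best distance exceeds the threshold is reported as having no match. The pruning step relies on the triangle inequality, so it assumes the distance function behaves like a metric.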
Keywords: proximity matching, open source license compliance, provenance, tlsh
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Hemel, Armijn, "Finding a closest match for an ELF file based on proximity matching of extracted identifiers", Technical Disclosure Commons, (May 24, 2022)
https://www.tdcommons.org/dpubs_series/5155