Armijn Hemel


A common task for companies is to establish the provenance of binary files that they receive from an upstream source. Several methods exist to fingerprint a binary, including searching for strings, looking at function names, and code clone detection. Some of these methods use a scoring mechanism; others chop a file into smaller pieces, compute a hash of each piece, and look the hashes up in a database. This document describes a method that turns a sequence of strings and other identifiers into a locality-sensitive hash using TLSH. That hash can then be looked up in a specialized data structure called a Vantage Point Tree (VPT) to quickly find the closest match in a collection of TLSH hashes of known files. A close match (where “close” is defined by a threshold value) means that the file found has a (near) identical set of identifiers and is therefore likely to have been built from the same source code.

Keywords: proximity matching, open source license compliance, provenance, tlsh
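
To make the lookup step described in the abstract concrete, the following is a minimal Python sketch of a vantage point tree with a nearest-neighbour search and a threshold check. It is an illustration under stated assumptions, not the implementation used in this work: the names `Node`, `build`, `nearest`, and `closest_match` are hypothetical, and the distance function is a plain Hamming distance over short hex strings that stands in for the TLSH distance so that the example runs without any third-party library. In a real setup the stored values would be TLSH hashes and the distance function would be the TLSH difference score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Distance = Callable[[str, str], float]


def hamming(a: str, b: str) -> int:
    """Stand-in distance (NOT the TLSH score): count of differing positions."""
    return sum(x != y for x, y in zip(a, b))


@dataclass
class Node:
    point: str                       # the vantage point stored at this node
    radius: float = 0.0              # median distance from the vantage point
    inside: Optional["Node"] = None  # points with distance < radius
    outside: Optional["Node"] = None # points with distance >= radius


def build(points: List[str], dist: Distance, rng: random.Random) -> Optional[Node]:
    """Recursively split the points around a randomly chosen vantage point."""
    if not points:
        return None
    points = list(points)
    node = Node(points.pop(rng.randrange(len(points))))
    if points:
        scored = [(dist(node.point, p), p) for p in points]
        node.radius = sorted(d for d, _ in scored)[len(scored) // 2]
        node.inside = build([p for d, p in scored if d < node.radius], dist, rng)
        node.outside = build([p for d, p in scored if d >= node.radius], dist, rng)
    return node


def nearest(node: Optional[Node], query: str, dist: Distance,
            best: Tuple[float, Optional[str]] = (float("inf"), None)
            ) -> Tuple[float, Optional[str]]:
    """Return (distance, point) of the closest stored point to the query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        # Query lies inside the ball: search there first, then the outside
        # only if a closer point could still exist across the boundary.
        best = nearest(node.inside, query, dist, best)
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


def closest_match(tree: Optional[Node], query: str, dist: Distance,
                  threshold: float) -> Optional[Tuple[float, str]]:
    """Return the nearest point only if it is within the threshold."""
    d, point = nearest(tree, query, dist)
    return (d, point) if point is not None and d <= threshold else None


# Hypothetical "known file" fingerprints; real entries would be TLSH hashes.
known = ["A1B2C3D4", "A1B2C3D5", "FFFF0000", "0000FFFF", "12345678"]
tree = build(known, hamming, random.Random(0))

print(closest_match(tree, "A1B2C3E5", hamming, threshold=2))
print(closest_match(tree, "99999999", hamming, threshold=2))
```

The pruning in `nearest` relies on the triangle inequality, which holds exactly for the Hamming stand-in. Whether a given TLSH distance satisfies it closely enough for exact results is an assumption that would need to be checked for the data at hand. The threshold value of 2 is arbitrary and chosen only to show one query that matches and one that does not.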

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.