A common task for companies is to clear software source code files for legal or security reasons before software developers can use them. The clearing process is tool driven, using tools such as code clone detectors/snippet matchers, license scanners and security scanners. Typically the clearing process starts from scratch for each new file that is analyzed. It does not exploit the fact that open source software mostly changes incrementally, so the software being scanned is likely to be nearly identical to previously seen software. For a (large) subset of files this characteristic can be used to (semi-)automate the process. When scanning a new file, first find the closest file in a set of known files, then compute the difference to that known file, determine where in the file the difference is located, and use rules to decide what action to take depending on that location.
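The steps described above can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the disclosure: Python's standard `difflib` stands in for a proximity hash such as TLSH, a crude prefix check for C-style comments stands in for parsing with srcML, and the function names, the 0.8 similarity threshold and the action labels are all hypothetical.

```python
import difflib

# Assumption: a crude stand-in for real parsing (e.g. with srcML).
# Only C-style comment lines are recognised, and a code line that
# happens to start with "*" would be misclassified.
COMMENT_PREFIXES = ("//", "/*", "*")


def is_comment(line):
    """Return True if the line looks like a C-style comment line."""
    return line.strip().startswith(COMMENT_PREFIXES)


def closest_known(new_text, known_files):
    """Find the most similar known file.

    Assumption: difflib's similarity ratio replaces a proximity hash
    such as TLSH, which would scale far better to large file sets.
    """
    best_name, best_score = None, 0.0
    for name, text in known_files.items():
        matcher = difflib.SequenceMatcher(None, new_text, text, autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def changed_lines(old_text, new_text):
    """Return every line that was added, removed or replaced."""
    old, new = old_text.splitlines(), new_text.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(old[i1:i2])
            changed.extend(new[j1:j2])
    return changed


def decide_action(new_text, known_files, threshold=0.8):
    """Apply hypothetical rules based on where the difference is."""
    name, score = closest_known(new_text, known_files)
    if name is None or score < threshold:
        return "full_clearing"  # nothing similar enough is known
    changed = [line for line in changed_lines(known_files[name], new_text)
               if line.strip()]
    if not changed:
        return "reuse_previous_result"  # identical apart from blank lines
    in_comments = [is_comment(line) for line in changed]
    if all(in_comments):
        return "rescan_license"  # only comments changed
    if not any(in_comments):
        return "rescan_code"  # only code changed
    return "full_clearing"  # both comments and code changed
```

Calling `decide_action(new_text, known_files)` with `known_files` mapping file names to previously cleared contents returns one of the four labels. A real rule set would likely be finer grained, for example by distinguishing a changed license header from an ordinary comment.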
Keywords: proximity matching, open source license compliance, clearing, tlsh, srcml
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Hemel, Armijn, "Automated clearing of software source code files using proximity matching and parsing file contents", Technical Disclosure Commons, (March 14, 2022)