A collection of code bases containing source-destination pairs, in which code is translated from a source environment to a destination environment, can be highly valuable for training artificial intelligence (AI) or machine learning (ML) models. However, a code base may include private or sensitive information, such as variable names specific to a particular party, which makes it unsuitable for such use. This disclosure describes techniques to automatically remove sensitive information from code so that the code can be used as training data for AI/ML models. Each side of a source-destination pair of translated code is transformed into its corresponding abstract syntax tree (AST). The ASTs are then anonymized such that they retain the syntactic structure of the code while excising semantic information. The resulting anonymized ASTs (AASTs) of source-destination code pairs can serve as a safe, shared corpus of data for training AI/ML models.
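The anonymization step can be illustrated with a minimal sketch. It is not the disclosure's implementation: it assumes the code is Python and uses the standard-library `ast` module (Python 3.9 or later for `ast.unparse`), and the class name `Anonymizer`, the function `anonymize`, and the `id_N` placeholder scheme are hypothetical choices made for illustration. The sketch replaces identifiers with positional placeholders and blanks literal values, leaving only the syntactic shape of the code.

```python
import ast


class Anonymizer(ast.NodeTransformer):
    """Hypothetical helper: swap identifiers and literals for neutral placeholders."""

    def __init__(self):
        # Maps each original identifier to a stable placeholder such as "id_0".
        self.names = {}

    def _alias(self, original):
        if original not in self.names:
            self.names[original] = f"id_{len(self.names)}"
        return self.names[original]

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node):
        # Note: this simple sketch also renames builtins such as `print`.
        node.id = self._alias(node.id)
        return node

    def visit_Attribute(self, node):
        node.attr = self._alias(node.attr)
        self.generic_visit(node)
        return node

    def visit_Constant(self, node):
        # Keep the literal's kind but discard its possibly sensitive value.
        if isinstance(node.value, str):
            node.value = "STR"
        elif isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            node.value = 0
        return node


def anonymize(source):
    """Parse source into an AST, anonymize it, and render it back as text."""
    tree = Anonymizer().visit(ast.parse(source))
    return ast.unparse(tree)


sample = "def acme_total(acme_price, qty):\n    return acme_price * qty + 5\n"
print(anonymize(sample))
# def id_0(id_1, id_2):
#     return id_1 * id_2 + 0
```

Two snippets that differ only in naming and literal values map to the same anonymized form, which is the property that lets the structure be shared without the party-specific details. A real pipeline for translated code pairs would need a parser for each of the source and destination languages; this sketch covers only a single Python input.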

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.