Abstract
This disclosure describes a method for embedding fingerprints of training datasets into machine learning model checkpoints and signatures. The embedded fingerprints allow auditable tracing of the datasets used to train a model, including a model that is derived from a base model or is part of a sequence of models. Each authorized dataset is registered with a unique identifier that includes its size, number of examples, and provenance. A dataset fingerprint is carried over to subsequent models, along with new fingerprints for any additional datasets used to train those models. A checkpoint-saving system cross-validates training dataset size against the recorded fingerprints and fails if there is a mismatch. The techniques provide a systematic, auditable way to track dataset usage, which supports compliance audits and greater trust and reliability in released models. They also help organizations that build and provide machine learning models maintain a record of dataset usage, mitigating legal and financial risks.
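The flow summarized in the abstract (register a dataset, carry its fingerprint into derived models, and refuse to save a checkpoint on a size mismatch) can be illustrated with a minimal Python sketch. All names here (`DatasetFingerprint`, `ModelSignature`, `save_checkpoint`) are hypothetical and do not come from the disclosure. The sketch also makes two assumptions: the identifier is a SHA-256 hash over the registered fields, and the size check covers only the datasets read in the current training run, with inherited fingerprints carried over unchanged.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DatasetFingerprint:
    """Hypothetical registration record for one authorized dataset."""
    name: str
    size_bytes: int
    num_examples: int
    provenance: str

    @property
    def identifier(self) -> str:
        # Assumption: the unique identifier is a hash over the registered fields.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass
class ModelSignature:
    """Hypothetical signature listing the fingerprints a model was trained on."""
    model_name: str
    fingerprints: list = field(default_factory=list)

    def derive(self, model_name, new_datasets):
        # A derived model inherits every earlier fingerprint and appends new ones.
        return ModelSignature(model_name, list(self.fingerprints) + list(new_datasets))


def save_checkpoint(signature, observed_sizes):
    """Validate this run's datasets against the signature, then build a record.

    observed_sizes maps a dataset identifier to the number of bytes actually
    read during the current training run (an assumed bookkeeping format).
    """
    registered = {fp.identifier: fp for fp in signature.fingerprints}
    for dataset_id, observed in observed_sizes.items():
        fp = registered.get(dataset_id)
        if fp is None:
            raise ValueError(f"dataset {dataset_id} is not in the model signature")
        if observed != fp.size_bytes:
            raise ValueError(
                f"size mismatch for {fp.name}: "
                f"recorded {fp.size_bytes}, observed {observed}"
            )
    # The checkpoint record carries the full lineage, inherited and new.
    return {
        "model": signature.model_name,
        "dataset_ids": [fp.identifier for fp in signature.fingerprints],
    }
```

In this sketch, an auditor could compare the `dataset_ids` list in a saved record against a registry of authorized datasets. A production system would also need to protect the record itself against tampering, which is outside the scope of this illustration.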
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Max, Lenord Melvix Joseph Stephen, "Dataset Identifiers in Machine Learning Model Signature for Lineage Tracing", Technical Disclosure Commons, (July 30, 2025)
https://www.tdcommons.org/dpubs_series/8408