Abstract

This disclosure describes a method for embedding fingerprints of training datasets into machine learning model checkpoints and signatures. The embedded fingerprints allow auditable lineage tracing of the datasets used to train a model, including a model that is derived from a base model or is part of a sequence of models. Each authorized dataset is registered with a unique identifier that includes its size, number of examples, and provenance. A derived model carries over the dataset fingerprints of its predecessor and adds new fingerprints for any additional datasets used in its own training. A checkpoint saving system cross-validates the training dataset size against the recorded fingerprints and fails if there is a mismatch. The technique gives organizations that build and provide machine learning models a systematic, auditable record of dataset usage, which supports compliance audits, increases trust in and reliability of released models, and mitigates legal and financial risks.
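The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names `DatasetFingerprint`, `ModelSignature`, `derive`, `save_checkpoint`, and `LineageMismatchError` are hypothetical, the SHA-256 digest over the registered fields is one possible fingerprint format, and "dataset size" is interpreted here as the number of examples consumed during training.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DatasetFingerprint:
    """Registered identifier for one authorized dataset (hypothetical layout)."""
    dataset_id: str
    size_bytes: int
    num_examples: int
    provenance: str

    def digest(self) -> str:
        # Assumption: the fingerprint is a SHA-256 hash of the registered fields.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


@dataclass
class ModelSignature:
    """Model signature carrying the fingerprints of every dataset in its lineage."""
    model_name: str
    fingerprints: list = field(default_factory=list)

    def derive(self, model_name, new_fingerprints):
        # A derived model inherits the parent's fingerprints and appends its own.
        combined = list(self.fingerprints) + list(new_fingerprints)
        return ModelSignature(model_name, combined)


class LineageMismatchError(Exception):
    """Raised when observed training data does not match a recorded fingerprint."""


def save_checkpoint(signature, examples_seen):
    """Validate observed example counts against fingerprints, then build a record.

    `examples_seen` maps dataset_id -> number of examples read in this training run.
    """
    recorded = {fp.dataset_id: fp for fp in signature.fingerprints}
    for dataset_id, seen in examples_seen.items():
        fp = recorded.get(dataset_id)
        if fp is None:
            raise LineageMismatchError(f"unregistered dataset: {dataset_id}")
        if seen != fp.num_examples:
            raise LineageMismatchError(
                f"{dataset_id}: saw {seen} examples, fingerprint records {fp.num_examples}"
            )
    # Only reached when every dataset used in this run matches its fingerprint.
    return {
        "model": signature.model_name,
        "datasets": [
            {"id": fp.dataset_id, "digest": fp.digest()}
            for fp in signature.fingerprints
        ],
    }
```

In this sketch, a base signature holding one fingerprint can call `derive` with a second fingerprint to produce a child signature that lists both datasets. Calling `save_checkpoint` on the child with an example count that differs from the registered one raises `LineageMismatchError` instead of writing a record, which mirrors the fail-on-mismatch behavior described above.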

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Max, Lenord Melvix Joseph Stephen, "Dataset Identifiers in Machine Learning Model Signature for Lineage Tracing", Technical Disclosure Commons, (July 30, 2025)
https://www.tdcommons.org/dpubs_series/8408


Technical Disclosure Commons

Defensive Publications Series

Dataset Identifiers in Machine Learning Model Signature for Lineage Tracing

