Abstract
This disclosure presents a methodology for improving the precision of machine learning models used in high-stakes domains, specifically systems that defend against malicious advertising activity (AdSpam defense). The approach addresses the noisy and incomplete positive training labels common in real-world datasets. It systematically refines the positive label set so that only the most reliable and consistently represented spam patterns are retained. To construct this high-quality training dataset, the technique combines model-based explainability, specifically SHapley Additive exPlanations (SHAP), with feature-based clustering. The refinement mitigates the negative effects of ambiguous labels, unknown negative examples, and sparsely represented, long-tail spam patterns. Initial experiments indicate the potential for substantial improvement in model performance: in one example, AUC-ROC increased from 0.906 to 0.998.
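The abstract does not spell out the selection procedure, so the following is only a toy illustration of the general idea, not the authors' implementation. It assumes a linear scoring model, for which the SHAP value of feature i under feature independence is w_i * (x_i - baseline_i), and it replaces feature-based clustering with a much cruder grouping by each example's top attributed features. The function names, the `top_k` and `min_cluster_size` parameters, the three features, and all data values are hypothetical.

```python
from collections import defaultdict


def linear_attributions(weights, baseline, x):
    """Per-feature attribution w_i * (x_i - baseline_i).

    For a linear model with independent features this equals the SHAP value.
    """
    return [w * (xi - b) for w, xi, b in zip(weights, x, baseline)]


def signature(attributions, top_k=2):
    """Sorted indices of the top_k features with strictly positive attribution."""
    ranked = sorted(range(len(attributions)),
                    key=lambda i: attributions[i], reverse=True)
    return tuple(sorted(i for i in ranked[:top_k] if attributions[i] > 0))


def select_reliable_positives(positives, weights, baseline,
                              top_k=2, min_cluster_size=3):
    """Group positives by explanation signature and keep only large groups.

    Assumption: a shared signature stands in for cluster membership. Examples
    with an empty signature (no feature pushes toward "spam") or in a group
    smaller than min_cluster_size are treated as ambiguous or long-tail.
    """
    clusters = defaultdict(list)
    for idx, x in enumerate(positives):
        attrs = linear_attributions(weights, baseline, x)
        clusters[signature(attrs, top_k)].append(idx)
    kept = sorted(
        i
        for sig, members in clusters.items()
        if sig and len(members) >= min_cluster_size
        for i in members
    )
    return kept, dict(clusters)


# Hypothetical features: [click_rate, ip_reuse, dwell_time]
weights = [2.0, 1.5, -1.0]
baseline = [0.1, 0.1, 0.5]
positives = [
    [0.9, 0.8, 0.1],    # explained mainly by click_rate and ip_reuse
    [0.8, 0.9, 0.2],    # same pattern
    [0.95, 0.7, 0.3],   # same pattern
    [0.1, 0.1, 0.05],   # explained only by dwell_time: a singleton pattern
    [0.1, 0.1, 0.5],    # identical to the baseline: no supporting evidence
]

kept, clusters = select_reliable_positives(positives, weights, baseline)
print(kept)
```

In this sketch the three examples that share an explanation pattern are retained as positive labels, while the singleton and the unexplained example are set aside. A fuller pipeline would presumably compute attributions from the trained model with a SHAP library and apply a proper clustering algorithm, then retrain on the retained labels.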
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Liu, Fei and Zhao, Manqi, "Enhancing Model Precision with Quality-Aware Label Selection", Technical Disclosure Commons, (March 19, 2026)
https://www.tdcommons.org/dpubs_series/9567