Machine learning models that perform grammar error correction (GEC) suffer from insufficient training data. This disclosure describes techniques that automatically generate a large corpus of training data for GEC and other natural language processing tasks. With specific user permission, the techniques leverage the edit histories of documents by identifying changes that are attributable to grammatical corrections made by users. The training set for the GEC machine learning model is automatically augmented with sentences known to be ungrammatical (e.g., the original text, before revision by the user) or grammatical (e.g., the text after revision by the user), each labeled as such. The techniques thereby provide a very large corpus of training data for GEC and other natural language processing ML models.
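As an illustration of the labeling step, the following is a minimal sketch, not the disclosed implementation. It assumes the edit history is already available as (before, after) sentence pairs, and it uses a small token-level change as a stand-in heuristic for "attributable to a grammatical correction"; the disclosure does not specify how such edits are identified. The function names, the `max_changed` threshold, and the 0/1 label encoding are all hypothetical.

```python
from difflib import SequenceMatcher


def changed_tokens(before: str, after: str) -> int:
    """Count whitespace-separated tokens that differ between two sentences."""
    a, b = before.split(), after.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of each non-matching span.
            changed += max(i2 - i1, j2 - j1)
    return changed


def label_revisions(revisions, max_changed=2):
    """Turn (before, after) pairs into labeled training examples.

    Assumption: an edit touching at most `max_changed` tokens is treated
    as a grammatical correction. Label 0 = ungrammatical (text before
    revision), label 1 = grammatical (text after revision).
    """
    examples = []
    for before, after in revisions:
        n = changed_tokens(before, after)
        if 0 < n <= max_changed:
            examples.append((before, 0))
            examples.append((after, 1))
    return examples


revisions = [
    ("She go to school every day.", "She goes to school every day."),
    ("The results was surprising.", "We rewrote the whole introduction from scratch."),
    ("No change here.", "No change here."),
]
examples = label_revisions(revisions)
```

Under these assumptions, only the first pair is kept: the second is a wholesale rewrite and the third has no change. A real pipeline would likely need a stronger filter than token counts, since a small edit can also change meaning rather than grammar.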
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Gupta, Shruti; Gulati, Anmol; and Hoskere, Jayakumar, "Automatic Generation of Training Corpus for Natural Language Processing Tasks", Technical Disclosure Commons, (October 06, 2020)