Abstract

Machine learning models that perform grammatical error correction (GEC) are limited by a shortage of training data. This disclosure describes techniques that automatically generate a large corpus of training data for GEC and other natural language processing tasks. With specific user permission, the techniques use the edit histories of documents to identify changes that are attributable to grammatical corrections made by users. The training set for the GEC machine learning model is then automatically augmented with sentences known to be ungrammatical (e.g., the original text, before revision by the user) or grammatical (e.g., the text after revision by the user), each labeled accordingly. The result is a very large corpus of training data for GEC and other natural language processing ML models.
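As a rough illustration of the idea, the sketch below compares two consecutive revisions of a document and emits labeled sentences. It is not the method of the disclosure, which does not specify how grammatical corrections are identified. The function names (`split_sentences`, `looks_like_small_correction`, `labeled_examples`), the naive sentence splitter, and the 0.8 token-similarity threshold are all assumptions made for illustration. The heuristic treats a sentence that changed only slightly between revisions as a candidate grammatical correction.

```python
import difflib
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def looks_like_small_correction(before, after, min_ratio=0.8):
    # Hypothetical heuristic: a sentence that changed, but only slightly at the
    # token level, is treated as a candidate grammatical correction.
    if before == after:
        return False
    ratio = difflib.SequenceMatcher(None, before.split(), after.split()).ratio()
    return ratio >= min_ratio


def labeled_examples(old_revision, new_revision):
    # Align the sentences of two consecutive revisions and label each
    # lightly edited pair: before = ungrammatical, after = grammatical.
    old_s = split_sentences(old_revision)
    new_s = split_sentences(new_revision)
    matcher = difflib.SequenceMatcher(None, old_s, new_s, autojunk=False)
    examples = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only consider one-to-one sentence replacements.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for before, after in zip(old_s[i1:i2], new_s[j1:j2]):
            if looks_like_small_correction(before, after):
                examples.append((before, "ungrammatical"))
                examples.append((after, "grammatical"))
    return examples
```

Calling `labeled_examples("She go to school every day. The weather is nice.", "She goes to school every day. The weather is nice.")` yields one labeled pair for the first sentence and nothing for the unchanged second sentence. A production pipeline would need a stronger filter than token similarity, since small edits can also be factual or stylistic changes.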

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Gupta, Shruti; Gulati, Anmol; and Hoskere, Jayakumar, "Automatic Generation of Training Corpus for Natural Language Processing Tasks", Technical Disclosure Commons, (October 06, 2020)
https://www.tdcommons.org/dpubs_series/3659


Technical Disclosure Commons

Defensive Publications Series

Automatic Generation of Training Corpus for Natural Language Processing Tasks
