Abstract

Artificial intelligence models can struggle to understand and generate code-switched text, that is, text that alternates between two or more languages, and some existing synthetic data generation methods can produce grammatically incorrect or unnatural-sounding results. This disclosure describes a multi-stage system for generating code-switched text from monolingual corpora. The method can employ a probabilistic constraint model to analyze a source sentence and identify syntactically and semantically appropriate points for a language switch. Based on these identified points, a candidate generation engine can create one or more raw code-switched sentences. A large generative model may then refine these raw candidates to address potential grammatical errors at the language junction and improve overall fluency. The process can be used to produce large-scale, code-switched datasets for training multilingual AI systems.
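The three stages described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the part-of-speech probabilities in `SWITCH_PROB`, the English-to-Spanish `LEXICON`, the `threshold` value, and all function names are hypothetical stand-ins chosen for illustration. A real system would presumably use a learned constraint model and a translation component, and the `refine` step here is only a placeholder for the call to a large generative model.

```python
from dataclasses import dataclass

# Hypothetical toy resources (assumptions, not from the disclosure).
# SWITCH_PROB stands in for the probabilistic constraint model.
SWITCH_PROB = {"NOUN": 0.8, "ADJ": 0.5, "VERB": 0.3, "DET": 0.05, "PRON": 0.05}
# LEXICON stands in for a translation component (English -> Spanish).
LEXICON = {"book": "libro", "red": "rojo", "bought": "compré"}


@dataclass
class Token:
    text: str
    pos: str  # coarse part-of-speech tag


def switch_points(tokens, threshold=0.4):
    """Stage 1: return token indices whose toy switch probability
    meets the threshold and for which a translation is available."""
    return [
        i
        for i, tok in enumerate(tokens)
        if SWITCH_PROB.get(tok.pos, 0.0) >= threshold and tok.text in LEXICON
    ]


def generate_candidates(tokens, points):
    """Stage 2: build one raw candidate per switch point by swapping
    the token at that index for its lexicon translation."""
    candidates = []
    for i in points:
        words = [t.text for t in tokens]
        words[i] = LEXICON[words[i]]
        candidates.append(" ".join(words))
    return candidates


def refine(candidate):
    """Stage 3: placeholder for generative-model refinement.
    Returns the candidate unchanged in this sketch."""
    return candidate


sentence = [
    Token("I", "PRON"),
    Token("bought", "VERB"),
    Token("a", "DET"),
    Token("red", "ADJ"),
    Token("book", "NOUN"),
]
points = switch_points(sentence)
results = [refine(c) for c in generate_candidates(sentence, points)]
print(results)
```

The raw candidates in this sketch are left ungrammatical on purpose (for example, the adjective order at the junction is not adjusted). Repairing such junction errors is the role the abstract assigns to the refinement stage.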

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.