Abstract
Artificial intelligence models can experience difficulty understanding and generating code-switched text, and some existing synthetic data generation methods can produce grammatically incorrect or unnatural-sounding results. This disclosure describes a multi-stage system for generating code-switched text from monolingual corpora. The method can employ a probabilistic constraint model to analyze a source sentence and identify syntactically and semantically appropriate points for a language switch. Based on these identified points, a candidate generation engine can create one or more raw code-switched sentences. A large generative model may then refine these raw candidates to address potential grammatical errors at the language junction and improve overall fluency. The process can be used to produce large-scale, code-switched datasets for training multilingual AI systems.
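The three stages named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the per-tag switch probabilities in `SWITCH_PROB`, the 0.4 threshold, the toy English-to-Spanish `lexicon`, and the function names are all hypothetical. The refinement stage is represented only by an optional callable standing in for a large generative model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical switch probabilities per part-of-speech tag.
# A real constraint model would presumably be learned from data.
SWITCH_PROB: Dict[str, float] = {
    "NOUN": 0.8, "ADJ": 0.5, "VERB": 0.3, "DET": 0.05, "PRON": 0.05,
}


@dataclass
class Token:
    text: str
    pos: str  # part-of-speech tag, assumed to be supplied by an upstream tagger


def score_switch_points(tokens: List[Token], threshold: float = 0.4) -> List[int]:
    """Stage 1: return indices whose assumed switch probability meets the threshold."""
    return [i for i, t in enumerate(tokens)
            if SWITCH_PROB.get(t.pos, 0.0) >= threshold]


def generate_candidates(tokens: List[Token], points: List[int],
                        lexicon: Dict[str, str]) -> List[str]:
    """Stage 2: build one raw candidate per switch point by swapping a single word."""
    candidates = []
    for i in points:
        target = lexicon.get(tokens[i].text.lower())
        if target is None:
            continue  # no translation available, so skip this switch point
        words = [t.text for t in tokens]
        words[i] = target
        candidates.append(" ".join(words))
    return candidates


def refine(candidate: str, model: Optional[Callable[[str], str]] = None) -> str:
    """Stage 3: placeholder for generative refinement; identity when no model is given."""
    return model(candidate) if model else candidate


def pipeline(tokens: List[Token], lexicon: Dict[str, str],
             model: Optional[Callable[[str], str]] = None) -> List[str]:
    points = score_switch_points(tokens)
    return [refine(c, model) for c in generate_candidates(tokens, points, lexicon)]


if __name__ == "__main__":
    sentence = [Token("I", "PRON"), Token("bought", "VERB"), Token("a", "DET"),
                Token("new", "ADJ"), Token("book", "NOUN")]
    toy_lexicon = {"new": "nuevo", "book": "libro"}  # illustrative only
    for line in pipeline(sentence, toy_lexicon):
        print(line)
```

In this sketch each candidate switches a single word, which keeps the example short; a fuller system would likely switch whole phrases and rely on the refinement model to repair agreement and word order at the language junction.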
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Kumar S R, Mithun, "Generating Context-Aware Code-Switched Text Using Probabilistic Modeling and Generative Refinement", Technical Disclosure Commons, (October 27, 2025)
https://www.tdcommons.org/dpubs_series/8786