A system and method are disclosed that enable multiple text input modes on a device without requiring the user to explicitly specify the desired mode. A machine learning (ML) model handles and interprets the inputs (text, voice, handwriting, etc.). The ML model analyzes sequences of input data and trains itself so that the correct final output sequence is delivered to the application requiring the text input. A decoder combines the output of the sequence interpretation model with other knowledge sources, such as character or word recognition models, and applies a language model to obtain the most likely sequence of words or characters given all user inputs. Recognizing the most likely input sequence in this way enables automatic mode selection and eliminates the need for explicit segmentation between the modalities.
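The decoding step described above can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes each input segment has already been turned into candidate words with log-probabilities by some recognizer (keyboard, voice, or handwriting), and it assumes a simple bigram language model searched with the Viterbi algorithm. The names `decode`, `segments`, `bigram_logp`, `lm_weight`, and `unseen_logp`, and all probability values, are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical representation: one dict per input segment, mapping each
# candidate word to the log-probability assigned by whichever recognizer
# (keyboard, voice, handwriting) produced that segment.
Candidates = Dict[str, float]


def decode(segments: List[Candidates],
           bigram_logp: Dict[Tuple[str, str], float],
           lm_weight: float = 1.0,
           unseen_logp: float = math.log(1e-4)) -> List[str]:
    """Viterbi search for the word sequence that maximizes the sum of
    recognizer scores and weighted bigram language-model scores.
    Assumes every segment has at least one candidate."""
    if not segments:
        return []

    # best[word] = (score of best path ending in word, that path)
    best = {
        w: (lp + lm_weight * bigram_logp.get(("<s>", w), unseen_logp), [w])
        for w, lp in segments[0].items()
    }

    for cands in segments[1:]:
        nxt = {}
        for w, lp in cands.items():
            # Pick the predecessor that gives the best score after the
            # language-model transition into w.
            score, path = max(
                ((s + lm_weight * bigram_logp.get((prev, w), unseen_logp), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[w] = (score + lp, path + [w])
        best = nxt

    return max(best.values(), key=lambda t: t[0])[1]


# Illustrative values only: the first word is typed, the second is spoken.
segments = [
    {"ice": math.log(1.0)},                              # keyboard
    {"scream": math.log(0.55), "cream": math.log(0.45)}, # voice
]
bigram_logp = {
    ("<s>", "ice"): math.log(0.1),
    ("ice", "cream"): math.log(0.5),
    ("ice", "scream"): math.log(0.001),
}
result = decode(segments, bigram_logp)
print(result)  # ['ice', 'cream']
```

In this example the voice recognizer alone slightly prefers "scream", but the language model makes "ice cream" the most likely sequence. The decoder never needs to be told which modality produced which segment; it only consumes scored candidates, which is one way the explicit mode switch could be avoided.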

