Speech input, captured by on-device microphones, lets users provide voice commands or other text faster than typing, but transcription accuracy can suffer in certain environments, e.g., in the presence of background noise. Gesture input via a virtual keyboard is convenient but slower and also prone to errors, e.g., when a gesture resolves to multiple candidate words. Per the techniques of this disclosure, a user can provide spoken and gesture input to a user device at the same time. Each input is converted to text separately. The resulting stereo text stream is time-aligned using word-to-word distance to obtain time-aligned query text, which is then semantically filtered to generate the text conversion of the user input. The techniques can be implemented on any user device and used to process user input such as queries provided to a virtual assistant or other applications on the user device. Fusing the voice and gesture modes can increase the accuracy, speed, and reliability of text input.
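
The fusion pipeline described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the summary does not specify the distance measure, the alignment procedure, or the semantic filter. The sketch assumes that each recognizer emits its words in temporal order with a confidence score, that the word-to-word distance is a normalized Levenshtein edit distance, that alignment is a dynamic-programming sequence alignment with a fixed gap cost, and that picking the higher-confidence word of each aligned pair stands in for semantic filtering. The names `Word`, `word_distance`, `align`, `fuse`, and `GAP` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical recognizer score in [0, 1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def word_distance(a: str, b: str) -> float:
    """Assumed word-to-word distance: edit distance normalized to [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


GAP = 0.6  # assumed cost of leaving a word unpaired

Pair = Tuple[Optional[Word], Optional[Word]]


def align(speech: List[Word], gesture: List[Word]) -> List[Pair]:
    """Align the two word streams by minimizing total word-to-word distance."""
    n, m = len(speech), len(gesture)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * GAP
    for j in range(1, m + 1):
        cost[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = word_distance(speech[i - 1].text, gesture[j - 1].text)
            cost[i][j] = min(cost[i - 1][j - 1] + d,
                             cost[i - 1][j] + GAP,
                             cost[i][j - 1] + GAP)
    # Trace back from the end to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            d = word_distance(speech[i - 1].text, gesture[j - 1].text)
            if abs(cost[i][j] - (cost[i - 1][j - 1] + d)) < 1e-9:
                pairs.append((speech[i - 1], gesture[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and abs(cost[i][j] - (cost[i - 1][j] + GAP)) < 1e-9:
            pairs.append((speech[i - 1], None))
            i -= 1
        else:
            pairs.append((None, gesture[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def fuse(pairs: List[Pair]) -> str:
    """Stand-in for semantic filtering: keep the more confident word per pair."""
    out = []
    for s, g in pairs:
        if s is None:
            out.append(g.text)
        elif g is None:
            out.append(s.text)
        else:
            out.append(s.text if s.confidence >= g.confidence else g.text)
    return " ".join(out)


# Illustrative inputs: speech mishears "some" as "sum" with low confidence.
speech = [Word("play", 0.9), Word("sum", 0.4), Word("jazz", 0.8)]
gesture = [Word("play", 0.7), Word("some", 0.9), Word("jazz", 0.6)]
print(fuse(align(speech, gesture)))  # play some jazz
```

In this sketch, a word that one mode missed entirely is still recovered from the other mode, because the alignment leaves it unpaired rather than forcing a poor match. A real semantic filter would presumably use language context rather than per-word confidence alone.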

