Automatic speech recognition (ASR) machine learning models are used to recognize spoken commands or queries from users. End-to-end ASR models, which directly map a sequence of input acoustic features into a sequence of words, greatly simplify ASR system building and maintenance. This disclosure describes techniques to improve the performance of end-to-end ASR models by providing predicted user intents as additional inputs. An intent prediction vector, or intent embedding, is generated from user-permitted contextual features by a trained intent prediction network (IPN). The IPN can be trained independently of the ASR model or jointly with it. The IPN can be trained on data that includes user-permitted contextual features, even when that data does not include speech data. The IPN can be retrained when the set of available contextual features changes.
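The data flow described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the single-layer network, the random stand-in weights, the example contextual features, and the choice to concatenate the intent embedding onto every acoustic frame are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def make_ipn(num_context_features, embedding_dim, seed=0):
    """Build a toy one-layer IPN (assumption: random weights stand in for trained ones)."""
    rng = random.Random(seed)
    weights = [
        [rng.uniform(-0.5, 0.5) for _ in range(num_context_features)]
        for _ in range(embedding_dim)
    ]
    biases = [0.0] * embedding_dim
    return weights, biases


def predict_intent_embedding(ipn, context_features):
    """Map user-permitted contextual features to an intent embedding."""
    weights, biases = ipn
    return [
        math.tanh(sum(w * x for w, x in zip(row, context_features)) + b)
        for row, b in zip(weights, biases)
    ]


def augment_acoustic_frames(acoustic_frames, intent_embedding):
    """Append the intent embedding to each acoustic frame (assumed fusion method)."""
    return [list(frame) + list(intent_embedding) for frame in acoustic_frames]


# Hypothetical contextual features, e.g. flags or scores the user has permitted.
context = [1.0, 0.0, 0.25]
ipn = make_ipn(num_context_features=3, embedding_dim=4)
embedding = predict_intent_embedding(ipn, context)

# Three toy acoustic frames with two features each.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
asr_inputs = augment_acoustic_frames(frames, embedding)
```

Each entry of `asr_inputs` is what an end-to-end ASR model would receive per frame under this assumption: the original acoustic features followed by the same intent embedding. Because the IPN consumes only contextual features, it could be retrained on its own when the feature set changes, as long as the embedding dimension stays fixed.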
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Caseiro, Diamantino; Wu, Zelin; Aleksic, Petar; and Jain, Era, "Intent Prediction Based On Contextual Factors For Better Automatic Speech Recognition", Technical Disclosure Commons, (March 26, 2021)