Machine-generated speech transcription is a feature of several products, such as videoconferencing software and mobile operating systems. However, automatic transcribers often fail to recognize some types of real-world user speech accurately. Spoken terms that are phonetically similar but have different meanings can cause transcription errors. Although automatic transcribers evaluate several probable candidate phrases for what was spoken, analysis of sound alone is not enough to recognize speech accurately.

Per the techniques of this disclosure, a machine transcription model considers probable options for the spoken language and evaluates them based in part on available visual context, which is used with user permission. The visual content, such as an image, is analyzed to determine whether it contains text. If text is detected, optical character recognition (OCR) techniques are applied to recognize it, and the recognized text is used to improve the accuracy of the transcription.
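As a rough illustration of how recognized text might inform the choice among candidate phrases, the following minimal Python sketch re-ranks transcription hypotheses by rewarding words that also appear in the OCR output. The function names (tokenize, rescore), the additive boost, and the example scores are assumptions made for illustration only; the disclosure does not specify a scoring method, and a real system would obtain ocr_text from an actual OCR engine rather than a hard-coded string.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rescore(hypotheses, ocr_text, boost=0.5):
    """Re-rank (phrase, score) pairs using words found in OCR text.

    Hypothetical scoring rule: every word of a hypothesis that also
    appears in the OCR text adds `boost` to its score. Higher scores
    are better. Returns the pairs sorted from best to worst.
    """
    context = set(tokenize(ocr_text))
    rescored = []
    for phrase, score in hypotheses:
        matches = sum(1 for word in tokenize(phrase) if word in context)
        rescored.append((phrase, score + boost * matches))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Illustrative values: two phonetically similar candidates, where the
# sound-only score slightly favors the wrong one.
hypotheses = [
    ("the new cereal number", -1.0),
    ("the new serial number", -1.2),
]

# Text assumed to have been recognized from user-permitted visual context.
ocr_text = "Serial Number: A1234"

best_phrase = rescore(hypotheses, ocr_text)[0][0]
print(best_phrase)
```

In this sketch, the sound-only ranking prefers "cereal", but the on-screen text contains "Serial", so the second candidate is ranked first after rescoring. With no visual text available, the original ranking is left unchanged.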

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.