Audio transcription systems typically require the spoken language to be specified explicitly so that a language-specific transcription technique can be applied. When the spoken language is not indicated, one possible approach is to process the audio with every available transcription technique and choose the transcription with the highest confidence. Such an approach is computationally expensive and does not scale to a large number of natural languages. This disclosure describes automatic identification of the natural language being spoken in audio input. The audio is processed using a trained machine learning model that outputs a language code corresponding to the language being spoken. This two-step approach, in which a language identification step precedes the transcription step, supports a large number of natural languages without incurring the computational cost, latency, and inaccuracy of running multiple transcription techniques in parallel.
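
The contrast between the two approaches can be illustrated with a minimal Python sketch. Nothing here is taken from the disclosure's implementation: the function names, the `(text, confidence)` return shape, the language codes, and the stub transcribers and identifier are all hypothetical stand-ins chosen for illustration. In particular, `stub_identify_language` merely reads a tag from fake audio bytes where a real system would run the trained model.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types: audio is raw bytes; a transcriber returns (text, confidence).
Transcriber = Callable[[bytes], Tuple[str, float]]
LanguageIdentifier = Callable[[bytes], str]


def transcribe_brute_force(
    audio: bytes, transcribers: Dict[str, Transcriber]
) -> Tuple[str, str, int]:
    """Baseline: run every transcriber and keep the most confident result."""
    best_code, best_text, best_conf = "", "", float("-inf")
    calls = 0
    for code, transcriber in transcribers.items():
        text, conf = transcriber(audio)
        calls += 1
        if conf > best_conf:
            best_code, best_text, best_conf = code, text, conf
    return best_code, best_text, calls


def transcribe_two_step(
    audio: bytes,
    identify_language: LanguageIdentifier,
    transcribers: Dict[str, Transcriber],
) -> Tuple[str, str, int]:
    """Identify the language first, then run only the matching transcriber."""
    code = identify_language(audio)
    if code not in transcribers:
        raise KeyError(f"no transcriber registered for language code {code!r}")
    text, _conf = transcribers[code](audio)
    return code, text, 1  # exactly one transcription call


def make_stub_transcriber(code: str) -> Transcriber:
    """Stub transcriber (assumption): confident only on audio tagged with its code."""
    def transcriber(audio: bytes) -> Tuple[str, float]:
        matches = audio.startswith(code.encode())
        return f"<{code} transcript>", 0.9 if matches else 0.1
    return transcriber


def stub_identify_language(audio: bytes) -> str:
    """Stand-in for the trained model: reads a two-letter tag from the fake audio."""
    return audio[:2].decode()


# Hypothetical registry of language-specific transcribers keyed by language code.
TRANSCRIBERS: Dict[str, Transcriber] = {
    code: make_stub_transcriber(code) for code in ("en", "es", "hi")
}

if __name__ == "__main__":
    fake_audio = b"es:fake-audio-bytes"
    print(transcribe_brute_force(fake_audio, TRANSCRIBERS))
    print(transcribe_two_step(fake_audio, stub_identify_language, TRANSCRIBERS))
```

In this sketch the third element of each result counts transcription calls. The brute-force path makes one call per registered language, while the two-step path always makes one, which is the scaling argument made above. A real system would also need a policy for audio whose identified language has no registered transcriber; the sketch simply raises an error.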

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.