A system and method are disclosed for training speech transcription models via crowdsourcing. Users of a media sharing platform may view real-time transcriptions of media on their devices and identify the transcriptions as correct or incorrect. Relying on the general context of the conversation being transcribed, users can determine with high accuracy which parts of the transcribed text are correct and which are incorrect. A user may select or mark blocks of transcription text on the user device and label the selected text as a correct or an incorrect transcription. The system may aggregate a large number of marked transcriptions from multiple user devices and store them. The stored marked transcriptions may be used as training data for transcription and captioning models. The disclosed concept may also be extended to machine translation. More accurate training data enables the development of better transcription models.
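
The aggregation step described above might be sketched as follows. This is only an illustration under stated assumptions: the `Mark` record, the `aggregate_marks` function, the vote thresholds, and the majority-vote rule are all hypothetical and are not specified in the disclosure, which says only that marked transcriptions from multiple user devices are aggregated and stored as training data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One user's judgment on a block of transcribed text (hypothetical schema)."""
    media_id: str
    start_word: int    # index of the first word in the marked block
    end_word: int      # index one past the last word in the marked block
    text: str          # the transcribed text the user selected
    is_correct: bool   # True if the user marked the block as correct


def aggregate_marks(marks, min_votes=3, min_agreement=0.8):
    """Group marks by block and keep blocks on which users largely agree.

    min_votes and min_agreement are assumed thresholds, chosen for illustration.
    """
    votes = defaultdict(list)
    for m in marks:
        key = (m.media_id, m.start_word, m.end_word, m.text)
        votes[key].append(m.is_correct)

    examples = []
    for (media_id, start, end, text), labels in votes.items():
        total = len(labels)
        if total < min_votes:
            continue  # too few users marked this block
        n_correct = sum(labels)
        majority = max(n_correct, total - n_correct)
        if majority / total < min_agreement:
            continue  # users disagree too much to trust the label
        examples.append({
            "media_id": media_id,
            "start_word": start,
            "end_word": end,
            "text": text,
            "label": "correct" if n_correct * 2 > total else "incorrect",
            "votes": total,
        })
    return examples
```

In this sketch, blocks labeled "correct" could serve as positive training examples, while blocks labeled "incorrect" could be routed for re-transcription or used as negative examples. How the labels are actually consumed by a transcription or captioning model is left open here, as it is in the disclosure.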

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.