This disclosure presents a system and method for selecting unsupervised data for speech processing using the distribution of supervised data in an embedded space. The two data sets are shown in different colors to distinguish supervised from unsupervised utterances. The system samples a set of utterances from the unsupervised data such that the distribution of the unsupervised sample matches the distribution of the supervised utterances. The sampling method bins each data set into a two-dimensional histogram and normalizes each bin count by the size of the corresponding data set. The normalized histograms are then used to select unsupervised utterances so that the distribution of the selected data closely matches the distribution of the supervised data set. The system and method generate useful unsupervised data sets that could help train speech recognition models effectively.
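
The histogram-based selection described above can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes each utterance has already been mapped to a two-dimensional point in the embedded space, and the function name `sample_matching`, the per-bin quota rule, and the parameters `n_samples`, `bins`, and `seed` are hypothetical choices made for illustration.

```python
import numpy as np

def sample_matching(sup_xy, unsup_xy, n_samples, bins=20, seed=0):
    """Return indices of unsupervised points whose 2-D histogram
    roughly follows the supervised one (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    both = np.vstack([sup_xy, unsup_xy])
    # Shared bin edges so both data sets are binned on the same grid.
    x_edges = np.linspace(both[:, 0].min(), both[:, 0].max(), bins + 1)
    y_edges = np.linspace(both[:, 1].min(), both[:, 1].max(), bins + 1)
    sup_hist, _, _ = np.histogram2d(
        sup_xy[:, 0], sup_xy[:, 1], bins=[x_edges, y_edges]
    )
    # Normalize bin counts by the size of the supervised data set.
    target = sup_hist / len(sup_xy)
    # Bin index of every unsupervised point on the same grid.
    ix = np.clip(np.searchsorted(x_edges, unsup_xy[:, 0], side="right") - 1, 0, bins - 1)
    iy = np.clip(np.searchsorted(y_edges, unsup_xy[:, 1], side="right") - 1, 0, bins - 1)
    chosen = []
    for bx, by in zip(*np.nonzero(target)):
        members = np.flatnonzero((ix == bx) & (iy == by))
        # Assumed rule: each bin gets a quota proportional to its target mass,
        # capped by how many unsupervised points the bin actually holds.
        quota = int(round(target[bx, by] * n_samples))
        take = min(quota, len(members))
        if take > 0:
            chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(chosen, dtype=int)
```

Because quotas are rounded per bin and capped by availability, the number of returned indices may differ slightly from `n_samples`; a bin that is dense in the supervised data but sparse in the unsupervised pool will be underfilled.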

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.