Abstract

Machine classifiers are typically trained using labeled data sets. If the training data set has categories of objects that naturally co-occur, the machine classifier may have difficulty distinguishing those categories. For example, audio streams often contain sounds that occur simultaneously, e.g., speech and laughter. In this example, the different sounds are the objects to be classified. A machine classifier trained on such audio streams can generate false positives, e.g., conflate speech with laughter, if the training data set does not label speech separately from laughter. The difficulty of obtaining well-labeled training sets compounds the problem of misclassification. For example, most transcriptions of audio streams that contain laughter also include speech in close proximity, since laughter often occurs just after speech, e.g., at the end of a joke. Furthermore, human annotators who produce training data typically annotate rather long audio segments at once, without specifying precise times for each word or audio event, so segments that contain laughter are typically labeled with both “speech” and “laughter” without indicating exactly when each occurred. This disclosure describes techniques to improve classification accuracy that are applicable to machine classifiers that act on any type of data, e.g., video, documents, or images.
