In online search or content selection systems, significant computational resources are expended to classify electronic documents into topics, concepts, or entities drawn from a taxonomy. A classifier can parse or otherwise analyze a document and assign it one or more labels from the taxonomy. The classifier can generate a score for each label and provide the labels and scores to other components or modules for downstream processing. To keep downstream processing efficient, the classifier may return only a subset of the labels, chosen by comparing each label's score with a threshold. However, threshold-based filtering may not account for the tree structure of the taxonomy, and it may fail to take into account the likelihood dependencies between parent nodes and child nodes. The proposed technique addresses these limitations by (1) selecting the set of labels returned by the classifier that optimizes certain metrics, such as precision and recall; and (2) using a greedy multi-label selection algorithm to carry out the optimization in step (1). Using these techniques, the system can select a subset of labels to return or provide for further processing without relying on a fixed score threshold.
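The abstract does not spell out the objective or the greedy rule, so the following is only a minimal sketch of how threshold-free greedy selection over a taxonomy might look. It rests on several assumptions that are not in the source: each score is treated as the probability that the label is correct, precision and recall are combined into an approximate expected F1, and a child label may be selected only after its parent. The names `expected_f1`, `greedy_select`, `prob`, and `parent` are hypothetical.

```python
from typing import Dict, Optional, Set


def expected_f1(selected: Set[str], prob: Dict[str, float]) -> float:
    """Approximate expected F1 as 2*E[TP] / (|selected| + E[#relevant]).

    Assumption: prob[label] is the probability that the label is correct.
    """
    if not selected:
        return 0.0
    expected_tp = sum(prob[label] for label in selected)
    expected_relevant = sum(prob.values())
    return 2.0 * expected_tp / (len(selected) + expected_relevant)


def greedy_select(prob: Dict[str, float],
                  parent: Dict[str, Optional[str]]) -> Set[str]:
    """Greedily grow a label set while the approximate expected F1 improves.

    Assumption: a label is eligible only if it has no parent or its parent
    has already been selected, so the result respects the taxonomy tree.
    """
    selected: Set[str] = set()
    best = 0.0
    while True:
        candidates = [
            label for label in prob
            if label not in selected
            and (parent.get(label) is None or parent[label] in selected)
        ]
        if not candidates:
            break
        # Pick the eligible label whose addition gives the highest objective.
        top = max(candidates, key=lambda l: expected_f1(selected | {l}, prob))
        score = expected_f1(selected | {top}, prob)
        if score <= best:
            break  # no remaining label improves the objective
        selected.add(top)
        best = score
    return selected


if __name__ == "__main__":
    # Hypothetical scores and taxonomy, for illustration only.
    prob = {"Sports": 0.9, "Sports/Tennis": 0.8, "Sports/Golf": 0.1, "Arts": 0.2}
    parent = {"Sports": None, "Sports/Tennis": "Sports",
              "Sports/Golf": "Sports", "Arts": None}
    print(sorted(greedy_select(prob, parent)))
```

In this sketch the stopping point comes from the objective itself rather than from a tuned score cutoff: a label is added only while it raises the estimated precision/recall trade-off for the whole set.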
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Lin, Hsin-yi; Milch, Brian; Adar, Michel; Fang, Scot; and van de Veerdonk, Rene, "Threshold-free Selection of Taxonomic Multilabels", Technical Disclosure Commons, (November 04, 2016)