During bug deduplication, selecting the top candidate bugs based on scores from a binary classification model works poorly when the model tends to return low scores. Further, the selected candidates are not ranked by relevance, so a bug cannot be marked as a duplicate without verifying each candidate. This disclosure describes automated techniques to identify ancestor bugs for a newly reported bug, retrieve the top candidate duplicates, and rank the candidates using a scoring function. The techniques remain robust when the pairwise match between a newly reported bug and its ancestor bug fails to meet the threshold score. In such situations, the techniques identify the ancestor bug automatically by relying on comparisons across the pool of bugs (open bugs as well as previously identified duplicates) together with the transitive property of the duplicate relation. The scoring function ranks the identified bugs using features such as the number of duplicates attached to a bug, the number of updates, the score, the number of incorrect duplicate predictions that the bug was part of, and the number of affected users.
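The idea can be illustrated with a minimal Python sketch. It is not the disclosure's implementation: the function names, the `BugStats` fields, the linear form of the scoring function, and the weight values are all assumptions made for illustration. The sketch maps every pool bug that meets the threshold to its ancestor by following duplicate-of links, so an ancestor can be recovered even when its own pairwise score is below the threshold, and then ranks the ancestors with a weighted sum of the features listed above.

```python
from dataclasses import dataclass


@dataclass
class BugStats:
    """Per-bug ranking features (field names are hypothetical)."""
    duplicates: int = 0         # duplicates already attached to the bug
    updates: int = 0            # number of updates on the bug
    wrong_predictions: int = 0  # incorrect duplicate predictions involving the bug
    affected_users: int = 0     # number of affected users


def find_ancestor(bug_id, duplicate_of):
    """Follow duplicate-of links until a bug with no parent is reached."""
    seen = set()  # guards against accidental cycles in the links
    while bug_id in duplicate_of and bug_id not in seen:
        seen.add(bug_id)
        bug_id = duplicate_of[bug_id]
    return bug_id


def candidate_ancestors(scores, duplicate_of, threshold):
    """Map each pool bug scoring at or above the threshold to its ancestor.

    scores: {bug_id: model score of that bug against the new bug}
    Returns {ancestor_id: best score among the bugs that led to it}.
    """
    best = {}
    for bug_id, score in scores.items():
        if score < threshold:
            continue
        root = find_ancestor(bug_id, duplicate_of)
        best[root] = max(best.get(root, 0.0), score)
    return best


# Assumed weights; the disclosure does not specify values or a functional form.
WEIGHTS = {
    "score": 1.0,
    "duplicates": 0.1,
    "updates": 0.02,
    "wrong_predictions": 0.2,  # subtracted as a penalty
    "affected_users": 0.01,
}


def rank_candidates(best, stats, weights=WEIGHTS):
    """Return ancestor ids sorted by a linear scoring function, best first."""
    def relevance(item):
        bug_id, score = item
        s = stats.get(bug_id, BugStats())
        return (weights["score"] * score
                + weights["duplicates"] * s.duplicates
                + weights["updates"] * s.updates
                - weights["wrong_predictions"] * s.wrong_predictions
                + weights["affected_users"] * s.affected_users)

    ordered = sorted(best.items(), key=relevance, reverse=True)
    return [bug_id for bug_id, _ in ordered]


# B3 is a duplicate of B2, which is a duplicate of B1; B5 is a duplicate of B4.
duplicate_of = {"B2": "B1", "B3": "B2", "B5": "B4"}
# Neither ancestor (B1, B4) meets the 0.5 threshold on its own.
scores = {"B1": 0.30, "B3": 0.80, "B4": 0.35, "B5": 0.55, "B6": 0.20}
stats = {
    "B1": BugStats(duplicates=2, updates=5, affected_users=40),
    "B4": BugStats(duplicates=1, updates=1, wrong_predictions=3, affected_users=2),
}
best = candidate_ancestors(scores, duplicate_of, threshold=0.5)
ranking = rank_candidates(best, stats)
```

In this toy data, both ancestors are found only through their duplicates, which is the situation the transitive lookup is meant to handle. A learned ranking model could replace the hand-set weights without changing the retrieval step.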
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Singh, Aman; Gupta, Vidhi; Gupta, Richa; Jain, Vineet; Mishra, Subham; and Elumalai, Babu Prasad, "Automated Techniques to Identify Most Relevant Duplicates for Bug Deduplication", Technical Disclosure Commons, (April 27, 2023)