Maintenance is a dominant component of software cost, and localizing reported defects is a significant part of maintenance. We propose a scalable approach that leverages the natural language present in both defect reports and source code to identify files that are potentially related to the defect in question. Our technique is language-independent and does not require test cases. The approach represents reports and code as separate structured documents and ranks source files based on a document similarity metric that leverages inter-document relationships. We evaluate the fault-localization accuracy of our method against both lightweight baseline techniques and reported results from state-of-the-art tools. In an empirical evaluation of 5,345 historical defects from programs totaling 6.5 million lines of code, our approach reduced the number of files inspected per defect by over 91%. Additionally, we qualitatively and quantitatively examine the utility of the textual and surface features used by our approach.
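To make the general idea of ranking source files by textual similarity to a defect report concrete, the following is a minimal sketch. It is not the similarity metric proposed in this work, which operates on structured documents and uses inter-document relationships. It assumes plain TF-IDF weighting with cosine similarity as a stand-in, and the function names (`tokenize`, `tfidf_vectors`, `rank_files`), the file names, and the report text are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, breaking camelCase and snake_case."""
    # Insert a space at lower-to-upper transitions: "parseHeader" -> "parse Header".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per input text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(report, files):
    """Return (file name, score) pairs, most similar to the report first."""
    names = sorted(files)
    vectors = tfidf_vectors([report] + [files[name] for name in names])
    query, file_vectors = vectors[0], vectors[1:]
    scored = [(name, cosine(query, v)) for name, v in zip(names, file_vectors)]
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical inputs, invented purely for illustration.
files = {
    "net/HttpParser.java": "class HttpParser { void parseHeader(String line) { } }",
    "ui/LoginDialog.java": "class LoginDialog { void showPasswordPrompt() { } }",
    "util/StringUtils.java": "class StringUtils { static String trim(String s) { return s; } }",
}
report = "Crash when parsing a malformed HTTP header line"

for name, score in rank_files(report, files):
    print(f"{score:.3f}  {name}")
```

Because the sketch treats code purely as text, it needs neither test cases nor a language-specific parser, which mirrors the two properties claimed above. A developer would inspect files in the printed order and stop once the defective file is found.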

Authors: Zachary P. Fry; Westley Weimer

Source: https://archive.org/

