Hi Igor,

It's an interesting direction to study tickets/commits in the Hadoop community.
A research group from Univ. Wisconsin did a similar study on Linux file systems, and I found it quite insightful:
http://research.cs.wisc.edu/wind/Publications/fsstudy-tos14.pdf

For your results, could you elaborate on why you picked "co-change" as the metric, and how the "co-change" predictions could be used to improve software tools?

Thanks,
Zhe

On Mon, Dec 14, 2015 at 3:01 PM, Igor Wiese <igor.wi...@gmail.com> wrote:

> Hi, Hadoop Community.
>
> My name is Igor Wiese, PhD student from Brazil. I sent an email a week
> ago about my research. We received some visits to inspect the results,
> but no feedback was provided.
>
> I am investigating two important questions: What makes two files
> change together? Can we predict when they are going to co-change
> again?
>
> I've tried to investigate these questions on the Hadoop project. I've
> collected data from issue reports, discussions and commits, and used
> some machine learning techniques to build a prediction model.
>
> I collected a total of 950 commits in which a pair of files changed
> together and could correctly predict 47% of the commits. These were
> the most useful pieces of information for predicting co-changes of
> files:
>
> - sum of number of lines of code added, modified and removed,
>
> - number of words used to describe and discuss the issues,
>
> - median value of closeness, a social network measure obtained from
> issue comments,
>
> - median value of constraint, a social network measure obtained from
> issue comments, and
>
> - median value of hierarchy, a social network measure obtained from
> issue comments.
>
> To illustrate, consider the following example from our analysis. For
> release 0.22, the files "/ipc/Client.java" and
> "security/SecurityUtil.java" changed together in 3 commits. In another
> 1 commit, only the first file changed, but not the second.
> Collecting contextual information for each commit made to the first
> file in the previous release, we were able to predict 2 commits in
> which both files changed together in release 0.22, and we only issued
> 1 wrong prediction. For this pair of files, the most important
> contextual information was the social network metrics (density,
> hierarchy, efficiency) obtained from issue comments.
>
> - Do these results surprise you? Can you think of any explanation for
> the results?
>
> - Do you think that our rate of prediction is good enough to be used
> for building tool support for the software community?
>
> - Do you have any suggestions on what can be done to improve the
> change recommendation?
>
> You can visit our webpage to inspect the results in detail:
> http://flosscoach.com/index.php/17-cochanges/70-hadoop
>
> All the best,
> Igor Wiese
>
> PhD Candidate
>
> --
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná