Hi Igor,

It's an interesting direction to study tickets/commits in the Hadoop
community.

A research group from the University of Wisconsin did a similar study on Linux file
systems and I found it quite insightful:
http://research.cs.wisc.edu/wind/Publications/fsstudy-tos14.pdf

For your results, could you elaborate on why you picked "co-change" as the
metric, and how the "co-change" predictions could be used to improve software tools?

Thanks,
Zhe

On Mon, Dec 14, 2015 at 3:01 PM, Igor Wiese <igor.wi...@gmail.com> wrote:

> Hi, Hadoop Community.
>
> My name is Igor Wiese, a PhD student from Brazil. I sent an email a week
> ago about my research. We received some visits to inspect the results,
> but no feedback was provided.
>
> I am investigating two important questions: What makes two files
> change together? Can we predict when they are going to co-change
> again?
>
> I've tried to investigate these questions on the Hadoop project. I've
> collected data from issue reports, discussions and commits, and used
> some machine learning techniques to build a prediction model.
>
>
> I collected a total of 950 commits in which a pair of files changed
> together and could correctly predict 47% of the commits. These were the
> most useful pieces of information for predicting co-changes of files:
>
> - sum of number of lines of code added, modified and removed,
>
> - number of words used to describe and discuss the issues,
>
> - median value of closeness, a social network measure obtained from
> issue comments,
>
> - median value of constraint, a social network measure obtained from
> issue comments, and
>
> - median value of hierarchy, a social network measure obtained from
> issue comments.
>
> To illustrate, consider the following example from our analysis. For
> release 0.22, the files "/ipc/Client.java" and
> "security/SecurityUtil.java" changed together in 3 commits. In 1 other
> commit, only the first file changed, but not the second. Collecting
> contextual information for each commit made to the first file in the
> previous release, we were able to predict 2 commits in which both
> files changed together in release 0.22, and we issued only 1 wrong
> prediction. For this pair of files, the most important contextual
> information was the social network metrics (density, hierarchy,
> efficiency) obtained from issue comments.
>
>
> - Do these results surprise you? Can you think of any explanation for
> the results?
>
> - Do you think that our rate of prediction is good enough to be used
> for building tool support for the software community?
>
> - Do you have any suggestions on what can be done to improve the change
> recommendation?
>
> You can visit our webpage to inspect the results in detail:
> http://flosscoach.com/index.php/17-cochanges/70-hadoop
>
> All the best,
> Igor Wiese
>
> PhD Candidate
>
> --
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná
>
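
To make the worked example in the quoted message concrete, here is a minimal
Python sketch of how co-changes for a pair of files could be counted and how
a prediction rate could be scored. It is not the method used in the study.
The commit list is hypothetical and only mirrors the counts given for
release 0.22, the helper names count_cochanges and precision_recall are made
up for illustration, and reading the "1 wrong prediction" as a false positive
is an assumption.

```python
from collections import Counter
from itertools import combinations


def count_cochanges(commits):
    """Count, for each unordered pair of files, the commits touching both."""
    pair_counts = Counter()
    for files in commits:
        # De-duplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return pair_counts


def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


CLIENT = "/ipc/Client.java"
SECURITY = "security/SecurityUtil.java"

# Hypothetical commits shaped like the release 0.22 example:
# three commits change both files, one changes only the first.
commits_0_22 = [
    [CLIENT, SECURITY],
    [CLIENT, SECURITY],
    [CLIENT, SECURITY],
    [CLIENT],
]

cochanges = count_cochanges(commits_0_22)[(CLIENT, SECURITY)]

# Assumption: 2 correct predictions and 1 false positive, so the
# remaining actual co-changes count as missed (false negatives).
precision, recall = precision_recall(
    true_pos=2, false_pos=1, false_neg=cochanges - 2
)
print(cochanges, round(precision, 2), round(recall, 2))
```

Reporting precision and recall separately, rather than a single percentage,
would show whether a tool built on these predictions tends to raise false
alarms or to miss real co-changes.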
