Are you also handling new files, or the links between sets of files (or packages)? For example, if a user creates a new API command, they will also update the "commands.properties" file. Similarly, if a VO file is updated, a DB migration file will be added as well. Cool work,
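To make the idea of "links between files" concrete, here is a minimal sketch of a co-change count. It is not the prediction model discussed in this thread, only a simple counting baseline. It assumes each commit is available as a list of changed file paths. The function names `co_change_counts` and `confidence` are hypothetical, and the commit list is made up to mirror the release 4.4 example quoted below (3 commits touching both files, 2 touching only the first).

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count per-file changes and how often each unordered pair of files
    appears in the same commit."""
    pair_counts = Counter()
    file_counts = Counter()
    for files in commits:
        unique = sorted(set(files))  # ignore duplicates, fix pair order
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return pair_counts, file_counts


def confidence(pair_counts, file_counts, a, b):
    """Fraction of the commits touching `a` that also touch `b`."""
    if file_counts[a] == 0:
        return 0.0
    pair = tuple(sorted((a, b)))
    return pair_counts[pair] / file_counts[a]


XEN = "cloud/hypervisor/XenServerGuru.java"
VMWARE = "cloud/hypervisor/guru/VMwareGuru.java"

# Illustrative commit list (an assumption, not real repository data):
# 3 commits change both files, 2 commits change only the first one.
commits = [[XEN, VMWARE]] * 3 + [[XEN]] * 2

pair_counts, file_counts = co_change_counts(commits)
print(confidence(pair_counts, file_counts, XEN, VMWARE))  # 3 of 5 commits
```

The same counting would apply to pairs such as a new API command class and "commands.properties", given the commit history as input.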
On Thu, Dec 10, 2015 at 9:21 AM, Igor Wiese <igor.wi...@gmail.com> wrote:

> Hi Sebastien.
>
> We used only 141 commits because we needed data from the issues. Since my
> assumption concerns the contextual information from issues and social
> aspects, we need to combine commits and issues.
>
> First, I collected the issues from JIRA, and then I tried to match them
> with the commits that explicitly mention a collected issue. I also used
> only closed issues, to be confident that the code used to build my models
> had been merged and checked by the community.
>
> That is the weak point of my approach: I need past data from the issues,
> and sometimes it is not available for earlier periods.
> I also plan to use data from GitHub to make the dataset more complete.
>
> All the best,
>
> 2015-12-10 11:22 GMT-02:00 sebgoa <run...@gmail.com>:
>
> > On Dec 10, 2015, at 12:31 AM, Igor Wiese <igor.wi...@gmail.com> wrote:
> >
> > > Hi, CloudStack community.
> > >
> > > My name is Igor Wiese, a PhD student from Brazil. In my research, I am
> > > investigating two important questions: What makes two files change
> > > together? Can we predict when they are going to co-change again?
> > >
> > > I've tried to investigate these questions on the CloudStack project.
> > > I've collected data from issue reports, discussions and commits, and
> > > used some machine learning techniques to build a prediction model.
> > >
> > > I collected a total of 141 commits in which a pair of files changed
> > > together and could correctly predict 60% of the commits.
> >
> > Hi Igor, why 141 commits? Are those the only commits you found in which
> > only a pair of files changed?
> >
> > My gut feeling is that you could check the entire history of the
> > CloudStack repo (~5 years' worth of data) and work on different types of
> > tuples.
> >
> > 141 commits seems like a really small dataset.
> >
> > -Sebastien
> >
> > > These were the most useful pieces of information for predicting
> > > co-changes of files:
> > >
> > > - sum of the number of lines of code added, modified and removed,
> > >
> > > - number of words used to describe and discuss the issues,
> > >
> > > - number of comments in each issue,
> > >
> > > - median value of closeness, a social network measure obtained from
> > >   issue comments, and
> > >
> > > - median value of constraint, a social network measure obtained from
> > >   issue comments.
> > >
> > > To illustrate, consider the following example from our analysis. For
> > > release 4.4, the files "cloud/hypervisor/XenServerGuru.java" and
> > > "cloud/hypervisor/guru/VMwareGuru.java" changed together in 3 commits.
> > > In another 2 commits, only the first file changed, but not the second.
> > > By collecting contextual information for each commit made to the first
> > > file in the previous release (4.3), we were able to predict all 3
> > > commits in which both files changed together in release 4.4, and we
> > > issued 0 false positives. For this pair of files, the most important
> > > contextual information was the number of lines of code added, removed
> > > and modified in each commit, the number of comments in each issue, and
> > > social network measures (closeness, density, constraint, hierarchy)
> > > obtained from issue comments.
> > >
> > > - Do these results surprise you? Can you think of any explanation for
> > >   the results?
> > >
> > > - Do you think that our prediction rate is good enough to be used for
> > >   building tool support for the software community?
> > >
> > > - Do you have any suggestions on what can be done to improve the change
> > >   recommendation?
> > >
> > > You can visit our webpage to inspect the results in detail:
> > > http://flosscoach.com/index.php/17-cochanges/67-cloudstack
> > >
> > > All the best,
> > > Igor Wiese
> > > PhD Candidate
> > >
> >
>
> --
> =================================
> Igor Scaliante Wiese
> PhD Candidate - Computer Science @ IME/USP
> Faculty in Dept. of Computing at Universidade Tecnológica Federal do Paraná
>