Request for research feedback

Fulvio Valente Thu, 15 Sep 2011 09:24:20 -0700

Hi, I am a summer research intern at the University of Strathclyde. I hope 
that this is not an inappropriate place to ask, but I am looking for 
participants willing to use and evaluate an application that was written as 
part of this internship. If you choose to participate, it should take no more 
than an hour or two of your time, unless you choose not to use the analysis 
files provided.


My project is to experiment with ways of inferring and displaying 
socio-technical information about a software project from a version control 
repository by creating a prototype application. The application infers 
authorship information in the form of "closeness scores" between authors and 
the files they have worked on by traversing the history of a Mercurial 
repository. The scores are then displayed as a map of socio-technical 
dependencies, centred either on a file (showing the authors that worked on it) 
or on an author (showing the files they have worked on), which should allow 
for easy visual comparisons of closeness.

Some potential applications and use cases I can envision in the program's 
current form include discovering who the key figures are (i.e. who to talk to 
about a problem) or what the bus factor (how many people need to be "hit by a 
bus" before a project is in trouble because there's nobody left who understands 
it) is for a project or a subsystem within it. Perhaps a little more 
depressingly, this may also be used to highlight potential cases of poor 
productivity to be investigated.
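
To make the "key figures" and "bus factor" ideas concrete, here is a minimal Python sketch. It is not Jimmy's code: the function names, the (author, file) score dictionary and the 50% threshold are all assumptions made for illustration, and other definitions of bus factor are equally reasonable.

```python
def key_figures(scores, path):
    """Return (author, score) pairs for one file, closest author first.

    `scores` is a hypothetical dict mapping (author, path) to a closeness score.
    """
    authors = [(a, s) for (a, p), s in scores.items() if p == path]
    return sorted(authors, key=lambda pair: pair[1], reverse=True)


def bus_factor(scores, paths, share=0.5):
    """Smallest number of top authors holding at least `share` of the total
    closeness score over `paths` (an assumed definition, not Jimmy's)."""
    totals = {}
    for (author, path), score in scores.items():
        if path in paths:
            totals[author] = totals.get(author, 0.0) + score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0
    covered = 0.0
    ranked = sorted(totals.values(), reverse=True)
    for count, score in enumerate(ranked, start=1):
        covered += score
        if covered >= share * grand_total:
            return count
    return len(ranked)
```

A low result for a subsystem would suggest that knowledge of it is concentrated in very few people.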

The program itself, Jimmy (for lack of a better name), has a binary 
distribution which can be found at 
http://geeksoc.org/~fvalente/jimmy/jimmy_r1.zip. It only requires Java 7 and 
Mercurial (other libraries are included), so you can get started quickly. The 
source is available at https://bitbucket.org/fvalente/jimmy. If you wish to 
build it yourself, it depends on Guava r09 and JUNG 2.0.1 (api, algorithms, 
graph-impl, visualization), which in turn depends on Apache Commons 
collections-generic 4.0.1.

To perform a basic analysis of a project, open a directory that is a Mercurial 
repository. Jimmy will look only at the list of commits and the files that 
changed in each, adding 1 to a score each time an author touches a file. This 
should only take a minute or two, even for large projects. If you have more 
time, you can do the more expensive diff stats analysis, which compares the 
size of each diff with the average diff size of the project, excluding empty 
diffs and binary changes. Unfortunately, the diff stats analysis is very slow 
because retrieving each diff requires spawning an hg subprocess (for 
reference, my 4 year old quad-core machine can do only ~10,000 commits per 
hour). I don't have a progress UI yet, but progress status is sent to stdout 
when doing a diff stats analysis. Once analysis is complete, you can save the 
results and review them later by using the "Open analysis results" option.
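
For readers who prefer code to prose, the two scoring modes can be sketched roughly as below. This is an illustrative Python approximation, not Jimmy's actual Java implementation: the function names and input shapes are hypothetical, and the assumption that each diff contributes its size divided by the project's mean diff size is a guess at what "compares with the average" means.

```python
from collections import defaultdict


def basic_scores(commits):
    """Basic analysis: add 1 each time an author touches a file.

    `commits` is an iterable of (author, [paths]) pairs (assumed shape).
    """
    scores = defaultdict(float)
    for author, paths in commits:
        for path in set(paths):  # count a file once per commit
            scores[(author, path)] += 1
    return dict(scores)


def diff_stat_scores(changes):
    """Diff stats analysis (assumed formula): weight each change by
    diff size / mean diff size.

    `changes` is an iterable of (author, path, diff_size) triples, where
    diff_size is None for binary changes and 0 for empty diffs.
    """
    # Exclude empty diffs and binary changes, as described above.
    usable = [(a, p, n) for a, p, n in changes if n]
    if not usable:
        return {}
    mean_size = sum(n for _, _, n in usable) / len(usable)
    scores = defaultdict(float)
    for author, path, size in usable:
        scores[(author, path)] += size / mean_size
    return dict(scores)
```

Under this assumed weighting, a commit with an average-sized diff counts the same as one touch in the basic analysis, while larger diffs count for more.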

To navigate the results you can switch to viewing a particular file or author 
of interest from the file and author lists on the left. For files, this will 
display that file as a node in the centre with the authors that have been 
involved with it as orbiting nodes. As the closeness score between an author 
and the file increases, the connecting line becomes shorter, thicker and 
closer to red. For authors, it is the same except the files they have worked 
on will be the orbiting nodes. You can also navigate directly to the display 
for an orbiting node by clicking it, rather than searching through the file or 
author lists. The display can be zoomed with the scroll wheel and panned with 
the scroll bars or by dragging on an area not occupied by a node.
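
The mapping from closeness score to line appearance can be pictured with a small Python sketch. The linear interpolation, the pixel ranges and the black starting colour are all assumptions chosen for illustration; the actual ranges and scaling Jimmy uses are not stated here.

```python
def edge_style(score, max_score,
               min_len=50.0, max_len=300.0,
               min_width=1.0, max_width=6.0):
    """Map a closeness score to (length, width, (r, g, b)).

    All parameter names and default ranges are hypothetical.
    """
    # Normalise the score to the range 0..1, guarding against max_score <= 0.
    t = 0.0 if max_score <= 0 else max(0.0, min(1.0, score / max_score))
    length = max_len - t * (max_len - min_len)      # closer means shorter
    width = min_width + t * (max_width - min_width)  # closer means thicker
    colour = (round(255 * t), 0, 0)                  # assumed black to red
    return length, width, colour
```

With these defaults, the closest author or file gets the shortest, thickest, reddest line, and a score of zero gets the longest, thinnest, darkest one.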

What I would like is for you to please run Jimmy on one or more Mercurial 
repositories of your choice and to give some feedback on it. Some questions I'd 
particularly like answered are:

* Do the closeness scores it produces match with your perception of the 
relationships between people and code in the project? (e.g. if you're looking 
at a particular file and some authors involved in it are shown as closer than 
others, is this the result you would have expected from a perfect version of 
Jimmy?)
* Does the visualisation of the scores substantially improve your ability to 
draw conclusions from the data compared to just reading a saved analysis (which 
is just plaintext)?
* If, hypothetically, you had no prior knowledge about the project, would using 
it help you to discover the key figures (e.g. maintainer, BDFL) behind the 
project or any of its subsystems? (Alternatively, do such people correctly show 
up as having touched a wider variety of files and with closer relations to them 
than other people?)
* If you were a manager, would you be able to use it to discover potential 
productivity issues that you would then investigate further?

To help save you time from having to do a full analysis of a project, I have 
supplied analysis files from 3 open-source projects which you can open with the 
"Open analysis results" option:

* cpython: http://geeksoc.org/~fvalente/jimmy/cpython.txt
* libSDL: http://geeksoc.org/~fvalente/jimmy/libsdl.txt
* Go: http://geeksoc.org/~fvalente/jimmy/golang.txt

Some general suggestions on whether and why the current ways of inferring 
closeness scores and visualising that data are flawed would also be greatly 
appreciated, as well as potential avenues to explore for improving them. 
Suggestions I've already received include:

* Being able to collapse files into folders or subsystem groups to make larger 
projects more navigable, perhaps with autoexpansion when zooming the display. 
In its current form, Jimmy produces disappointing/funny results when you want 
to see a diagram for, say, a large project's maintainer.
* Being able to mine data from a subset of the repository (time range, revision 
range, include/exclude files/directories, etc.)
* Reducing the closeness score contributions of multiple commits made in quick 
succession, or another method of mitigating the bias in favour of fine-grained 
committers
* Reducing the closeness score contributions of older commits
* Interfacing with Mercurial via the recently introduced command server API, 
which should hopefully make performance non-abysmal
* Support for more version control systems. Git would top this list
* Perhaps the ability to see a timeline for the project and how closeness 
changes over time
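
Two of the suggestions above, down-weighting older commits and damping bursts of commits made in quick succession, could be prototyped along the following lines. This Python sketch is only one possible reading of those suggestions: the exponential half-life, the one-hour window, the function names and the (author, timestamp, paths) input shape are all assumptions, not part of Jimmy.

```python
def decayed_scores(commits, now, half_life_days=365.0):
    """Older commits count for less: the weight halves every half-life.

    `commits` is an iterable of (author, unix_timestamp, [paths]) triples.
    """
    scores = {}
    for author, timestamp, paths in commits:
        age_days = max(0.0, (now - timestamp) / 86400.0)
        weight = 0.5 ** (age_days / half_life_days)
        for path in set(paths):
            key = (author, path)
            scores[key] = scores.get(key, 0.0) + weight
    return scores


def dampen_bursts(commits, window=3600):
    """Count a touch only if the same author last touched the same file
    more than `window` seconds earlier, so rapid runs of commits count once."""
    last_seen = {}
    scores = {}
    for author, timestamp, paths in sorted(commits, key=lambda c: c[1]):
        for path in set(paths):
            key = (author, path)
            previous = last_seen.get(key)
            if previous is None or timestamp - previous > window:
                scores[key] = scores.get(key, 0) + 1
            last_seen[key] = timestamp
    return scores
```

Both functions return the same (author, file) score shape as the basic analysis, so either could be swapped in before visualisation to see how the picture changes.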

Responses can be made privately to me, if you wish. For the purposes of my 
report, I will also anonymise all responses received in line with ethical best 
practices. Thank you for reading.
-- 
http://mail.python.org/mailman/listinfo/python-list
