
Sean Dague Fri, 18 Oct 2013 05:37:50 -0700

On 10/17/2013 05:34 PM, Stefano Maffulli wrote:

hello folks


first of all: congratulations to all developers, testers, users,
translators, tech writers for the new release: Havana is out of the gate
with impressive numbers.

Speaking of numbers, a lot of you have noticed mistakes in the reported
numbers, from misspelling of names to missing/wrong company
affiliations. With my apologies for the mistakes comes an explanation of
where I see things fail and a suggestion on how to fix this for the future.

Currently there are three places where statistics about the project are
released:

  - OpenStack Activity Board http://activity.openstack.org/
  - gitdm http://git.openstack.org/cgit/openstack-infra/gitdm/
  - Stackalytics http://git.openstack.org/cgit/stackforge/stackalytics/

Activity Board is actually made of two pieces: the Dash and Insights.
Insights pulls straight from the OpenStack Foundation Members db
http://www.openstack.org/community/members/, so what you see in personal
pages like

http://activity.openstack.org/data/plugins/zfacts/view.action?instance=Person,person3986c85a-b9af-4686-8c7b-45525f62e396

is exactly what is written on Robert's personal profile
http://www.openstack.org/community/members/profile/3619 (these
confluence pages are updated daily).

The data about companies on the Dash are the result of semi-automatic
processing and cleanup of the data from OpenStack Foundation Members db.
The cleanup is necessary because a) one can't always rely on people
spelling correctly the name of their company b) the Profile pages lack
the UI to properly track the history of affiliation [1]. Here is what
the Dash looks like for Canonical:

http://activity.openstack.org/dash/releases/company.html?company=Canonical

gitdm and Stackalytics take their developer/company/time tuples from
files maintained by developers themselves compensated by heuristics to
'guess' affiliations from things like email addresses in the commit logs.
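As a rough illustration of the developer/company/time tuples described above, here is a minimal Python sketch. It is not the actual gitdm or Stackalytics code; the data layout, the names `AFFILIATIONS`, `DOMAIN_GUESSES`, and `company_at`, and all sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical developer-maintained records: email -> list of
# (start_date, company) pairs, sorted by start date.
AFFILIATIONS = {
    "dev@example.org": [
        (date(2011, 1, 1), "Example Corp"),
        (date(2013, 3, 1), "Other Inc"),
    ],
}

# Hypothetical fallback heuristic: guess the company from the email domain.
DOMAIN_GUESSES = {"example.com": "Example Corp"}


def company_at(email, when):
    """Return the company a commit author is attributed to on a given date."""
    email = email.lower()
    history = AFFILIATIONS.get(email)
    if history:
        starts = [start for start, _ in history]
        # Index of the last affiliation that started on or before `when`.
        idx = bisect_right(starts, when) - 1
        if idx >= 0:
            return history[idx][1]
    # No explicit record covers the date: fall back to the domain guess.
    domain = email.rpartition("@")[2]
    return DOMAIN_GUESSES.get(domain, "Unknown")
```

Keeping a start date with each affiliation is what lets a commit be attributed to the employer the developer had at the time of the commit rather than to the current one.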

Having four sources of data for this reporting is bad and not sustainable.

Since it seems commonly accepted that all developers need to be members
of the Foundation, and that Foundation members need to state their
affiliation when they join and keep such data current when it changes, I
think the Foundation is in a good place to provide the authoritative
data for all projects to use.

I'm not sure it is well understood that all members have to join the foundation. We don't make that a requirement on someone slinging a patch. It would be nice to know what percentage of ATCs actually are foundation members at the moment (presumably that number is easy to generate?)

The thing is, the Foundation data currently seems to be the least accurate of all the data sets. Also, the lack of affiliation over time is really a problem for this project, especially if one of the driving factors for so much interest in statistics comes from organizations wanting to ensure contributions by their employees get counted. A significant percentage of top contributors to OpenStack have not remained at a single employer over the duration of their contributions to OpenStack, and I expect that to be the norm as the project ages.

Also, both gitdm and stackalytics have active open developer communities (and they are open source all the way down, don't need non-open components to run), so again, I'm not sure why defaulting to the least open platform makes any sense.

Member affiliation in the Foundation database can also only be fixed by the individual. In the other tools people in the know can fix it. It means we get a Wikipedia effect in getting the data more accurate, as you can fix any issue you see, not just your own.

If the foundation member database was its own thing, had a REST API to bulk fetch, and supported temporal associations, and let others propose updates to people's affiliation, then it would be an option. But right now it seems very far from being useful, and is probably the least, not most, accurate version of the world.

We can make things easier by making the personal profile pages more
useful so people login more often and improve quality of data. Fixing
the known shortcomings mentioned above is one step. Furthermore, we're
working to develop an OpenID provider based on the Members DB that will
be used across all our web properties (from gerrit to the upcoming
groups.openstack.org, etc) so those profiles will be used for more than
just for the initial signup to be a member [2].

Since nobody can rely on user input we will still have to 'cleanup' the
data as it comes in from the Members DB in order to create a 'Master
Data Record' that we can export for all to consume. Here things get a
bit fuzzy because currently the Members DB has an API that is not
designed to be securely consumed publicly[3].

What I think we can do is to have a periodic job pulling the full list
of members and their stated affiliation, and run on that an
automatic/manual cleanup/sanitizing job that creates files/tables ready
to be consumed by all projects.
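The automatic part of such a cleanup job might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the alias table `CANONICAL`, the record fields `name` and `company`, and the function names are hypothetical, and nothing here reflects how the Members DB actually exposes its data.

```python
import re

# Hypothetical alias table mapping normalized spellings to one canonical name.
CANONICAL = {
    "red hat": "Red Hat",
    "redhat": "Red Hat",
    "hp": "HP",
    "hewlett packard": "HP",
}


def normalize_company(raw):
    """Map a free-form company string to a canonical name when one is known."""
    # Lowercase and turn punctuation into spaces.
    key = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    # Drop common legal suffixes so "Red Hat, Inc." matches "red hat".
    key = re.sub(r"\b(inc|ltd|llc|corp)\b", " ", key)
    key = " ".join(key.split())
    # Unknown names pass through unchanged, to be reviewed manually.
    return CANONICAL.get(key, raw.strip())


def build_master_records(members):
    """Produce cleaned records from raw member dicts (hypothetical fields)."""
    return [
        {"name": m["name"].strip(), "company": normalize_company(m["company"])}
        for m in members
    ]
```

Names that do not match any alias are left as entered, which keeps the manual step limited to the entries the automatic pass could not resolve.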

What do you think? I'm interested in gathering more ideas and laying down a
plan to fix this issue.

thanks,
stef


[1]  To improve problem A the system suggests proper spelling when you
start typing. For problem B there is a fix coming to the site.

[2] I'll send more details about this project soon
https://blueprints.launchpad.net/openstack-ci/+spec/sso-openid-provider

[3] The Members DB is tightly connected to the web site openstack.org.
There is an effort to move the whole site under openstack-infra/ so this
pain point will be removed soon, hopefully.

PS did you look at the numbers?
http://www.openstack.org/software/havana/
http://blog.bitergia.com/2013/10/17/the-openstack-havana-release/



--
Sean Dague
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [Metrics] Improving the data about contributor/affiliation/time
