On 01/03/2014 12:09 PM, James E. Blair wrote:
> Sean Dague <s...@dague.net> writes:
>>>> So my feeling is that we should move away from the point graphs we
>>>> have and instead present these as weekly and daily failure rates
>>>> (with graphs and error bars), sliced per job. My suggestion is that
>>>> we do the actual visualization with matplotlib, because it's super
>>>> easy to output that from pandas data sets.
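To make that concrete, here's a minimal sketch of the kind of plot I
mean, assuming a hypothetical pandas DataFrame of test runs with a
DatetimeIndex, a 'job' column, and a boolean 'failed' column:

import matplotlib.pyplot as plt
import numpy as np

def plot_failure_rates(runs, freq='W'):
    fig, ax = plt.subplots()
    for job, group in runs.groupby('job'):
        n = group['failed'].resample(freq).count()  # runs per bucket
        k = group['failed'].resample(freq).sum()    # failures per bucket
        rate = k / n
        err = np.sqrt(rate * (1 - rate) / n)        # binomial std error
        ax.errorbar(rate.index, rate, yerr=err, label=job)
    ax.set_ylabel('failure rate')
    ax.legend()
    fig.savefig('failure-rates.png')

The error bars there are just the binomial standard error, so weeks
with few runs show up as visibly noisy rather than as false trends.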
>>> I am very excited about this and everything above it!
>>>> = Take over of /recheck =
>>>>
>>>> There is still a bunch of useful data coming in as "recheck bug ####"
>>>> comments which hasn't been curated into ER queries. I think the right
>>>> thing to do is treat these as a work queue of bugs we should be
>>>> building patterns out of (or completely invalidating). I've got a
>>>> preliminary gerrit bulk query piece of code that does this, which
>>>> would remove the need for the daemon we currently run. The gerrit
>>>> queries are a little long right now, but I think that if we are only
>>>> doing this on an hourly cron, the additional load will be negligible.
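For flavor, the bulk query is roughly this shape -- this isn't the
actual code, and the query string is only illustrative:

import json
import re
import subprocess

RECHECK = re.compile(r'recheck bug (\d+)', re.IGNORECASE)

def uncategorized_rechecks(age='1h'):
    # -age:1h matches changes modified within the last hour
    cmd = ['ssh', '-p', '29418', 'review.openstack.org',
           'gerrit', 'query', '--format=JSON', '--comments',
           'project:^openstack/.* -age:' + age]
    bugs = set()
    for line in subprocess.check_output(cmd).splitlines():
        change = json.loads(line)
        for comment in change.get('comments', []):
            m = RECHECK.search(comment['message'])
            if m:
                bugs.add(m.group(1))
    return bugs

Cross-reference the result against the bugs we already have queries
for, and what's left over is the work queue.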
>>> I think this is fine and am all for reducing complexity, but consider
>>> this alternative: over the break, I moved both components of
>>> elastic-recheck onto a new server (status.openstack.org). Since they
>>> are now co-located, you could have the component of e-r that watches
>>> the stream to provide responses to gerrit also note recheck actions.
>>> You could stick the data in a file, memcache, a trove database, etc.,
>>> and the status page could display that "work queue". No extra daemons
>>> required.
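For reference, I think the stream-watcher version would be something
like this inside the existing comment-added handling (the file path
and names here are hypothetical):

import json
import re
import time

RECHECK = re.compile(r'recheck bug (\d+)', re.IGNORECASE)
QUEUE_FILE = '/var/lib/elastic-recheck/uncategorized.json'  # hypothetical

def on_comment_added(event, known_bugs):
    # event is the gerrit stream-events comment-added payload
    m = RECHECK.search(event.get('comment', ''))
    if not m or m.group(1) in known_bugs:
        return
    with open(QUEUE_FILE, 'a') as f:
        f.write(json.dumps({'bug': m.group(1),
                            'change': event['change']['number'],
                            'seen': time.time()}) + '\n')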
>> So I've got the bulk query written, which means we could have this by
>> the end of next week with the approach I've got. I think that handling
>> the rest of this is an optimization.
> I think the main user-visible aspect of this decision is the delay
> before unprocessed bugs are made visible. If a bug starts affecting a
> number of jobs, it might be nice to see what bug numbers people are
> using for rechecks without waiting for the next cron run.
So my experience is that most rechecks happen > 1 hr after a patch
fails. And the people who are sitting on patches for bugs that have
never been seen before find their way to IRC.
The current state of the world is not all roses and unicorns. The
recheck daemon has died before, and nobody noticed it was dead for
*weeks*. So a guarantee that we are only 1 hr delayed would actually
be better, on average, than the delays we've seen over the last six
months of following the event stream.
And again, we can optimize that over time.
I also think that caching should probably actually happen in gerritlib
itself. There is a concern that too many things are hitting gerrit, and
the result is that everyone is implementing their own client-side
caching to try to be nice (like the pickles in Russell's review stats
programs). Scattering it across consumers seems like the wrong place to
be doing it.
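As a strawman, something like this living in gerritlib rather than in
every consumer (a sketch, assuming gerritlib's bulk_query()):

import hashlib
import os
import pickle
import time

CACHE_DIR = os.path.expanduser('~/.cache/gerritlib')

def cached_bulk_query(gerrit, query, ttl=3600):
    # Reuse an on-disk copy of a bulk_query() result if it's fresh.
    key = hashlib.sha1(query.encode('utf-8')).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < ttl:
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = gerrit.bulk_query(query)
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

Then tools like reviewstats would get polite-to-gerrit behavior for
free instead of each growing their own pickle files.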
But part of the reason for this email was to sort these sorts of issues
out, so let me know if you think the caching issue is an architectural
blocker.
Because if we're generally agreed on the architecture going forward and
are just reviewing for correctness, the code can move fast, and we can
actually have ER 1.0 by the end of the month. Architecture review in
gerrit is where we grind to a halt.
-Sean
--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev