On 01/03/2014 12:09 PM, James E. Blair wrote:
Sean Dague <s...@dague.net> writes:

So my feeling is we should move away from the point graphs we have,
and present these as weekly and daily failure rates (with graphs and
error bars). And slice those per job. My suggestion is that we do the
actual visualization with matplotlib because it's super easy to output
that from pandas data sets.
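A minimal sketch of the kind of per-job weekly aggregation this implies, assuming a hypothetical DataFrame of test runs with a boolean `failed` column (all names and data here are illustrative, not the actual ER schema):

```python
import numpy as np
import pandas as pd

# Illustrative input: one row per test run.
runs = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2014-01-01", "2014-01-02", "2014-01-08", "2014-01-09"]),
    "job": ["gate-tempest-dsvm-full"] * 4,
    "failed": [True, False, False, True],
})

# Weekly failure rate per job, plus a simple binomial standard
# error to draw as error bars.
weekly = (runs.groupby(["job", pd.Grouper(key="timestamp", freq="W")])
              ["failed"].agg(["mean", "count"]))
weekly["stderr"] = np.sqrt(
    weekly["mean"] * (1 - weekly["mean"]) / weekly["count"])

# Plotting with matplotlib would then be roughly:
# weekly["mean"].unstack(0).plot(yerr=weekly["stderr"].unstack(0))
```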

I am very excited about this and everything above it!

= Takeover of /recheck =

There is still a bunch of useful "recheck bug ####" data coming in
which hasn't been curated into ER queries. I think the right thing to
do is treat these as a work queue of bugs we should be building
patterns out of (or completely invalidating). I've got a preliminary
piece of gerrit bulk-query code that does this, which would remove the
need for the daemon we run today. The gerrit queries are a little long
right now, but I think if we are only doing this on an hourly cron,
the additional load will be negligible.
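As a sketch of the work-queue idea (the function, regex, and bug numbers below are illustrative, not the actual bulk-query code):

```python
import re

# Match "recheck bug ####" comments as they appear in gerrit review
# messages pulled down by a bulk query.
RECHECK_RE = re.compile(r"recheck bug (\d+)", re.IGNORECASE)

def uncategorized_bugs(comments, known_queries):
    """Return recheck'd bug numbers with no ER query yet, most-used first."""
    counts = {}
    for msg in comments:
        for bug in RECHECK_RE.findall(msg):
            if bug not in known_queries:
                counts[bug] = counts.get(bug, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

# Illustrative usage: bug 1234 already has an ER query, so only the
# uncurated bug surfaces in the work queue.
comments = ["recheck bug 1270212", "Recheck bug 1270212", "recheck bug 1234"]
queue = uncategorized_bugs(comments, known_queries={"1234"})
```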

I think this is fine and am all for reducing complexity, but consider
this alternative: over the break, I moved both components of
elastic-recheck onto a new server (status.openstack.org).  Since they
are now co-located, you could have the component of e-r that watches the
stream to provide responses to gerrit also note recheck actions.  You
could stick the data in a file, memcache, trove database, etc, and the
status page could display that "work queue".  No extra daemons required.
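One way the stream watcher could note recheck actions for the status page to read, per the file option above (entirely a sketch; the function name and file layout are made up):

```python
import json
import time

def note_recheck(bug_number, path="recheck-queue.json"):
    """Append a recheck event, one JSON object per line, to a queue file.

    The status page would read this file back to render the work queue
    of uncategorized bugs.
    """
    event = {"bug": bug_number, "seen_at": time.time()}
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```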

So I've got the bulk query written, which means we could have this by the end of next week with the approach I've got. I think handling the rest of this is an optimization.

I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible.  If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.

So my experience is that most rechecks happen > 1 hr after a patch fails, and the people who are sitting on patches for bugs that have never been seen before find their way to IRC.

The current state of the world is not all roses and unicorns. The recheck daemon has died and gone unnoticed for *weeks*. So a guarantee that we are at most 1 hr delayed would actually be better, on average, than the delays we've seen over the last six months of following the event stream.

And again, we can optimize that over time.

I also think that caching should probably happen in gerritlib itself. There is a concern that too many things are hitting gerrit, with the result that everyone is implementing their own client-side caching to try to be nice (like the pickles in Russell's review stats programs). That seems like the wrong place to be doing it.

But, part of the reason for this email was to sort these sorts of issues out, so let me know if you think the caching issue is an architectural blocker.

Because if we're generally agreed on the architecture going forward and are just reviewing for correctness, the code can move fast, and we can actually have ER 1.0 by the end of the month. Architecture review in gerrit is where we grind to a halt.

        -Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev