On 01/03/2014 12:09 PM, James E. Blair wrote:
Sean Dague <s...@dague.net> writes:

So my feeling is we should move away from the point graphs we have,
and present these as weekly and daily failure rates (with graphs and
error bars). And slice those per job. My suggestion is that we do the
actual visualization with matplotlib because it's super easy to output
that from pandas data sets.
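A minimal sketch of the kind of per-job weekly aggregation this implies, assuming a hypothetical DataFrame of test runs with a boolean `failed` column (all names and data here are illustrative, not the actual ER schema):

```python
import numpy as np
import pandas as pd

# Illustrative input: one row per test run.
runs = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2014-01-01", "2014-01-02", "2014-01-08", "2014-01-09"]),
    "job": ["gate-tempest-dsvm-full"] * 4,
    "failed": [True, False, False, True],
})

# Weekly failure rate per job, plus a simple binomial standard
# error to draw as error bars.
weekly = (runs.groupby(["job", pd.Grouper(key="timestamp", freq="W")])
              ["failed"].agg(["mean", "count"]))
weekly["stderr"] = np.sqrt(
    weekly["mean"] * (1 - weekly["mean"]) / weekly["count"])

# Plotting with matplotlib would then be roughly:
# weekly["mean"].unstack(0).plot(yerr=weekly["stderr"].unstack(0))
```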

I am very excited about this and everything above it!

= Takeover of /recheck =

There is still a bunch of useful "recheck bug ####" data coming in
which hasn't been curated into ER queries. I think the right thing to
do is treat these as a work queue of bugs we should be building
patterns out of (or completely invalidating). I've got a preliminary
piece of gerrit bulk-query code that does this, which would remove the
need for the daemon we run today. The gerrit queries are a little long
right now, but I think if we are only doing this on an hourly cron,
the additional load will be negligible.
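As a sketch of the work-queue idea (the function, regex, and bug numbers below are illustrative, not the actual bulk-query code):

```python
import re

# Match "recheck bug ####" comments as they appear in gerrit review
# messages pulled down by a bulk query.
RECHECK_RE = re.compile(r"recheck bug (\d+)", re.IGNORECASE)

def uncategorized_bugs(comments, known_queries):
    """Return recheck'd bug numbers with no ER query yet, most-used first."""
    counts = {}
    for msg in comments:
        for bug in RECHECK_RE.findall(msg):
            if bug not in known_queries:
                counts[bug] = counts.get(bug, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)

# Illustrative usage: bug 1234 already has an ER query, so only the
# uncurated bug surfaces in the work queue.
comments = ["recheck bug 1270212", "Recheck bug 1270212", "recheck bug 1234"]
queue = uncategorized_bugs(comments, known_queries={"1234"})
```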

I think this is fine and am all for reducing complexity, but consider
this alternative: over the break, I moved both components of
elastic-recheck onto a new server (status.openstack.org).  Since they
are now co-located, you could have the component of e-r that watches the
stream to provide responses to gerrit also note recheck actions.  You
could stick the data in a file, memcache, trove database, etc, and the
status page could display that "work queue".  No extra daemons required.
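One way the stream watcher could note recheck actions for the status page to read, per the file option above (entirely a sketch; the function name and file layout are made up):

```python
import json
import time

def note_recheck(bug_number, path="recheck-queue.json"):
    """Append a recheck event, one JSON object per line, to a queue file.

    The status page would read this file back to render the work queue
    of uncategorized bugs.
    """
    event = {"bug": bug_number, "seen_at": time.time()}
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```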

So I've got the bulk query written, which means we could have this by the end of next week with the approach I've got. I think handling the rest of this is an optimization.

I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible.  If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.

So my experience is that most rechecks happen > 1 hr after a patch fails, and the people who are sitting on patches for bugs that have never been seen before find their way to IRC.

The current state of the world is not all roses and unicorns. The recheck daemon has died and gone unnoticed for *weeks*. So a guarantee that we are at most 1 hr delayed would actually be better, on average, than the delays we've seen over the last six months of following the event stream.

And again, we can optimize that over time.

I also think that caching should probably happen in gerritlib itself. There is a concern that too many things are hitting gerrit, with the result that everyone is implementing their own client-side caching to try to be nice (like the pickles in Russell's review stats programs). That seems like the wrong place to be doing it.

But, part of the reason for this email was to sort these sorts of issues out, so let me know if you think the caching issue is an architectural blocker.

Because if we're generally agreed on the architecture going forward and are just reviewing for correctness, the code can move fast, and we can actually have ER 1.0 by the end of the month. Architecture review in gerrit is where we grind to a halt.

        -Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev