Everything sounds good!
On Mon, Jan 6, 2014 at 6:52 PM, Sean Dague <s...@dague.net> wrote:
> On 01/06/2014 07:04 PM, Joe Gordon wrote:
>> Overall this looks really good, and very spot on.
>>
>> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <s...@dague.net> wrote:
>>
>>     A lot of elastic recheck this fall has been based on the ad hoc
>>     needs of the moment, in between diving down into the race bugs that
>>     were uncovered by it. This week away from it all helped provide a
>>     little perspective on what I think we need to do to call it *done*
>>     (i.e. something akin to a 1.0, even though we are CDing it).
>>
>>     Here is my current thinking on the next major things that should
>>     happen. Opinions welcomed.
>>
>>     (These are roughly in implementation order based on urgency.)
>>
>>     = Split of web UI =
>>
>>     The elastic recheck page is becoming a mishmash of what was needed
>>     at the time. I think what we really have emerging is:
>>      * Overall Gate Health
>>      * Known (to ER) Bugs
>>      * Unknown (to ER) Bugs - more below
>>
>>     I think the landing page should be Known Bugs, as that's where we
>>     want bug hunters to go to prioritize things, as well as where
>>     people looking for known bugs should start.
>>
>>     I think the overall Gate Health graphs should move to the zuul
>>     status page, possibly as part of the collection of graphs at the
>>     bottom.
>>
>>     We should have a secondary page (maybe column?) of the
>>     un-fingerprinted recheck bugs, largely to use as candidates for
>>     fingerprinting. This will let us eventually take over /recheck.
>>
>> I think it would be cool to collect the list of unclassified failures
>> (not by recheck bug), so we can see how many (and what percentage) need
>> to be classified. This isn't gate health but more like e-r health or
>> something like that.
>
> Agreed. I've got the percentage in check_success today, but I agree that
> every gate job failure that we don't have a fingerprint for should be
> listed somewhere we can work through them.
>
>>     = Data Analysis / Graphs =
>>
>>     I spent a bunch of time playing with pandas over break
>>     (http://dague.net/2013/12/30/ipython-notebook-experiments/); it's
>>     kind of awesome. It also made me rethink our approach to handling
>>     the data.
>>
>>     I think the rolling average approach we were taking is more precise
>>     than accurate. As these are statistical events they really need
>>     error bars: when we have a quiet night and 1 job fails at 6am in the
>>     morning, the 100% failure rate it reflects in grenade needs to be
>>     quantified as 1 of 1, not 50 of 50.
>>
>>     So my feeling is we should move away from the point graphs we have,
>>     and present these as weekly and daily failure rates (with graphs and
>>     error bars), and slice those per job. My suggestion is that we do
>>     the actual visualization with matplotlib because it's super easy to
>>     output that from pandas data sets.
>>
>> The one thing that the current graph does, that weekly and daily failure
>> rates don't show, is a sudden spike in one of the lines. If you stare
>> at the current graphs for long enough and can read through the noise,
>> you can see when the gate collectively crashes or when just the neutron
>> related gates start failing. So I think one more graph is needed.
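On the daily/weekly failure rates with error bars: here is roughly what I'm
picturing on the pandas side. This is only a sketch, not working e-r code: the
stand-in data is random, the 'build_name'/'build_status' columns mirror what
our logstash data roughly looks like but should be treated as illustrative,
and the grenade job name is just an example.

    # Sketch: daily failure rate per job, with error bars, from a DataFrame
    # of individual job runs (timestamp index, one row per run).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Stand-in data; the real thing would come out of an Elasticsearch pull.
    runs = pd.DataFrame(
        {'build_name': 'gate-grenade-dsvm',
         'build_status': np.random.choice(['SUCCESS', 'FAILURE'], 500,
                                          p=[0.9, 0.1])},
        index=pd.date_range('2014-01-01', periods=500, freq='h'))

    def daily_failure_rate(runs, job):
        job_runs = runs[runs['build_name'] == job]
        failed = (job_runs['build_status'] == 'FAILURE').resample('D').sum()
        total = job_runs['build_status'].resample('D').count()
        rate = failed / total
        # Agresti-Coull style adjustment so a 1-of-1 failure day gets a wide
        # error bar instead of looking like a confident 100% failure rate.
        n = total + 4
        p = (failed + 2) / n
        err = 1.96 * np.sqrt(p * (1 - p) / n)
        return rate, err

    rate, err = daily_failure_rate(runs, 'gate-grenade-dsvm')
    plt.errorbar(rate.index, rate, yerr=err, fmt='o')
    plt.show()

The nice property is that the quiet-night case falls out automatically: a
1-of-1 failure day plots with a huge error bar rather than as a scary 100%.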
>
> The point of the visualizations is to make sense to people that don't
> understand all the data, especially core members of various teams that are
> trying to figure out "if I attack 1 bug right now, what's the biggest bang
> for my buck."

Yes, that is one of the big uses for a visualization. The one I had in mind
was being able to see if a new unclassified bug appeared.

>>     Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
>>     transforms and analysis -> output tables and graphs. This is
>>     different enough from our current jquery graphing that I want to get
>>     ACKs before doing a bunch of work here and finding out people don't
>>     like it in reviews.
>>
>>     Also in this process, upgrade the metadata that we provide for each
>>     of those bugs so it's a little more clear what you are looking at.
>>
>> For example?
>
> We should always be listing the bug title, not just the number. We should
> also list what projects it's filed against. I've stared at these bugs as
> much as anyone, and I still need to click through the top 4 to figure out
> which one is the ssh bug. :)
>
>>     = Take over of /recheck =
>>
>>     There is still a bunch of useful data coming in as "recheck bug
>>     ####" comments which haven't been curated into ER queries. I think
>>     the right thing to do is treat these as a work queue of bugs we
>>     should be building patterns out of (or completely invalidating).
>>     I've got a preliminary gerrit bulk query piece of code that does
>>     this, which would remove the need for the daemon the way that's
>>     currently happening. The gerrit queries are a little long right now,
>>     but I think if we are only doing this on an hourly cron, the
>>     additional load will be negligible.
>>
>>     This would get us into a single view, which I think would be more
>>     informative than the one we currently have.
>>
>> Treating /recheck as a work queue sounds great, but this needs a bit
>> more fleshing out, I think.
>>
>> I imagine the workflow as something like this:
>>
>> * State 1: Patch author files a bug saying 'gate broke, I didn't do it
>>   and don't know why it broke'.
>> * State 2: Someone investigates the bug and determines whether it is
>>   valid and whether it's a duplicate or not. Root cause still isn't
>>   known.
>> * State 3: Someone writes a fingerprint for this bug and commits it to
>>   elastic-recheck.
>>
>> Assuming we agree on this general workflow, it would be nice if /recheck
>> distinguished between bugs in states 1 and 2; there is no need to list
>> bugs in state 3, as the e-r bot will automatically tell developers when
>> they hit them.
>
> Sure, that means a policy on something in the bugs that can distinguish
> between them. I assume LP states:
>
> State 1 = New & Invalid?
> State 2 = Confirmed / Triaged?
>
> I think we can call that post 1.0 though, as we'll be adding details
> beyond anything we have today.

Yup, this sounds like post 1.0 to me too.

>>     = Categorize all the jobs =
>>
>>     We need a bit of refactoring to let us comment on all the jobs (not
>>     just tempest ones). Basically we assumed at the beginning that pep8
>>     and docs jobs don't fail in the gate. Turns out they do, and they
>>     are good indicators of infra / external factor bugs. They are part
>>     of the story so we should put them in.
>>
>> Don't forget grenade.
>
> Yep. That's part of "all". :) I was just calling out the others as
> something not originally on the list.
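Going back to the /recheck takeover for a second: to make sure I understand
the hourly bulk-query idea, something like the sketch below is what I have in
mind. It is not Sean's preliminary code, just an illustration; I'm assuming
the stock gerrit ssh query interface (if I'm remembering the flags right),
and the query terms, regex, and per-bug counting are all made up for the
example.

    # Sketch: hourly cron job that bulk-pulls review comments from gerrit
    # and counts "recheck bug ####" hits as a candidate work queue.
    # Filtering out bugs that already have fingerprints would be a
    # separate step.
    import json
    import re
    import subprocess
    from collections import Counter

    RECHECK_RE = re.compile(r'recheck bug #?(\d+)', re.IGNORECASE)

    def recheck_bug_counts(host='review.openstack.org', port=29418):
        cmd = ['ssh', '-p', str(port), host,
               'gerrit', 'query', '--format=JSON', '--comments',
               'status:open', 'limit:500']
        out = subprocess.check_output(cmd).decode('utf-8')
        counts = Counter()
        for line in out.splitlines():
            change = json.loads(line)
            # The trailing stats record has no comments, so .get() skips it.
            for comment in change.get('comments', []):
                for bug in RECHECK_RE.findall(comment.get('message', '')):
                    counts[bug] += 1
        return counts

    if __name__ == '__main__':
        for bug, hits in recheck_bug_counts().most_common(20):
            print('bug %s: %d recheck comments' % (bug, hits))

If that matches what you had in mind, then the output list really is the
work queue: top of the list is the next fingerprint to write (or the next
bug to invalidate).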
>>     = Multi Line Fingerprints =
>>
>>     We've definitely found bugs where we never had a really satisfying
>>     single line match, but we had some great matches if we could do
>>     multi line.
>>
>>     We could do that in ER, however it will mean giving up logstash as
>>     our UI, because those queries can't be done in logstash. So in order
>>     to do this we'll really need to implement some tools - a CLI at
>>     minimum, which will let us easily test a bug. A custom web UI might
>>     be in order as well, though that's going to be its own chunk of work
>>     that we'll need more volunteers for.
>>
>>     This would put us in a place where we should have all the
>>     infrastructure to track 90% of the race conditions, and talk about
>>     them with certainty as 1%, 5%, or 0.1% bugs.
>>
>> Hurrah. Multi-line matches are two separate Elasticsearch queries where
>> you match build_uuids. So to get the set of all hits of a multi-line
>> fingerprint, you find the intersection between line_1 and line_2 where
>> the key is build_uuid.
>
> Yes. The biggest issue is tooling for making it easy for people to test
> their queries. It's pretty unfriendly to tell people to do manual
> correlation in ES.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> s...@dague.net / sean.da...@samsung.com
> http://dague.net
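On the multi-line tooling, the build_uuid intersection is simple enough that
even a tiny CLI helper would go a long way. A rough sketch of the core of it,
using the elasticsearch Python client; the endpoint, index pattern, and the
two example fingerprint lines are placeholders, and a real tool would page
through results rather than capping at 1000 hits:

    # Sketch: test a two-line fingerprint by running each line's query
    # separately and intersecting the build_uuids of the hits.
    from elasticsearch import Elasticsearch

    # Endpoint is a placeholder; use whatever the e-r config points at.
    es = Elasticsearch(['http://logstash.openstack.org:9200'])

    def build_uuids(query, size=1000):
        result = es.search(index='logstash-*', q=query, size=size)
        return {hit['_source']['build_uuid']
                for hit in result['hits']['hits']}

    # Hypothetical fingerprint lines; real ones would come from a query file.
    line_1 = ('message:"Timed out waiting for response" '
              'AND filename:"console.html"')
    line_2 = ('message:"Lock wait timeout exceeded" '
              'AND filename:"logs/screen-n-cpu.txt"')

    # A build only counts as a hit if both lines matched it.
    both = build_uuids(line_1) & build_uuids(line_2)
    print('%d builds match the multi-line fingerprint' % len(both))

Wrapping that in a small CLI that takes a query file with multiple lines
would cover the "let people easily test a bug" case without needing the
custom web UI right away.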