On Thu, Jan 2, 2014 at 6:44 PM, Clark Boylan <clark.boy...@gmail.com> wrote:
> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <s...@dague.net> wrote:
>> A lot of elastic recheck this fall has been based on the ad hoc needs
>> of the moment, in between diving down into the race bugs that were
>> uncovered by it. This week away from it all helped provide a little
>> perspective on what I think we need to do to call it *done* (i.e.
>> something akin to a 1.0, even though we are CDing it).
>>
>> Here is my current thinking on the next major things that should
>> happen. Opinions welcomed.
>>
>> (These are roughly in implementation order based on urgency.)
>>
>> = Split of web UI =
>>
>> The elastic recheck page is becoming a mishmash of what was needed at
>> the time. I think what we really have emerging is:
>> * Overall Gate Health
>> * Known (to ER) Bugs
>> * Unknown (to ER) Bugs - more below
>>
>> I think the landing page should be Known Bugs, as that's where we want
>> bug hunters to go to prioritize things, as well as where people
>> looking for known bugs should start.
>>
>> I think the overall Gate Health graphs should move to the zuul status
>> page, possibly as part of the collection of graphs at the bottom.
>>
>> We should have a secondary page (maybe column?) of the
>> un-fingerprinted recheck bugs, largely to use as candidates for
>> fingerprinting. This will let us eventually take over /recheck.
>>
>> = Data Analysis / Graphs =
>>
>> I spent a bunch of time playing with pandas over break
>> (http://dague.net/2013/12/30/ipython-notebook-experiments/); it's kind
>> of awesome. It also made me rethink our approach to handling the data.
>>
>> I think the rolling average approach we were taking is more precise
>> than accurate. As these are statistical events, they really need error
>> bars. When we have a quiet night and one job fails at 6am, the 100%
>> failure rate that shows up for grenade needs to be qualified: it was
>> 1 of 1, not 50 of 50.
>>
>> So my feeling is we should move away from the point graphs we have and
>> present these as weekly and daily failure rates (with graphs and error
>> bars), sliced per job. My suggestion is that we do the actual
>> visualization with matplotlib, because it's super easy to output that
>> from pandas data sets.
>>
>> Basically we'll be mining Elastic Search -> pandas TimeSeries ->
>> transforms and analysis -> output tables and graphs. This is different
>> enough from our current jquery graphing that I want to get ACKs before
>> doing a bunch of work here and finding out people don't like it in
>> reviews.
>>
>> Also in this process we should upgrade the metadata that we provide
>> for each of those bugs so it's a little more clear what you are
>> looking at.
>>
>> = Take over of /recheck =
>>
>> There is still a bunch of useful data coming in as "recheck bug ####"
>> comments which hasn't been curated into ER queries. I think the right
>> thing to do is treat these as a work queue of bugs we should be
>> building patterns out of (or completely invalidating). I've got a
>> preliminary gerrit bulk query piece of code that does this, which
>> would remove the need for the daemon we currently run. The gerrit
>> queries are a little long right now, but I think if we are only doing
>> this on an hourly cron, the additional load will be negligible.
>>
>> This would get us into a single view, which I think would be more
>> informative than the one we currently have.
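To make the error-bar idea in the Data Analysis section concrete, here
is a minimal sketch of the pandas/matplotlib end of that pipeline. The
input frame, the column names, and the choice of an Agresti-Coull
interval are illustrative assumptions, not anything ER does today; the
point is just that a 1-of-1 day stops reading as a rock-solid 100%.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical input: one row per gate run mined out of Elastic
    # Search; 'failed' is 1 for a failing run, 0 for a passing one,
    # indexed by run time. Random data so the sketch runs standalone.
    runs = pd.DataFrame(
        {"failed": np.random.binomial(1, 0.05, 500)},
        index=pd.date_range("2014-01-01", periods=500, freq="30min"),
    )

    # Failures and run counts per day.
    daily = runs["failed"].resample("D").agg(["sum", "count"])

    # Agresti-Coull interval: add 2 successes and 2 failures before
    # computing the rate, so a quiet night where 1 of 1 runs failed
    # gets a huge error bar instead of a confident-looking 100%.
    n = daily["count"] + 4
    p = (daily["sum"] + 2) / n
    err = 1.96 * np.sqrt(p * (1 - p) / n)

    plt.errorbar(daily.index, p, yerr=err, fmt="o")
    plt.ylabel("daily failure rate")
    plt.show()

The same frame resampled with "W" gives the weekly view, so the daily
and weekly graphs fall out of one code path.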
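And on the /recheck takeover: I haven't seen the preliminary bulk query
code, so this is only a guess at its shape. The ssh query interface and
its --comments JSON output are stock gerrit; the regex and the tallying
are made up for illustration.

    import json
    import re
    import subprocess

    # Pull recent changes, with their review comments, over gerrit's
    # ssh query API. Output is one JSON document per line; the final
    # line is a stats record that we skip.
    cmd = [
        "ssh", "-p", "29418", "review.openstack.org",
        "gerrit", "query", "--format=JSON", "--comments",
        "status:open", "limit:200",
    ]
    out = subprocess.check_output(cmd).decode("utf-8")

    RECHECK = re.compile(r"recheck bug (\d+)")

    # Tally which bugs people actually recheck against, i.e. the work
    # queue of candidates for fingerprinting (or invalidating).
    counts = {}
    for line in out.splitlines():
        change = json.loads(line)
        if change.get("type") == "stats":
            continue
        for comment in change.get("comments", []):
            for bug in RECHECK.findall(comment.get("message", "")):
                counts[bug] = counts.get(bug, 0) + 1

    for bug, hits in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("bug %s: %d rechecks" % (bug, hits))

Something like this on an hourly cron does look cheap next to running
a daemon all the time.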
>> = Categorize all the jobs =
>>
>> We need a bit of refactoring to let us comment on all the jobs (not
>> just tempest ones). Basically, at the beginning we assumed pep8 and
>> docs jobs don't fail in the gate. It turns out they do, and they are
>> good indicators of infra / external-factor bugs. They are a part of
>> the story, so we should put them in.
>>
>> = Multi Line Fingerprints =
>>
>> We've definitely found bugs where we never had a really satisfying
>> single-line match, but where we had some great matches if we could do
>> multi-line queries.
>>
>> We could do that in ER; however, it will mean giving up logstash as
>> our UI, because those queries can't be done in logstash. So in order
>> to do this we'll really need to implement some tools - a CLI at
>> minimum, which will let us easily test a bug. A custom web UI might
>> be in order as well, though that's going to be its own chunk of work
>> that we'll need more volunteers for.
>>
>> This would put us in a place where we should have all the
>> infrastructure to track 90% of the race conditions, and talk about
>> them with certainty as 1%, 5%, 0.1% bugs.
>>
>> -Sean
>>
>> --
>> Sean Dague
>> Samsung Research America
>> s...@dague.net / sean.da...@samsung.com
>> http://dague.net
>
> This is great stuff. Out of curiosity, is the reason for doing the
> graphing with pandas and ES, rather than graphite, that we can graph
> things in a more ad hoc fashion? Also, for the dashboard, Kibana3 does
> a lot more than the Kibana2 we currently use. I have been meaning to
> get Kibana3 running alongside Kibana2, and I think it may be able to
> do multi-line queries (I need to double check that, but it has a lot
> more query and graphing capability). I think Kibana3 is worth looking
> into as well before we go too far down the road of a custom UI.
>
> Clark
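On the multi-line fingerprints: each log line is its own document in
logstash's Elastic Search, so no single query can require two different
lines. A CLI tool could instead run one query per fingerprint line and
join the results client side. A rough sketch, assuming the documents
carry a build_uuid field to join on; the endpoint and the fingerprints
themselves are made up:

    import requests

    # Hypothetical endpoint; point at wherever the logstash ES lives.
    ES = "http://logstash.openstack.org:9200/_search"

    def builds_matching(query_string):
        """Return the build_uuids with at least one matching log line."""
        body = {
            "query": {"query_string": {"query": query_string}},
            "size": 1000,
            "_source": ["build_uuid"],
        }
        hits = requests.post(ES, json=body).json()["hits"]["hits"]
        return {h["_source"]["build_uuid"] for h in hits}

    # Both fingerprint lines must appear in the same build: run each
    # query separately and intersect on build_uuid, which is the AND
    # that the logstash UI cannot express.
    line_a = builds_matching('message:"error: [Errno 111] Connection refused"'
                             ' AND build_status:"FAILURE"')
    line_b = builds_matching('message:"Details: Timed out waiting for thing"')

    for uuid in sorted(line_a & line_b):
        print(uuid)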
A quick check at http://demo.kibana.org/#/dashboard shows that while it
supports multiple queries, it just ORs all of the results together. So
it doesn't quite do what we need.

Clark

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev