On Sun, Jan 19, 2014 at 7:01 AM, Monty Taylor <mord...@inaugust.com> wrote:
> On 01/19/2014 05:38 AM, Sean Dague wrote:
>> So, we're currently 70 deep in the gate, top of queue went in > 40 hrs
>> ago (probably closer to 50 or 60, but we only have enqueue time going
>> back to the zuul restart).
>>
>> I have a couple of ideas about things we should do based on what I've
>> seen in the gate during this wedge.
>>
>> = Remove reverify entirely =
>
> Yes. Screw it. In a deep queue like now, it's more generally harmful
> than good.

I agree with this one, but we should also try to educate the devs: in the
case you brought up below it was a core dev who didn't examine why his
patch failed, and if he couldn't do "reverify bug" he could just
re-approve with +A anyway.

>> Core reviewers can trigger a requeue with +A state changes. Reverify
>> right now is exceptionally dangerous in that it lets *any* user put
>> something back in the gate, even if it can't pass. There are a ton of
>> users that believe they are being helpful in doing so, and making
>> things a ton worse. stable/havana changes being a prime instance.
>>
>> If we were being prolog tricky, I'd actually like to make Jenkins -2
>> changes need a positive run before they could be reenqueued. For
>> instance, I saw a swift core developer run "reverify bug 123456789"
>> again on a change that couldn't pass. While -2s are mostly races at
>> this point, the team of people that are choosing to ignore them are
>> not staying up on what's going on in the queue enough to really know
>> whether or not trying again is ok.
>>
>> = Early Fail Detection =
>>
>> With the tempest run now coming in north of an hour, I think we need
>> to bump up the priority of signaling up to jenkins that we're a
>> failure the first time we see that in the subunit stream. If we fail
>> at 30 minutes, waiting for 60 until a reset is just adding far more
>> delay.
>>
>> I'm not really sure how we get started on this one, but I think we
>> should.
>
> This one I think will be helpful, but it also is the one that involves
> the most in-depth development. Honestly, the chances of getting it done
> this week are almost none.
>
> That said - I agree we should accelerate working on it. I have access
> to a team of folks in India with both python and java backgrounds - if
> it would be helpful and if we can break out work into, you know,
> assignable chunks, let me know.
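For the early fail detection piece, here's a rough sketch of the kind of
watcher I have in mind. It assumes python-subunit's
ByteStreamToStreamResult / testtools StreamResult API; the script itself
and how it would get wired into the job are purely illustrative, nothing
like this exists today:

    #!/usr/bin/env python
    # Illustrative only: watch a subunit v2 stream on stdin and bail out
    # as soon as the first test failure shows up, instead of waiting for
    # the full tempest run to finish.
    import sys

    import subunit
    import testtools


    class FirstFailureDetector(testtools.StreamResult):
        def status(self, test_id=None, test_status=None, **kwargs):
            if test_status == 'fail':
                # In a real job this would need to signal jenkins/zuul
                # to abort the run; here we just exit non-zero.
                sys.stderr.write('Early failure: %s\n' % test_id)
                sys.exit(1)


    if __name__ == '__main__':
        # e.g.: testr run --subunit | python early-fail-watch.py
        stream = subunit.ByteStreamToStreamResult(sys.stdin,
                                                  non_subunit_name='stdout')
        stream.run(FirstFailureDetector())

Parsing the stream is the easy bit; the real work is presumably on the
jenkins/zuul side, aborting the job cleanly and reporting the failure.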
>> = Pep8 kick out of check =
>>
>> I think on the Check Queue we should pep8 first, and not run other
>> tests until that passes (this reverses a previous opinion I had).
>> We're now starving nodepool. Preventing taking 5 nodepool nodes on
>> patches that don't pass pep8 would be handy. When Dan pushes a
>> 15-patch series that fixes nova-network, and patch 4 has a pep8
>> error, we thrash a bunch.
>
> Agree. I think this might be one of those things that goes back and
> forth on being a good or bad idea over time. I think now is a time
> when it's a good idea.

What about adding a pre-gate queue that makes sure pep8 and unit tests
pass before a change is enqueued to the gate? (Of course this would mean
we would have to re-run pep8 and unit tests in the gate.) Hopefully this
would reduce the amount of gate thrashing caused by a gate patch that
fails one of those jobs.

>> = More aggressive kick out by zuul =
>>
>> We have issues where projects have racing unit tests, which they've
>> not prioritized fixing. So those create wrecking balls in the gate.
>> Previously we've been opposed to kicking those out based on the theory
>> the patch ahead could be the problem (which I've actually never seen).
>>
>> However.... this is actually fixable. We could see if there is
>> anything ahead of it in zuul that runs the same tests. If not, then
>> it's not possible that something ahead of it could fix it. This is
>> based on the same logic zuul uses to build the queue in the first
>> place.
>>
>> This would shed the wrecking balls earlier.
>
> Interesting. How would zuul be able to investigate that? Do we need
> zuul-subunit-consumption for this one too?

(A rough sketch of how that check could look is at the bottom of this
mail.)

>> = Periodic recheck on old changes =
>>
>> I think Michael Still said he was working on this one. Certain
>> projects, like Glance and Keystone, tend to approve things with
>> really stale test results (> 1 month old). These fail, and then
>> tumble. They are a big source of the wrecking balls.
>
> I believe he's got it working, actually. I think the real trick with
> this - which I whole-heartedly approve of - is not making node
> starvation worse.

>> Test results > 1 week old are clearly irrelevant. For something like
>> nova, anything > 3 days can be problematic.
>>
>> I'm sure there are some other ideas, but I wanted to dump this out
>> while it was fresh in my brain.
>>
>> -Sean
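On the "more aggressive kick out by zuul" question above: it probably
doesn't need subunit consumption - zuul already knows which jobs each
queue item runs, and that should be enough. Purely illustrative
pseudo-code (the item/job attributes are made up, not zuul's real
internals):

    # Idea: when a change reports a failed job, evict it right away
    # unless some change ahead of it in the shared queue runs a job
    # with the same name, and so could in principle be the cause of
    # the failure.

    def something_ahead_could_fix(failed_item, items_ahead):
        failed_jobs = set(job.name for job in failed_item.jobs
                          if job.result == 'FAILURE')
        for item in items_ahead:
            if failed_jobs & set(job.name for job in item.jobs):
                return True
        return False


    def on_job_failure(queue, failed_item):
        if not something_ahead_could_fix(failed_item,
                                         queue.items_ahead_of(failed_item)):
            # Nothing ahead runs the failing jobs, so a gate reset
            # can't change the outcome - shed the wrecking ball now
            # instead of letting it reset the queue over and over.
            queue.dequeue(failed_item)

That would mostly catch the project-local unit test races Sean mentions,
since integration jobs are usually shared with whatever is ahead.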
_______________________________________________
OpenStack-Infra mailing list
OpenStack-Infra@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra