So, we're currently 70 deep in the gate, and the top of the queue went in > 40 hrs ago (probably closer to 50 or 60, but we only have enqueue times going back to the zuul restart).
I have a couple of ideas about things we should do based on what I've seen in the gate during this wedge.

= Remove reverify entirely =

Core reviewers can trigger a requeue with +A state changes. Reverify right now is exceptionally dangerous in that it lets *any* user put something back in the gate, even if it can't pass. There are a ton of users that believe they are being helpful in doing so, and are making things a ton worse, stable/havana changes being a prime instance.

If we were being prolog tricky, I'd actually like to make changes with a Jenkins -2 need a positive run before they could be reenqueued. For instance, I saw a swift core developer run "reverify bug 123456789" again on a change that couldn't pass. While -2s are mostly races at this point, the people who are choosing to ignore them are not staying on top of what's going on in the queue enough to really know whether or not trying again is ok.

= Early Fail Detection =

With the tempest run now coming in north of an hour, I think we need to bump up the priority of signaling up to jenkins that we've failed the first time we see a failure in the subunit stream. If we fail at 30 minutes, waiting until 60 for a reset just adds far more delay. I'm not really sure how we get started on this one, but I think we should (a rough sketch of the idea is at the bottom of this mail).

= Pep8 kick out of check =

I think on the check queue we should run pep8 first, and not run other tests until that passes (this reverses a previous opinion I had). We're now starving nodepool, and not spending 5 nodepool nodes on patches that don't pass pep8 would be handy. When Dan pushes a 15 patch series that fixes nova-network, and patch 4 has a pep8 error, we thrash a bunch.

= More aggressive kick out by zuul =

We have issues where projects have racing unit tests, which they've not prioritized fixing, so those create wrecking balls in the gate. Previously we've been opposed to kicking those out, on the theory that the patch ahead could be the problem (which I've actually never seen happen). However... this is actually fixable. We could check whether there is anything ahead of the failing change in zuul that runs the same tests. If there isn't, then it's not possible that something ahead of it could fix it. This is based on the same logic zuul uses to build the queue in the first place, and it would shed the wrecking balls earlier (again, a rough sketch is below).

= Periodic recheck on old changes =

I think Michael Still said he was working on this one. Certain projects, like Glance and Keystone, tend to approve things with really stale test results (> 1 month old). These fail, and then tumble. They are a big source of the wrecking balls. Test results > 1 week old are clearly irrelevant; for something like nova, > 3 days can be problematic. (A sketch of a simple cron approach is at the bottom as well.)

I'm sure there are some other ideas, but I wanted to dump this out while it was fresh in my brain.

	-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
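
Rough sketch of the early fail detection idea: a tiny watcher that reads the subunit output as the job runs and bails on the first failing test, so the wrapper can report back to jenkins right away instead of waiting out the full hour. Everything here is illustrative only -- it assumes a v1-style text subunit stream and hand-waves the abort hook; a real version would decode subunit v2 with python-subunit and plug into the job wrapper.

#!/usr/bin/env python
# Illustrative sketch only: watch a (v1-style, text) subunit stream
# and exit non-zero on the first test failure.
import re
import sys

# subunit v1 text protocol marks failures with lines like
#   "failure: tempest.api.foo.test_bar ["
FAIL_RE = re.compile(r'^(failure|error):\s+(\S+)')


def first_failure(stream):
    """Return the id of the first failed test seen on the stream,
    or None if we hit EOF without seeing one."""
    for line in stream:
        m = FAIL_RE.match(line)
        if m:
            return m.group(2)
    return None


if __name__ == '__main__':
    # e.g. the job wrapper might pipe live output through this:
    #   testr run --subunit | tee results.subunit | python early_fail.py
    failed = first_failure(sys.stdin)
    if failed:
        print("early failure detected in %s, aborting" % failed)
        # A non-zero exit here would let the wrapper kill the job and
        # report the failure up to jenkins/zuul right away.
        sys.exit(1)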
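
Rough sketch of the more aggressive kick out: the queue item and job objects below are stand-ins, not zuul's real data model; the point is just that a failing change only keeps its place if something ahead of it in the shared queue actually runs one of the jobs it failed.

# Illustrative only: "item" objects with .jobs (each job having .name
# and .failed) and .ahead (the items in front of it in the shared
# change queue) are stand-ins for zuul's internals.

def could_be_fixed_by_something_ahead(item):
    """True if any change ahead runs a job this change failed, i.e.
    the failure might plausibly be caused (and cured) by a reset of
    an earlier change."""
    failed_names = {j.name for j in item.jobs if j.failed}
    if not failed_names:
        return False
    for other in item.ahead:
        if failed_names & {j.name for j in other.jobs}:
            return True
    return False


def should_kick_out(item):
    # If nothing ahead even runs the tests this change failed, nothing
    # ahead can fix it -- shed the wrecking ball immediately instead
    # of letting it reset the whole queue behind it.
    return (any(j.failed for j in item.jobs)
            and not could_be_fixed_by_something_ahead(item))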
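
And the periodic recheck one (which Michael may already be further along on): the simplest version I can picture is a cron job that walks open changes and leaves a recheck comment when the last results are older than some per-project threshold. The gerrit plumbing below is hand-waved -- get_open_changes, last_result_age, and post_comment are hypothetical helpers, just there to show the shape of it.

import datetime

# Hypothetical helpers -- real code would talk to gerrit via its ssh
# query interface or REST API; these names are made up for the sketch.
from gerrit_client import get_open_changes, last_result_age, post_comment

# Per-project staleness limits, per the numbers above: > 1 week is
# clearly irrelevant everywhere, nova gets twitchy after ~3 days.
MAX_AGE = {
    'openstack/nova': datetime.timedelta(days=3),
}
DEFAULT_MAX_AGE = datetime.timedelta(days=7)


def recheck_stale_changes():
    for change in get_open_changes():
        limit = MAX_AGE.get(change.project, DEFAULT_MAX_AGE)
        if last_result_age(change) > limit:
            # A plain recheck refreshes results in the check queue
            # without putting the change anywhere near the gate.
            post_comment(change, 'recheck no bug')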