On 08/17/2013 12:26 AM, Clint Byrum wrote:
Excerpts from Maru Newby's message of 2013-08-16 16:42:23 -0700:
On Aug 16, 2013, at 11:44 AM, Clint Byrum <cl...@fewbar.com> wrote:
Excerpts from Maru Newby's message of 2013-08-16 11:25:07 -0700:
Neutron has been in and out of the gate for the better part of the past month,
and it didn't slow the pace of development one bit. Most Neutron developers
kept on working as if nothing was wrong, blithely merging changes with no
guarantees that they weren't introducing new breakage. New bugs were indeed
merged, greatly increasing the time and effort required to get Neutron back in
the gate. I don't think this is sustainable, and I'd like to make a suggestion
for how to minimize the impact of gate breakage.
For the record, I don't think consistent gate breakage in one project should be
allowed to hold up the development of other projects. The current approach of
skipping tests or otherwise making a given job non-voting for innocent projects
should continue. It is arguably worth taking the risk of relaxing gating for
those innocent projects rather than halting development unnecessarily.
However, I don't think it is a good idea to relax a broken gate for the
offending project. So if a broken job/test is clearly Neutron related, it
should continue to gate Neutron, effectively preventing merges until the
problem is fixed. This would both raise the visibility of breakage beyond the
person responsible for fixing it, and prevent additional breakage from slipping
past were the gating to be relaxed.
Thoughts?
I think this is a cultural problem related to the code review discussion
from earlier in the week.
We don't look at finding a defect and reverting as a good thing that
should earn high fives all around. Instead, "you broke the gate"
seems to mean "you are a bad developer". I have been a bad actor here too,
getting frustrated with the gate-breaker and saying the wrong thing.
The problem really is "you _broke_ the gate". It should be "the gate has
found a defect, hooray!". It doesn't matter what causes the gate to stop,
it is _always_ a defect. Now, it is possible the defect is in tempest,
or jenkins, or HP/Rackspace's clouds where the tests run. But it is
always a defect when something that worked before no longer works.
Defects are to be expected. None of us can write perfect code. We should
be happy to revert commits and go forward with an enabled gate while
the team responsible for the commit gathers information and works to
correct the issue.
You're preaching to the choir, and I suspect that anyone with an interest in
software quality is likely to prefer problem solving to finger pointing.
However, my intent with this thread was not to promote more constructive
thinking about defect detection. Rather, I was hoping to communicate a flaw in
the existing process and seek consensus on how that process could best be
modified to minimize the cost of resolving gate breakage.
I believe that the process is a symptom of the culture. If we were
more eager to revert/discover/fix/re-submit on failure, we wouldn't
be turning off the gate for things. Instead we cling to whatever has
had the requisite "+2/approval", as if passing that stringent review had
imparted our code with magical powers that will eventually yield a
passing gate.
In a perfect world we could make our CI infrastructure bisect the failures
to try and isolate the commits that caused them, so that anybody can see
the commit that did the damage and revert it quickly. Realistically, most
of the time we remove a job from the gate because the failures are
intermittent and take _forever_ to discover, so that may not even be possible.
I am suggesting that we all change our perspective and embrace "revert
this immediately" as "thank you for finding that defect" not "you jerk
why did you revert my code". It may still be hard to find which commit
to revert, but at least people can spend that time knowing that they
will be rewarded, rather than punished, for their efforts.
Late to the thread (was out), but an important clarification here is that
most gate breaks aren't 100% failures; they fail 5% or 2% or 1% (or less)
of the time.
For a patch to land in Nova it's got to pass tempest 3 times in a gate
run (and it probably won't have been pushed there until it passed in the
check run), which means it's got to work at least 90% of the time.
Because Neutron only runs 1 configuration of tempest, and only in smoke
mode, a patch that only works 50% of the time can easily land.
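
To put rough numbers on that, here's a quick back-of-envelope model. It
assumes each tempest run fails independently with the same probability,
which is obviously a simplification of the real gate, and the rates are
made up:

# Rough model: a defect that trips tempest with probability p slips
# through n independent runs with probability (1 - p) ** n.
def chance_of_landing(failure_rate, runs):
    return (1 - failure_rate) ** runs

# ~4 runs for a Nova-style check + gate, 1 run for a single smoke job.
for p in (0.02, 0.10, 0.50):
    print("fails %2d%% of runs -> lands %4.1f%% of the time with 4 runs, "
          "%4.1f%% with 1 run"
          % (p * 100, chance_of_landing(p, 4) * 100,
             chance_of_landing(p, 1) * 100))

A 2% defect still gets through more than 9 times out of 10 even with four
runs, which is exactly why the low-frequency failures are the ones that
pile up.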
Bisection of patches to the failure point only works if you have a
binary test for success and failure. In the Tempest gate, with a real
devstack environment running 20 services asynchronously on
variable-performance guests in a cloud, if we had a consistent binary
test we'd never have landed the code in the first place.
So there isn't an automatic bisection solution.
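
To make that concrete, here's a toy simulation with entirely made-up
rates (nothing to do with our real numbers): give every run a 2% chance
of an unrelated infrastructure flake, make the actual defect fail only
30% of the time, and do a plain single-run bisect over 100 commits.

import random

COMMITS = 100       # commits between last known-good and known-bad
CULPRIT = 60        # commit that introduced the intermittent defect
DEFECT_RATE = 0.30  # the defect only fails 30% of runs
FLAKE_RATE = 0.02   # unrelated flakiness on every run

def run_tests(commit):
    """One tempest-style run; True means it passed."""
    if random.random() < FLAKE_RATE:
        return False                 # infra flake, any commit
    if commit >= CULPRIT and random.random() < DEFECT_RATE:
        return False                 # the real intermittent defect
    return True

def naive_bisect():
    """Plain bisection with a single run per step."""
    good, bad = 0, COMMITS - 1       # endpoints taken on faith
    while bad - good > 1:
        mid = (good + bad) // 2
        if run_tests(mid):
            good = mid
        else:
            bad = mid
    return bad                       # first commit reported bad

trials = 2000
hits = sum(naive_bisect() == CULPRIT for _ in range(trials))
print("bisect found the real culprit in %.1f%% of %d trials"
      % (100.0 * hits / trials, trials))

With those rates the probes point the wrong way often enough that the
search only lands on the real culprit a few percent of the time, and the
only fix is to repeat each bisection step many times over, which is
exactly the _forever_ Clint was talking about.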
-Sean
--
Sean Dague
http://dague.net