Re: [openstack-dev] [gate] The gate: a failure analysis

Ihar Hrachyshka Mon, 28 Jul 2014 01:38:28 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 28/07/14 08:52, Angus Lees wrote:
> On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
>> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>>> Thanks Matthew for the analysis.
>>>> 
>>>> I think you missed something though.
>>>> 
>>>> Right now the frustration is that unrelated intermittent bugs
>>>> stop your presumably good change from getting in.
>>>> 
>>>> Without gating, the result would be that even more bugs, many
>>>> of them not intermittent at all, would get in. Right now, the
>>>> one random developer who has to hunt down the rechecks and do
>>>> them is inconvenienced. But without a gate, _every single_
>>>> developer will be inconvenienced until the fix is merged.
>>>> 
>>>> The false negative rate is _way_ too high. Nobody would
>>>> disagree there. However, adding more false negatives and
>>>> allowing more people to ignore the ones we already have,
>>>> seems like it would have the opposite effect: Now instead of
>>>> annoying the people who hit the random intermittent bugs, 
>>>> we'll be annoying _everybody_ as they hit the
>>>> non-intermittent ones.
>>> 
>>> +10
>> 
>> Right, but perhaps there is a middle ground. We must not allow
>> changes in that can't pass through the gate, but we can separate
>> the problems of constant rechecks using too many resources, and
>> of constant rechecks causing developer pain. If failures were
>> deterministic we would skip the failing tests until they were
>> fixed. Unfortunately many of the common failures can blow up any
>> test, or even the whole process. Following on what Sam said, what
>> if we automatically reran jobs that failed in a known way, and
>> disallowed "recheck/reverify no bug"? Developers would then have
>> to track down what bug caused a failure or file a new one. But 
>> they would have to do so much less frequently, and as more
>> common failures were catalogued it would become less and less
>> frequent.
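
A rough sketch of the "rerun jobs that failed in a known way" idea, to make the proposal concrete. Everything here is hypothetical: the catalogue, the bug identifier, the retry limit and the action strings are invented for illustration and do not describe any existing gate tooling. The one real detail is the error text, which is the string the logstash query quoted elsewhere in this thread searches for.

```python
import re
from typing import Optional

# Hypothetical catalogue mapping a bug identifier to a log signature.
# "bug/EXAMPLE-1" is a placeholder, not a real bug number.
KNOWN_FAILURES = {
    "bug/EXAMPLE-1": re.compile(
        r"Lock wait timeout exceeded; try restarting transaction"
    ),
}


def classify_failure(log_text: str) -> Optional[str]:
    """Return the id of the first catalogued bug whose signature matches."""
    for bug_id, pattern in KNOWN_FAILURES.items():
        if pattern.search(log_text):
            return bug_id
    return None


def next_action(log_text: str, attempts: int, max_attempts: int = 3) -> str:
    """Decide what to do with a failed job run.

    Known failures are retried automatically (up to max_attempts);
    unknown failures go back to the developer, who must identify or
    file a bug instead of issuing a bare recheck.
    """
    bug_id = classify_failure(log_text)
    if bug_id is None:
        return "needs-triage"
    if attempts >= max_attempts:
        return "retry-limit-reached"
    return f"auto-recheck {bug_id}"
```

With this shape, a developer is only asked to act on failures nobody has catalogued yet, which is the reduction in pain David describes.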
>> 
>> Some might (reasonably) argue that this would be a bad thing
>> because it would reduce the incentive for people to fix bugs if
>> there were less pain being inflicted. But given how hard it is to
>> track down these race bugs, and that we as a community have no
>> way to force time to be spent on them, and that it does not
>> appear that these bugs are causing real systems to fall down
>> (only our gating process), perhaps something different should be
>> considered?
> 
> So to pick an example dear to my heart, I've been working on
> removing these gate failures: 
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
>
>  .. caused by a bad interaction between eventlet and our default
> choice of mysql driver.  It would also affect any real world
> deployment using mysql.
> 
> The problem has been identified and the fix proposed for almost a
> month now, but actually fixing the gate jobs is still nowhere in
> sight.  The fix is (pretty much) as easy as a pip install and a
> slightly modified database connection string.


[And to hijack the thread even more] The fix Angus has kindly
referred to is:

- - spec: https://review.openstack.org/#/c/108355/
- - devstack: https://review.openstack.org/#/c/105209/ (plus several
tiny fixes in multiple projects to make sure the patch succeeds in db
migration).
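
To illustrate what "a slightly modified database connection string" could mean in practice, here is a minimal sketch that swaps the driver part of an SQLAlchemy-style URL. The helper name and the example credentials are made up, and the driver name `mysqlconnector` is only an assumption for the example; see the spec and devstack reviews above for what is actually proposed.

```python
def with_driver(url: str, driver: str) -> str:
    """Return url with its driver replaced, keeping the dialect.

    SQLAlchemy-style URLs look like dialect[+driver]://user:pw@host/db.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    # Drop any existing "+driver" suffix and keep only the dialect.
    dialect = scheme.split("+", 1)[0]
    return f"{dialect}+{driver}://{rest}"


# Hypothetical connection string; the credentials are placeholders.
old_url = "mysql://nova:secret@127.0.0.1/nova?charset=utf8"
new_url = with_driver(old_url, "mysqlconnector")
print(new_url)
```

The corresponding driver package would also need to be installed (the "pip install" half of the fix).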

> I look forward to a discussion of the meta-issues surrounding this,
> but it is not because no-one tracked down or fixed the bug :(
> 
> - Gus
> 
>> -David
>> 
>>> Best, -jay
>>> 
>>>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07
>>>> -0700:
>>>>> On Friday evening I had a dependent series of 5 changes all
>>>>> with approval waiting to be merged. These were all refactor
>>>>> changes in the VMware driver. The changes were:
>>>>> 
>>>>> * VMware: DatastorePath join() and __eq__() 
>>>>> https://review.openstack.org/#/c/103949/
>>>>> 
>>>>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>>>>> https://review.openstack.org/#/c/103950/
>>>>> 
>>>>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>>>>> https://review.openstack.org/#/c/103951/
>>>>> 
>>>>> * VMware: Trivial indentation cleanups in vmops 
>>>>> https://review.openstack.org/#/c/104149/
>>>>> 
>>>>> * VMware: Convert vmops to use instance as an object 
>>>>> https://review.openstack.org/#/c/104144/
>>>>> 
>>>>> The last change merged this morning.
>>>>> 
>>>>> In order to merge these changes, over the weekend I
>>>>> manually submitted:
>>>>> 
>>>>> * 35 rechecks due to false negatives, an average of 7 per
>>>>>   change
>>>>> * 19 resubmissions after a change passed, but its dependency
>>>>>   did not
>>>>> 
>>>>> Other interesting numbers:
>>>>> 
>>>>> * 16 unique bugs
>>>>> * An 87% false negative rate
>>>>> * 0 bugs found in the change under test
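
One reading of these figures, sketched below. It assumes each of the 5 changes needed exactly one passing run on top of the 35 rechecked (failed) runs, and it ignores the 19 resubmissions; the message does not show its working, so this is a guess at the arithmetic rather than Matt's own calculation.

```python
from fractions import Fraction


def gate_stats(changes: int, rechecks: int):
    """Return (false negative rate, gate runs per merged change).

    Assumes every recheck follows a failed run and every change
    passes exactly once, so total runs = changes + rechecks.
    """
    total_runs = changes + rechecks
    false_negative_rate = Fraction(rechecks, total_runs)
    runs_per_change = Fraction(total_runs, changes)
    return false_negative_rate, runs_per_change


rate, runs = gate_stats(changes=5, rechecks=35)
print(f"false negative rate: {float(rate):.1%}")
print(f"gate runs per merged change: {float(runs):g}")
```

Under that assumption the rate is 35/40, which matches the quoted 87%, and 40 runs for 5 changes is the factor of 8 that the proposal further down cites as the resource saving.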
>>>>> 
>>>>> Because we don't fail fast, that is an average of at least
>>>>> 7.3 hours in the gate. Much more in fact, because some runs
>>>>> fail on the second pass, not the first. Because we don't
>>>>> resubmit automatically, that is only if a developer is
>>>>> actively monitoring the process continuously, and resubmits
>>>>> immediately on failure. In practice this is much longer,
>>>>> because sometimes we have to sleep.
>>>>> 
>>>>> All of the above numbers are counted from the change
>>>>> receiving an approval +2 until final merging. There were
>>>>> far more failures than this during the approval process.
>>>>> 
>>>>> Why do we test individual changes in the gate? The purpose
>>>>> is to find errors *in the change under test*. By the above
>>>>> numbers, it has failed to achieve this at least 16 times
>>>>> previously.
>>>>> 
>>>>> Probability of finding a bug in the change under test: Small
>>>>> Cost of testing: High
>>>>> Opportunity cost of slowing development: High
>>>>> 
>>>>> and for comparison:
>>>>> 
>>>>> Cost of reverting rare false positives: Small
>>>>> 
>>>>> The current process expends a lot of resources, and does
>>>>> not achieve its goal of finding bugs *in the changes under
>>>>> test*. In addition to using a lot of technical resources,
>>>>> it also prevents good changes from making their way into the
>>>>> project and, not unimportantly, saps the will to live of 
>>>>> its victims. The cost of the process is overwhelmingly
>>>>> greater than its benefits. The gate process as it stands is
>>>>> a significant net negative to the project.
>>>>> 
>>>>> Does this mean that it is worthless to run these tests?
>>>>> Absolutely not! These tests are vital to highlight a severe
>>>>> quality deficiency in OpenStack. Not addressing this is,
>>>>> imho, an existential risk to the project. However, the
>>>>> current approach is to pick contributors from the community
>>>>> at random and hold them personally responsible for project 
>>>>> bugs selected at random. Not only has this approach failed,
>>>>> it is impractical, unreasonable, and poisonous to the
>>>>> community at large. It is also unrelated to the purpose of
>>>>> gate testing, which is to find bugs *in the changes under
>>>>> test*.
>>>>> 
>>>>> I would like to make the radical proposal that we stop
>>>>> gating on CI failures. We will continue to run them on
>>>>> every change, but only after the change has been
>>>>> successfully merged.
>>>>> 
>>>>> Benefits:
>>>>> * Without rechecks, the gate will use 8 times fewer
>>>>>   resources.
>>>>> * Log analysis is still available to indicate the emergence
>>>>>   of races.
>>>>> * Fixes can be merged quicker.
>>>>> * Vastly less developer time spent monitoring gate failures.
>>>>> 
>>>>> Costs:
>>>>> * A rare class of merge bug will make it into master.
>>>>> 
>>>>> Note that the benefits above will also offset the cost of
>>>>> resolving this rare class of merge bug.
>>>>> 
>>>>> Of course, we still have the problem of finding resources
>>>>> to monitor and fix CI failures. An additional benefit of
>>>>> not gating on CI will be that we can no longer pretend that
>>>>> picking developers for project-affecting bugs by lottery is
>>>>> likely to achieve results. As a project we need to 
>>>>> understand the importance of CI failures. We need a proper
>>>>> negotiation with contributors to staff a team dedicated to
>>>>> the problem. We can then use the review process to ensure
>>>>> that the right people have an incentive to prioritise bug
>>>>> fixes.
>>>>> 
>>>>> Matt
>>>> 
>>>> _______________________________________________ OpenStack-dev
>>>> mailing list [email protected] 
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBCgAGBQJT1gvRAAoJEC5aWaUY1u57l4kH/0y5AIIq2VsKPr4TCfh//IsY
aJHgSdwjHIhfILTOd+jPCsAb8JgFi28fb0Txqwu/LBi0LknHFMHwFv8qd6kPU2wk
ou+/BhZSbUfVXTNeig5hTvU01QLRUxPvpUl2hynDLIu7Z9nkvrXNDoVH9j1tDgsf
sBMACaRcEMZnxGE1fosu0TxFVfuYtSnwj8y6RQve28ZplNUFMO+YthY94sqjWUsz
qTaqBrO0YmE5sTAOn/DQ8k2QdNHoJEcDZ5n1llEYoLtUMB8qn21GDr8hHi2ppV1/
7scKT3H98sTHIxGUegd13ORC9rwH5U0chy6AjpdjbDgqDk3Rm2K3tooF4Z0l148=
=nHC8
-----END PGP SIGNATURE-----
