-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 28/07/14 08:52, Angus Lees wrote:
> On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
>> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>>> Thanks Matthew for the analysis.
>>>>
>>>> I think you missed something though.
>>>>
>>>> Right now the frustration is that unrelated intermittent bugs
>>>> stop your presumably good change from getting in.
>>>>
>>>> Without gating, the result would be that even more bugs, many
>>>> of them not intermittent at all, would get in. Right now, the
>>>> one random developer who has to hunt down the rechecks and do
>>>> them is inconvenienced. But without a gate, _every single_
>>>> developer will be inconvenienced until the fix is merged.
>>>>
>>>> The false negative rate is _way_ too high. Nobody would
>>>> disagree there. However, adding more false negatives and
>>>> allowing more people to ignore the ones we already have,
>>>> seems like it would have the opposite effect: Now instead of
>>>> annoying the people who hit the random intermittent bugs,
>>>> we'll be annoying _everybody_ as they hit the
>>>> non-intermittent ones.
>>>
>>> +10
>>
>> Right, but perhaps there is a middle ground. We must not allow
>> changes in that can't pass through the gate, but we can separate
>> the problems of constant rechecks using too many resources, and
>> of constant rechecks causing developer pain. If failures were
>> deterministic we would skip the failing tests until they were
>> fixed. Unfortunately many of the common failures can blow up any
>> test, or even the whole process. Following on what Sam said, what
>> if we automatically reran jobs that failed in a known way, and
>> disallowed "recheck/reverify no bug"? Developers would then have
>> to track down what bug caused a failure or file a new one. But
>> they would have to do so much less frequently, and as more
>> common failures were catalogued it would become less and less
>> frequent.
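To make the "automatically rerun jobs that failed in a known way" idea concrete, here is a minimal sketch of what such a policy might look like. It is only an illustration: the catalogue entries, the bug ids, the `max_reruns` limit and the action strings are all hypothetical, and nothing here reflects how the actual gate tooling is implemented.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KnownFailure:
    bug_id: str     # hypothetical tracker id, for illustration only
    signature: str  # regular expression matched against the job log


# Hypothetical catalogue of already-triaged intermittent failures.
KNOWN_FAILURES = [
    KnownFailure("bug-0001",
                 r"Lock wait timeout exceeded; try restarting transaction"),
    KnownFailure("bug-0002", r"Connection reset by peer"),
]


def classify(log_text: str) -> Optional[KnownFailure]:
    """Return the first catalogued failure whose signature appears in the log."""
    for failure in KNOWN_FAILURES:
        if re.search(failure.signature, log_text):
            return failure
    return None


def next_action(log_text: str, reruns_so_far: int, max_reruns: int = 3) -> str:
    """Decide what to do with a failed job.

    Known failures are rerun automatically (up to max_reruns); anything
    else is handed back to the developer, who must name or file a bug,
    since a bare "recheck no bug" is not offered as an option.
    """
    match = classify(log_text)
    if match is None:
        return "needs-triage"
    if reruns_so_far >= max_reruns:
        return f"give-up:{match.bug_id}"
    return f"auto-recheck:{match.bug_id}"
```

With this policy a developer is only involved when a failure matches nothing in the catalogue, which is the "much less frequently" case described above. The rerun limit is an added assumption to keep a persistent failure from looping forever.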
>>
>> Some might (reasonably) argue that this would be a bad thing
>> because it would reduce the incentive for people to fix bugs if
>> there were less pain being inflicted. But given how hard it is to
>> track down these race bugs, and that we as a community have no
>> way to force time to be spent on them, and that it does not
>> appear that these bugs are causing real systems to fall down
>> (only our gating process), perhaps something different should be
>> considered?
>
> So to pick an example dear to my heart, I've been working on
> removing these gate failures:
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
>
> .. caused by a bad interaction between eventlet and our default
> choice of mysql driver. It would also affect any real world
> deployment using mysql.
>
> The problem has been identified and the fix proposed for almost a
> month now, but actually fixing the gate jobs is still nowhere in
> sight. The fix is (pretty much) as easy as a pip install and a
> slightly modified database connection string.
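For readers unfamiliar with what a "slightly modified database connection string" means in practice: SQLAlchemy-style URLs take the form `dialect+driver://...`, so changing drivers is a change to the scheme only. The sketch below shows that rewrite. The helper name and the example credentials are hypothetical, and `mysqlconnector` is used purely as an illustrative driver name; the quoted message does not say which driver the fix uses.

```python
def switch_mysql_driver(url: str, driver: str) -> str:
    """Rewrite a SQLAlchemy-style MySQL URL to name an explicit driver.

    'mysql://...' or 'mysql+old://...' becomes 'mysql+<driver>://...'.
    URLs for other databases are returned unchanged.
    """
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    dialect = scheme.split("+", 1)[0]
    if dialect != "mysql":
        return url
    return f"{dialect}+{driver}://{rest}"


# Hypothetical connection string, for illustration only.
old = "mysql://nova:secret@127.0.0.1/nova?charset=utf8"
print(switch_mysql_driver(old, "mysqlconnector"))
```

The chosen driver package would also have to be installed (the "pip install" half of the fix) before the new URL can be used.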

[And to hijack the thread even more] The fix Angus has kindly
referred to is:

- - spec: https://review.openstack.org/#/c/108355/
- - devstack: https://review.openstack.org/#/c/105209/ (plus several
  tiny fixes in multiple projects to make sure the patch succeeds in
  db migration).

> I look forward to a discussion of the meta-issues surrounding this,
> but it is not because no-one tracked down or fixed the bug :(
>
> - Gus
>
>> -David
>>
>>> Best, -jay
>>>
>>>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07
>>>> -0700:
>>>>> On Friday evening I had a dependent series of 5 changes all
>>>>> with approval waiting to be merged. These were all refactor
>>>>> changes in the VMware driver. The changes were:
>>>>>
>>>>> * VMware: DatastorePath join() and __eq__()
>>>>>   https://review.openstack.org/#/c/103949/
>>>>>
>>>>> * VMware: use datastore classes
>>>>>   get_allowed_datastores/_sub_folder
>>>>>   https://review.openstack.org/#/c/103950/
>>>>>
>>>>> * VMware: use datastore classes in file_move/delete/exists,
>>>>>   mkdir https://review.openstack.org/#/c/103951/
>>>>>
>>>>> * VMware: Trivial indentation cleanups in vmops
>>>>>   https://review.openstack.org/#/c/104149/
>>>>>
>>>>> * VMware: Convert vmops to use instance as an object
>>>>>   https://review.openstack.org/#/c/104144/
>>>>>
>>>>> The last change merged this morning.
>>>>>
>>>>> In order to merge these changes, over the weekend I
>>>>> manually submitted:
>>>>>
>>>>> * 35 rechecks due to false negatives, an average of 7 per
>>>>>   change
>>>>> * 19 resubmissions after a change passed, but its
>>>>>   dependency did not
>>>>>
>>>>> Other interesting numbers:
>>>>>
>>>>> * 16 unique bugs
>>>>> * An 87% false negative rate
>>>>> * 0 bugs found in the change under test
>>>>>
>>>>> Because we don't fail fast, that is an average of at least
>>>>> 7.3 hours in the gate. Much more in fact, because some runs
>>>>> fail on the second pass, not the first.
>>>>> Because we don't resubmit automatically, that is only if a
>>>>> developer is actively monitoring the process continuously,
>>>>> and resubmits immediately on failure. In practise this is
>>>>> much longer, because sometimes we have to sleep.
>>>>>
>>>>> All of the above numbers are counted from the change
>>>>> receiving an approval +2 until final merging. There were
>>>>> far more failures than this during the approval process.
>>>>>
>>>>> Why do we test individual changes in the gate? The purpose
>>>>> is to find errors *in the change under test*. By the above
>>>>> numbers, it has failed to achieve this at least 16 times
>>>>> previously.
>>>>>
>>>>> Probability of finding a bug in the change under test: Small
>>>>> Cost of testing: High
>>>>> Opportunity cost of slowing development: High
>>>>>
>>>>> and for comparison:
>>>>>
>>>>> Cost of reverting rare false positives: Small
>>>>>
>>>>> The current process expends a lot of resources, and does
>>>>> not achieve its goal of finding bugs *in the changes under
>>>>> test*. In addition to using a lot of technical resources,
>>>>> it also prevents good change from making its way into the
>>>>> project and, not unimportantly, saps the will to live of
>>>>> its victims. The cost of the process is overwhelmingly
>>>>> greater than its benefits. The gate process as it stands is
>>>>> a significant net negative to the project.
>>>>>
>>>>> Does this mean that it is worthless to run these tests?
>>>>> Absolutely not! These tests are vital to highlight a severe
>>>>> quality deficiency in OpenStack. Not addressing this is,
>>>>> imho, an existential risk to the project. However, the
>>>>> current approach is to pick contributors from the community
>>>>> at random and hold them personally responsible for project
>>>>> bugs selected at random. Not only has this approach failed,
>>>>> it is impractical, unreasonable, and poisonous to the
>>>>> community at large.
>>>>> It is also unrelated to the purpose of gate testing, which
>>>>> is to find bugs *in the changes under test*.
>>>>>
>>>>> I would like to make the radical proposal that we stop
>>>>> gating on CI failures. We will continue to run them on
>>>>> every change, but only after the change has been
>>>>> successfully merged.
>>>>>
>>>>> Benefits:
>>>>> * Without rechecks, the gate will use 8 times fewer
>>>>>   resources.
>>>>> * Log analysis is still available to indicate the
>>>>>   emergence of races.
>>>>> * Fixes can be merged quicker.
>>>>> * Vastly less developer time spent monitoring gate
>>>>>   failures.
>>>>>
>>>>> Costs:
>>>>> * A rare class of merge bug will make it into master.
>>>>>
>>>>> Note that the benefits above will also offset the cost of
>>>>> resolving this rare class of merge bug.
>>>>>
>>>>> Of course, we still have the problem of finding resources
>>>>> to monitor and fix CI failures. An additional benefit of
>>>>> not gating on CI will be that we can no longer pretend that
>>>>> picking developers for project-affecting bugs by lottery is
>>>>> likely to achieve results. As a project we need to
>>>>> understand the importance of CI failures. We need a proper
>>>>> negotiation with contributors to staff a team dedicated to
>>>>> the problem. We can then use the review process to ensure
>>>>> that the right people have an incentive to prioritise bug
>>>>> fixes.
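The figures in the quoted message (35 rechecks over 5 changes, an 87% false negative rate, 8 times fewer resources) fit one simple reading, sketched below. The assumption, which the original message does not state, is that each change needed exactly one passing gate run on top of its rechecks. The function name is made up for this example.

```python
def gate_stats(changes: int, rechecks: int) -> dict:
    """Derive the headline gate numbers from two counts.

    Assumption (not stated in the original message): every change
    eventually needed exactly one passing run, so the number of
    passing runs equals the number of changes.
    """
    passing_runs = changes
    total_runs = rechecks + passing_runs
    return {
        # average failed runs per change
        "rechecks_per_change": rechecks / changes,
        # share of all gate runs that failed on an unrelated bug
        "false_negative_rate": rechecks / total_runs,
        # total runs relative to the runs needed with no rechecks
        "resource_multiplier": total_runs / passing_runs,
    }


stats = gate_stats(changes=5, rechecks=35)
print(stats)
```

Under that assumption the 5 changes took 40 gate runs in total, giving 7 rechecks per change, a false negative rate of 35/40 = 87.5% (quoted as 87%), and 40/5 = 8 times the resources of a recheck-free gate. The 19 dependency resubmissions are left out of this calculation.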
>>>>>
>>>>> Matt
>>>>
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev@lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBCgAGBQJT1gvRAAoJEC5aWaUY1u57l4kH/0y5AIIq2VsKPr4TCfh//IsY
aJHgSdwjHIhfILTOd+jPCsAb8JgFi28fb0Txqwu/LBi0LknHFMHwFv8qd6kPU2wk
ou+/BhZSbUfVXTNeig5hTvU01QLRUxPvpUl2hynDLIu7Z9nkvrXNDoVH9j1tDgsf
sBMACaRcEMZnxGE1fosu0TxFVfuYtSnwj8y6RQve28ZplNUFMO+YthY94sqjWUsz
qTaqBrO0YmE5sTAOn/DQ8k2QdNHoJEcDZ5n1llEYoLtUMB8qn21GDr8hHi2ppV1/
7scKT3H98sTHIxGUegd13ORC9rwH5U0chy6AjpdjbDgqDk3Rm2K3tooF4Z0l148=
=nHC8
-----END PGP SIGNATURE-----