On Jul 28, 2014, at 2:52 AM, Angus Lees <g...@inodes.org> wrote:
> On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
>> On 07/21/2014 04:13 PM, Jay Pipes wrote:
>>> On 07/21/2014 02:03 PM, Clint Byrum wrote:
>>>> Thanks Matthew for the analysis.
>>>>
>>>> I think you missed something though.
>>>>
>>>> Right now the frustration is that unrelated intermittent bugs stop your
>>>> presumably good change from getting in.
>>>>
>>>> Without gating, the result would be that even more bugs, many of them not
>>>> intermittent at all, would get in. Right now, the one random developer
>>>> who has to hunt down the rechecks and do them is inconvenienced. But
>>>> without a gate, _every single_ developer will be inconvenienced until
>>>> the fix is merged.
>>>>
>>>> The false negative rate is _way_ too high. Nobody would disagree there.
>>>> However, adding more false negatives and allowing more people to ignore
>>>> the ones we already have, seems like it would have the opposite effect:
>>>> Now instead of annoying the people who hit the random intermittent bugs,
>>>> we'll be annoying _everybody_ as they hit the non-intermittent ones.
>>>
>>> +10
>>
>> Right, but perhaps there is a middle ground. We must not allow changes
>> in that can't pass through the gate, but we can separate the problems
>> of constant rechecks using too many resources, and of constant rechecks
>> causing developer pain. If failures were deterministic we would skip the
>> failing tests until they were fixed. Unfortunately many of the common
>> failures can blow up any test, or even the whole process. Following on
>> what Sam said, what if we automatically reran jobs that failed in a
>> known way, and disallowed "recheck/reverify no bug"? Developers would
>> then have to track down what bug caused a failure or file a new one. But
>> they would have to do so much less frequently, and as more common
>> failures were catalogued it would become less and less frequent.
>>
>> Some might (reasonably) argue that this would be a bad thing because it
>> would reduce the incentive for people to fix bugs if there were less
>> pain being inflicted. But given how hard it is to track down these race
>> bugs, and that we as a community have no way to force time to be spent
>> on them, and that it does not appear that these bugs are causing real
>> systems to fall down (only our gating process), perhaps something
>> different should be considered?
>
> So to pick an example dear to my heart, I've been working on removing these
> gate failures:
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
>
> .. caused by a bad interaction between eventlet and our default choice of
> mysql driver. It would also affect any real world deployment using mysql.
>
> The problem has been identified and the fix proposed for almost a month now, but
> actually fixing the gate jobs is still no-where in sight. The fix is (pretty
> much) as easy as a pip install and a slightly modified database connection
> string.
> I look forward to a discussion of the meta-issues surrounding this, but it is
> not because no-one tracked down or fixed the bug :(

I believe the main blocking issue right now is that Oracle doesn’t upload that library to PyPI, and so our build-chain won’t be able to download it as it is currently configured. I think the last I saw someone was going to talk to Oracle about uploading the source. Have we heard back?

Doug

>
> - Gus
>
>> -David
>>
>>> Best,
>>> -jay
>>>
>>>> Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
>>>>> On Friday evening I had a dependent series of 5 changes all with
>>>>> approval waiting to be merged. These were all refactor changes in the
>>>>> VMware driver. The changes were:
>>>>>
>>>>> * VMware: DatastorePath join() and __eq__()
>>>>> https://review.openstack.org/#/c/103949/
>>>>>
>>>>> * VMware: use datastore classes get_allowed_datastores/_sub_folder
>>>>> https://review.openstack.org/#/c/103950/
>>>>>
>>>>> * VMware: use datastore classes in file_move/delete/exists, mkdir
>>>>> https://review.openstack.org/#/c/103951/
>>>>>
>>>>> * VMware: Trivial indentation cleanups in vmops
>>>>> https://review.openstack.org/#/c/104149/
>>>>>
>>>>> * VMware: Convert vmops to use instance as an object
>>>>> https://review.openstack.org/#/c/104144/
>>>>>
>>>>> The last change merged this morning.
>>>>>
>>>>> In order to merge these changes, over the weekend I manually submitted:
>>>>>
>>>>> * 35 rechecks due to false negatives, an average of 7 per change
>>>>> * 19 resubmissions after a change passed, but its dependency did not
>>>>>
>>>>> Other interesting numbers:
>>>>>
>>>>> * 16 unique bugs
>>>>> * An 87% false negative rate
>>>>> * 0 bugs found in the change under test
>>>>>
>>>>> Because we don't fail fast, that is an average of at least 7.3 hours in
>>>>> the gate. Much more in fact, because some runs fail on the second pass,
>>>>> not the first. Because we don't resubmit automatically, that is only if
>>>>> a developer is actively monitoring the process continuously, and
>>>>> resubmits immediately on failure.
>>>>> In practise this is much longer,
>>>>> because sometimes we have to sleep.
>>>>>
>>>>> All of the above numbers are counted from the change receiving an
>>>>> approval +2 until final merging. There were far more failures than this
>>>>> during the approval process.
>>>>>
>>>>> Why do we test individual changes in the gate? The purpose is to find
>>>>> errors *in the change under test*. By the above numbers, it has failed
>>>>> to achieve this at least 16 times previously.
>>>>>
>>>>> Probability of finding a bug in the change under test: Small
>>>>> Cost of testing: High
>>>>> Opportunity cost of slowing development: High
>>>>>
>>>>> and for comparison:
>>>>>
>>>>> Cost of reverting rare false positives: Small
>>>>>
>>>>> The current process expends a lot of resources, and does not achieve its
>>>>> goal of finding bugs *in the changes under test*. In addition to using a
>>>>> lot of technical resources, it also prevents good change from making its
>>>>> way into the project and, not unimportantly, saps the will to live of
>>>>> its victims. The cost of the process is overwhelmingly greater than its
>>>>> benefits. The gate process as it stands is a significant net negative to
>>>>> the project.
>>>>>
>>>>> Does this mean that it is worthless to run these tests? Absolutely not!
>>>>> These tests are vital to highlight a severe quality deficiency in
>>>>> OpenStack. Not addressing this is, imho, an existential risk to the
>>>>> project. However, the current approach is to pick contributors from the
>>>>> community at random and hold them personally responsible for project
>>>>> bugs selected at random. Not only has this approach failed, it is
>>>>> impractical, unreasonable, and poisonous to the community at large. It
>>>>> is also unrelated to the purpose of gate testing, which is to find bugs
>>>>> *in the changes under test*.
>>>>>
>>>>> I would like to make the radical proposal that we stop gating on CI
>>>>> failures.
>>>>> We will continue to run them on every change, but only after
>>>>> the change has been successfully merged.
>>>>>
>>>>> Benefits:
>>>>> * Without rechecks, the gate will use 8 times fewer resources.
>>>>> * Log analysis is still available to indicate the emergence of races.
>>>>> * Fixes can be merged quicker.
>>>>> * Vastly less developer time spent monitoring gate failures.
>>>>>
>>>>> Costs:
>>>>> * A rare class of merge bug will make it into master.
>>>>>
>>>>> Note that the benefits above will also offset the cost of resolving this
>>>>> rare class of merge bug.
>>>>>
>>>>> Of course, we still have the problem of finding resources to monitor and
>>>>> fix CI failures. An additional benefit of not gating on CI will be that
>>>>> we can no longer pretend that picking developers for project-affecting
>>>>> bugs by lottery is likely to achieve results. As a project we need to
>>>>> understand the importance of CI failures. We need a proper negotiation
>>>>> with contributors to staff a team dedicated to the problem. We can then
>>>>> use the review process to ensure that the right people have an incentive
>>>>> to prioritise bug fixes.
>>>>>
>>>>> Matt
>>>>
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev@lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> --
> - Gus
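
The figures quoted from Matthew's message (7 rechecks per change, an 87% false negative rate, 8 times fewer resources) can be cross-checked with a little arithmetic. This is only a rough sketch: it assumes each of the 5 changes needed exactly one passing gate run on top of the 35 failed runs, and it leaves out the 19 dependency resubmissions. Neither assumption is stated in the thread, and the variable names are illustrative.

```python
# Figures quoted in the thread.
changes = 5
rechecks = 35  # gate runs that failed due to false negatives

# Assumption (not stated in the thread): each change also needed exactly one
# passing run, and the 19 dependency resubmissions are not counted here.
passing_runs = changes
total_runs = rechecks + passing_runs

rechecks_per_change = rechecks / changes
false_negative_rate = rechecks / total_runs
resource_multiplier = total_runs / passing_runs

print(f"rechecks per change: {rechecks_per_change:.0f}")
print(f"false negative rate: {false_negative_rate:.1%}")
print(f"gate runs relative to a recheck-free gate: {resource_multiplier:.0f}x")
```

Under those assumptions the numbers line up with the message: 35 / 5 = 7 rechecks per change, 35 / 40 = 87.5% of runs failing for unrelated reasons, and 40 / 5 = 8 times as many runs as a gate with no rechecks.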