On 07/28/2014 02:32 PM, Angus Lees wrote:
On Mon, 28 Jul 2014 10:22:07 AM Doug Hellmann wrote:
On Jul 28, 2014, at 2:52 AM, Angus Lees <g...@inodes.org> wrote:
On Mon, 21 Jul 2014 04:39:49 PM David Kranz wrote:
On 07/21/2014 04:13 PM, Jay Pipes wrote:
On 07/21/2014 02:03 PM, Clint Byrum wrote:
Thanks Matthew for the analysis.
I think you missed something though.
Right now the frustration is that unrelated intermittent bugs stop your
presumably good change from getting in.
Without gating, the result would be that even more bugs, many of them not
intermittent at all, would get in. Right now, the one random developer who
has to hunt down the rechecks and do them is inconvenienced. But without a
gate, _every single_ developer will be inconvenienced until the fix is
merged.
The false negative rate is _way_ too high. Nobody would disagree there.
However, adding more false negatives and allowing more people to ignore
the ones we already have seems like it would have the opposite effect:
now instead of annoying the people who hit the random intermittent bugs,
we'll be annoying _everybody_ as they hit the non-intermittent ones.
+10
Right, but perhaps there is a middle ground. We must not allow changes
in that can't pass through the gate, but we can separate the problems
of constant rechecks using too many resources, and of constant rechecks
causing developer pain. If failures were deterministic we would skip the
failing tests until they were fixed. Unfortunately many of the common
failures can blow up any test, or even the whole process. Following on
what Sam said, what if we automatically reran jobs that failed in a
known way, and disallowed "recheck/reverify no bug"? Developers would
then have to track down what bug caused a failure or file a new one. But
they would have to do so much less frequently, and as more common
failures were catalogued it would become less and less frequent.
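Concretely, the sort of automation I have in mind might look something like
the sketch below. This is only an illustration: the second signature, the
bug numbers, and the requeue/notify hooks are made up, not an existing tool.

import re

# Hand-maintained catalogue mapping a console-log signature to the bug it
# indicates. The first entry is the MySQL lock-timeout failure mentioned
# elsewhere in this thread; the second is a made-up example.
KNOWN_FAILURES = {
    r"Lock wait timeout exceeded; try restarting transaction": "lp:9000001",
    r"failed to reach ACTIVE status within the required time": "lp:9000002",
}

def classify_failure(console_log):
    """Return the known bug matching this failed run, or None if unknown."""
    for pattern, bug in KNOWN_FAILURES.items():
        if re.search(pattern, console_log):
            return bug
    return None

def handle_failed_job(change_id, console_log, requeue, notify_owner):
    bug = classify_failure(console_log)
    if bug is not None:
        # Known intermittent failure: rerun automatically, so nobody has to
        # type "recheck bug NNNN" by hand.
        requeue(change_id, reason=bug)
    else:
        # Unrecognised failure, and "recheck no bug" is not accepted: the
        # owner must identify an existing bug or file a new one first.
        notify_owner(change_id,
                     "Gate failure did not match any known bug; please "
                     "identify or file a bug before rechecking.")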
Some might (reasonably) argue that this would be a bad thing because it
would reduce the incentive for people to fix bugs if there were less
pain being inflicted. But given how hard it is to track down these race
bugs, and that we as a community have no way to force time to be spent
on them, and that it does not appear that these bugs are causing real
systems to fall down (only our gating process), perhaps something
different should be considered?
So to pick an example dear to my heart, I've been working on removing these
gate failures:
http://logstash.openstack.org/#eyJzZWFyY2giOiJcIkxvY2sgd2FpdCB0aW1lb3V0IGV4Y2VlZGVkOyB0cnkgcmVzdGFydGluZyB0cmFuc2FjdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDA2NTI3OTA3NzkzfQ==
.. caused by a bad interaction between eventlet and our default choice of
mysql driver. It would also affect any real-world deployment using mysql.
The problem has been identified and the fix proposed for almost a month
now, but actually fixing the gate jobs is still nowhere in sight. The fix
is (pretty much) as easy as a pip install and a slightly modified database
connection string.
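In SQLAlchemy terms the change is roughly the following (credentials and
host are placeholders, and the exact gate configuration differs):

import sqlalchemy

# Before: a plain mysql:// URL selects the MySQLdb C driver. Its blocking C
# calls never yield to eventlet, so a greenthread waiting on a row lock can
# starve the greenthread that holds it -- hence "Lock wait timeout exceeded".
engine = sqlalchemy.create_engine("mysql://nova:secret@127.0.0.1/nova")

# After: pip install mysql-connector-python, then select the pure-Python
# driver in the URL so eventlet can switch greenthreads during DB waits.
engine = sqlalchemy.create_engine(
    "mysql+mysqlconnector://nova:secret@127.0.0.1/nova")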
I look forward to a discussion of the meta-issues surrounding this, but it
is not because no-one tracked down or fixed the bug :(
I believe the main blocking issue right now is that Oracle doesn’t upload
that library to PyPI, and so our build-chain won’t be able to download it
as it is currently configured. I think the last I saw, someone was going
to talk to Oracle about uploading the source. Have we heard back?
Yes, positive conversations are underway and we'll get there eventually. My
point was also about apparent priorities, however. If addressing gate
failures was *urgent*, we wouldn't wait for such a conversation to complete
before making our own workarounds(*). I don't feel we (as a group) are
sufficiently terrified of false negatives.
(*) Indeed, the affected devstack gate tests install mysqlconnector via
debs/rpms. I think only the oslo.db "opportunistic tests" talk to mysql via
pip-installed packages, and these don't also use eventlet.
Honestly, I think devstack installing it from apt/yum is fine.
Doug
- Gus
-David
Best,
-jay
Excerpts from Matthew Booth's message of 2014-07-21 03:38:07 -0700:
On Friday evening I had a dependent series of 5 changes, all approved and
waiting to be merged. These were all refactoring changes in the VMware
driver. The changes were:
* VMware: DatastorePath join() and __eq__()
https://review.openstack.org/#/c/103949/
* VMware: use datastore classes get_allowed_datastores/_sub_folder
https://review.openstack.org/#/c/103950/
* VMware: use datastore classes in file_move/delete/exists, mkdir
https://review.openstack.org/#/c/103951/
* VMware: Trivial indentation cleanups in vmops
https://review.openstack.org/#/c/104149/
* VMware: Convert vmops to use instance as an object
https://review.openstack.org/#/c/104144/
The last change merged this morning.
In order to merge these changes, over the weekend I manually
submitted:
* 35 rechecks due to false negatives, an average of 7 per change
* 19 resubmissions after a change passed, but its dependency did not
Other interesting numbers:
* 16 unique bugs
* An 87% false negative rate
* 0 bugs found in the change under test
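(The 87% figure presumably comes from 35 failed runs plus one eventual
passing run for each of the 5 changes: 35 failures out of 40 runs, i.e.
87.5%.)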
Because we don't fail fast, that is an average of at least 7.3 hours in
the gate. Much more in fact, because some runs fail on the second pass,
not the first. Because we don't resubmit automatically, that is only if a
developer is actively monitoring the process continuously, and resubmits
immediately on failure. In practice this is much longer, because sometimes
we have to sleep.
All of the above numbers are counted from the change receiving an approval
+2 until final merging. There were far more failures than this during the
approval process.
Why do we test individual changes in the gate? The purpose is to find
errors *in the change under test*. By the above numbers, it has failed
to achieve this at least 16 times previously.
Probability of finding a bug in the change under test: Small
Cost of testing: High
Opportunity cost of slowing development: High
and for comparison:
Cost of reverting rare false positives: Small
The current process expends a lot of resources, and does not achieve its
goal of finding bugs *in the changes under test*. In addition to using a
lot of technical resources, it also prevents good change from making its
way into the project and, not unimportantly, saps the will to live of its
victims. The cost of the process is overwhelmingly greater than its
benefits. The gate process as it stands is a significant net negative to
the project.
Does this mean that it is worthless to run these tests? Absolutely not!
These tests are vital to highlight a severe quality deficiency in
OpenStack. Not addressing this is, imho, an existential risk to the
project. However, the current approach is to pick contributors from the
community at random and hold them personally responsible for project bugs
selected at random. Not only has this approach failed, it is impractical,
unreasonable, and poisonous to the community at large. It is also
unrelated to the purpose of gate testing, which is to find bugs *in the
changes under test*.
I would like to make the radical proposal that we stop gating on CI
failures. We will continue to run them on every change, but only after
the change has been successfully merged.
Benefits:
* Without rechecks, the gate will use 8 times fewer resources.
* Log analysis is still available to indicate the emergence of races.
* Fixes can be merged quicker.
* Vastly less developer time spent monitoring gate failures.
Costs:
* A rare class of merge bug will make it into master.
Note that the benefits above will also offset the cost of resolving this
rare class of merge bug.
Of course, we still have the problem of finding resources to monitor and
fix CI failures. An additional benefit of not gating on CI will be that we
can no longer pretend that picking developers for project-affecting bugs
by lottery is likely to achieve results. As a project we need to
understand the importance of CI failures. We need a proper negotiation
with contributors to staff a team dedicated to the problem. We can then
use the review process to ensure that the right people have an incentive
to prioritise bug fixes.
Matt
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev