OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing; however, more subtle, non-deterministic bugs can still get through, and these are the bugs currently plaguing developers with seemingly random test failures.

Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.

The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing.

The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.

[1] https://dague.net/2014/07/22/openstack-failures/

==Things we're already doing==

The elastic-recheck tool [2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason, and our overall progress on finding fingerprints for random failures.

[2] http://status.openstack.org/elastic-recheck/

We added a feature to Zuul that lets us manually "promote" changes to the top of the Gate pipeline. When the QA team identifies a change that fixes a bug affecting overall gate stability, we can move that change to the top of the queue so that it merges more quickly.

We added the clean check facility in reaction to the January gate breakdown. While it does mean that any individual patch may see more test runs, it has largely kept the gate queue to a manageable number of hours, instead of letting it regularly grow to more than a work day in length. It also means that a developer can Approve a change before its test results have returned without ruining things for everyone else if the change turns out to have a bug that the tests can catch.

==Future changes==

===Communication===

We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.

We have started a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.

There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.

===Fixing Faster===

We introduce bugs to OpenStack at some constant rate, and they pile up over time. Our systems currently treat all changes as equally risky and equally important to the health of the system, which makes landing code changes that fix key bugs slow when we're at a high reset rate.
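
Since both the triage work above and the automatic promotion described below lean on elastic-recheck fingerprints, it may help to sketch what matching a failed run against one amounts to. This is only an illustration -- the bug number, pattern, and helper function below are made up, and real fingerprints are Elasticsearch queries kept in the elastic-recheck repository:

    import re

    # Hypothetical, much-simplified fingerprint table.  Real fingerprints
    # are Elasticsearch queries, each tied to a Launchpad bug; the bug
    # number and pattern here are invented for illustration only.
    FINGERPRINTS = {
        1234567: re.compile(r'Timed out waiting for .* to become ACTIVE'),
    }

    def classify_failure(console_log):
        """Return the bug numbers whose fingerprints match this log."""
        return [bug for bug, pattern in FINGERPRINTS.items()
                if pattern.search(console_log)]

Runs that match a fingerprint feed the per-bug failure statistics; runs that match nothing are counted among the unclassified errors mentioned above.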

Today we have a manual process of promoting changes to work around this slowness, but it is quite costly in people time and requires getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3]; it was no small number of fixes to get us back to a reasonably passing gate. We think that optimizing this system will help us land fixes to critical bugs faster.

[3] https://etherpad.openstack.org/p/gatetriage-june2014

The basic idea is to use the data from elastic-recheck to identify that a patch fixes a critical gate-related bug. When one of these is found in the queues it will be given higher priority, including bubbling up to the top of the gate queue automatically. The manual promote process should no longer be needed; instead, changes that fix elastic-recheck-tracked issues will be promoted automatically.

At the same time we'll also promote review of fixes for critical gate bugs by making them visible in a number of different channels (such as the elastic-recheck pages, review days, and the Gerrit dashboards). The idea here, again, is to make the reviews that fix key bugs pop to the top of everyone's views.

===Testing more tactically===

One of the challenges that exists today is that we've got basically two levels of testing in most of OpenStack: unit tests, and running a whole OpenStack cloud. Over time we've focused on adding more and more configurations and tests to the latter, but as we've seen, when things fail in a whole OpenStack cloud, getting to the root cause is often quite hard. So hard, in fact, that most people throw up their hands and just run 'recheck'. If a test run fails and no one looks at why, does it provide any value?

We need to get to a balance where we are testing that OpenStack works as a whole in some configuration, but as we've seen, even our best and brightest can't seem to make OpenStack reliably boot a compute instance with working networking 100% of the time if we happen to be running more than one API request at once. Getting there is a multi-part process:

* Reduce the gating configurations down to some gold standard configuration(s). This will be a small number of configurations that we all agree everything will gate on. This means things like postgresql, cells, and different environments will be dropped from the gate as we know it.

* Put the burden for a bunch of these tests back on the projects as "functional" tests: basically a custom devstack environment that a project can create with the set of services it minimally needs to do its job. These functional tests will live in the project tree, not in Tempest, so they can be landed atomically as part of the project's normal development process (a rough sketch of what such a test might look like follows this list).

* For all non-gold-standard configurations, we'll dedicate a part of our infrastructure to running them in a continuous background loop, as well as making these configs available as experimental jobs. The idea here is that we'll actually be able to provide more configurations operating in a more traditional (post-merge) CI context. People who are interested in keeping these bits functional can monitor those jobs and help with fixes when needed. The experimental jobs mean that if developers are concerned about the effect of a particular change on one of these configs, it's easy to request a pre-merge test run. In the near term we might imagine this would allow things like ceph, mongodb, docker, and possibly very new libvirt to be validated in some way upstream.

* Provide easy-to-view dashboards of these jobs, as well as a policy that if a job is failing for more than some period of time, it's removed from the system. We want to provide whatever feedback we can to engaged parties, but people do need to realize that engagement is key. The biggest part of putting tests into OpenStack isn't landing the tests, but dealing with their failures.

* Encourage projects to specifically land interface tests in other projects when they depend on certain behavior.
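
To make the functional-test idea above a bit more concrete, here is a minimal sketch of what an in-tree functional test might look like. The module names, client and database helpers, and timings are hypothetical rather than an existing test; the point is only that the project owns the test and can mix API calls with white/grey box checks of its own database and queues:

    import time
    import unittest

    # Hypothetical in-tree helpers; a real project would provide its own.
    from myproject.tests.functional import api_client
    from myproject.tests.functional import db_utils


    class BootServerFunctionalTest(unittest.TestCase):
        """Runs against the handful of services a minimal devstack provides."""

        def test_boot_reaches_active(self):
            client = api_client.Client()
            server = client.create_server(name='functional-smoke')

            # Poll the public API the way a user (or Tempest) would ...
            for _ in range(60):
                if client.get_server(server['id'])['status'] == 'ACTIVE':
                    return
                time.sleep(1)

            # ... but on failure, look directly at the project's own DB,
            # something a black-box Tempest run cannot do, so the failure
            # report says where the build actually stalled.
            self.fail('instance %s stuck in task_state %s'
                      % (server['id'],
                         db_utils.instance_task_state(server['id'])))

Because such a test lives in the project's own tree, it can be landed, skipped, or fixed atomically with the project's code, which is exactly the control the bullets above are trying to restore.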

Let's imagine an example of how this works in the real world:

* The heat-slow job is deleted.

* The Heat team creates a specific functional job which tests some of Heat's deeper functionality. All the tests live in Heat, and because of this they can include white/grey box testing of the DB and queues while things are progressing.

* Nova lands a change which neither Tempest nor our configs exercise, but which breaks Heat.

* The Heat project can now decide whether it's more important to keep the test in place (preventing them from landing code) or to skip it to get back to work.

* The Heat team then works on the right fix for Nova, or communicates with the Nova team about the issue at hand. The fix to Nova should *also* include tests which lock down that interface so that Nova won't break it again in the future (the Ironic team did this with their test_ironic_contract patch). These tests could be unit tests, if the behavior is testable that way, or functional tests in the Nova tree.

* The Heat team is then back in business.

This approach gives each project more control over when it is blocked. Tempest remains a final integration test to ensure that the basics of the whole stack work together, but each project also has a vertical testing stack which is specific to it.

==Final thoughts==

The current rate of test failures and subsequent rechecks is not sustainable in the long term. It's not good for contributors, reviewers, or overall project quality. While these bugs do need to be addressed, it's unlikely that the current process will cause that to happen. Instead, we want to push more substantial testing into the projects themselves with functional and interface testing, and depend less on devstack-gate integration tests to catch all bugs. This should help us catch bugs closer to the source and in an environment where debugging is easier.

We also want to reduce the scope of the devstack-gate tests to a gold standard configuration, while running tests of other configurations in a traditional CI process so that people interested in those configurations can focus on ensuring they work.

Thanks,

Jim and Sean