On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague <s...@dague.net> wrote:

> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> >> OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing, however more subtle, non-deterministic bugs can still get through, and these are the bugs that are currently plaguing developers with seemingly random test failures.
> >>
> >> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
> >>
> >> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
> >>
> >> [1] https://dague.net/2014/07/22/openstack-failures/
> >>
> >> ==Things we're already doing==
> >>
> >> The elastic-recheck tool[2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason and our overall progress on finding fingerprints for random failures.
> >>
> >> [2] http://status.openstack.org/elastic-recheck/
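
(Aside, for anyone who hasn't looked at elastic-recheck: the real tool keeps its fingerprints as Elasticsearch queries in per-bug YAML files, but the core matching idea is roughly the sketch below. The bug number and log strings are invented.)

    # Illustrative only -- not elastic-recheck's actual code. Each known bug
    # gets a "signature": strings that its failures leave in the job logs.
    KNOWN_BUG_SIGNATURES = {
        "1234567": ["Timed out waiting for a reply", "nova.compute.manager"],
    }

    def classify_failure(console_log):
        """Return the bug whose signature matches this failed run, if any."""
        for bug, patterns in KNOWN_BUG_SIGNATURES.items():
            if all(p in console_log for p in patterns):
                return bug
        return None  # unclassified: counts against our fingerprint coverage

    print(classify_failure("... Timed out waiting for a reply ... nova.compute.manager ..."))
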
> >> We added a feature to Zuul that lets us manually "promote" changes to the top of the Gate pipeline. When the QA team identifies a change that fixes a bug that is affecting overall gate stability, we can move that change to the top of the queue so that it may merge more quickly.
> >>
> >> We added the clean check facility in reaction to the January gate break down. While it does mean that any individual patch might see more tests run on it, it's now largely kept the gate queue at a countable number of hours, instead of regularly growing to more than a work day in length. It also means that a developer can Approve a code merge before tests have returned, and not ruin it for everyone else if there turned out to be a bug that the tests could catch.
> >>
> >> ==Future changes==
> >>
> >> ===Communication===
> >>
> >> We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.
> >>
> >> We have started on a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.
> >>
> >> There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.
> >>
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, which piles up over time. Our systems currently treat all changes as equally risky and important to the health of the system, which makes landing code changes to fix key bugs slow when we're at a high reset rate. We've got a manual process of promoting changes today to get around this, but that's actually quite costly in people time, and takes getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3], and it was no small number of fixes to get us back to a reasonably passing gate. We think that optimizing this system will help us land fixes to critical bugs faster.
> >>
> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> >>
> >> The basic idea is to use the data from elastic recheck to identify that a patch is fixing a critical gate related bug. When one of these is found in the queues it will be given higher priority, including bubbling up to the top of the gate queue automatically. The manual promote process should no longer be needed, and instead bugs fixing elastic recheck tracked issues will be promoted automatically.
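
(To make the intended ordering concrete, here is a rough sketch -- this is not Zuul's actual code, just the policy as I understand it, and the bug and change numbers are invented.)

    # A change that fixes a bug elastic-recheck is tracking sorts ahead of
    # everything else in the gate queue; other changes keep their order.
    TRACKED_GATE_BUGS = {"1234567", "1301234"}  # would be fed from elastic-recheck

    def prioritize(gate_queue):
        # Each entry: {"change": "<number>,<patchset>", "closes_bug": "<bug>" or None}
        return sorted(gate_queue,
                      key=lambda c: c.get("closes_bug") not in TRACKED_GATE_BUGS)

    queue = [{"change": "101,1", "closes_bug": None},
             {"change": "102,3", "closes_bug": "1234567"}]
    print(prioritize(queue))  # the gate-bug fix bubbles to the top
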
> >> At the same time we'll also promote review on critical gate bugs through making them visible in a number of different channels (like on elastic recheck pages, review day, and in the gerrit dashboards). The idea here again is to make the reviews that fix key bugs pop to the top of everyone's views.
> >>
> >> ===Testing more tactically===
> >>
> >> One of the challenges that exists today is that we've got basically 2 levels of testing in most of OpenStack: unit tests, and running a whole OpenStack cloud. Over time we've focused on adding more and more configurations and tests to the latter, but as we've seen, when things fail in a whole OpenStack cloud, getting to the root cause is often quite hard. So hard in fact that most people throw up their hands and just run 'recheck'. If a test run fails, and no one looks at why, does it provide any value?
> >>
> >> We need to get to a balance where we are testing that OpenStack works as a whole in some configuration, but as we've seen, even our best and brightest can't seem to make OpenStack reliably boot a compute that has working networking 100% the time if we happen to be running more than 1 API request at once.
> >>
> >> Getting there is a multi party process:
> >>
> >> * Reduce the gating configurations down to some gold standard configuration(s). This will be a small number of configurations that we all agree that everything will gate on. This means things like postgresql, cells, different environments will all get dropped from the gate as we know it.
> >>
> >> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.
> >>
> >> * For all non gold standard configurations, we'll dedicate a part of our infrastructure to running them in a continuous background loop, as well as making these configs available as experimental jobs. The idea here is that we'll actually be able to provide more configurations that are operating in a more traditional CI (post merge) context. People that are interested in keeping these bits functional can monitor those jobs and help with fixes when needed. The experimental jobs mean that if developers are concerned about the effect of a particular change on one of these configs, it's easy to request a pre-merge test run. In the near term we might imagine this would allow for things like ceph, mongodb, docker, and possibly very new libvirt to be validated in some way upstream.
> >>
> >> * Provide some kind of easy to view dashboards of these jobs, as well as a policy that if some job is failing for > some period of time, it's removed from the system. We want to provide whatever feedback we can to engaged parties, but people do need to realize that engagement is key. The biggest part of putting tests into OpenStack isn't landing the tests, but dealing with their failures.
> >>
> >> * Encourage projects to specifically land interface tests in other projects when they depend on certain behavior.
> >
> > So I think we (or least I do) need clarification around this item. My question is which interfaces are we depending on that need these specific types of tests? Projects shouldn't be depending on another project's unstable interfaces. If specific behavior is required for a cross-project interaction it should be part of defined stable API, hopefully the REST API, and then that behavior should be enforced for everyone not just the cross-project interaction.
> >
> > If I'm interpreting this correctly the what is actually needed here is to actually ensure that there is test coverage somewhere for the APIs that should already be tested where there is a cross-project dependency. This is actually the same thing we see all the time because there is a lack of test coverage on certain APIs that are being used. (the nova default quotas example comes to mind) I just think calling this a special class of test is a bit misleading. Since it shouldn't actually differ than any other API test. Or am I missing something?
>
> Projects are consuming the behavior of other projects far beyond just the formal REST APIs. Notifications is another great instance of that.

I think the fact that notifications aren't versioned or really 'contractual', but are being used as such, is a huge issue.
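
To make that concrete, the kind of thing I would like to see is sketched below: an explicitly versioned payload that consumers can pin to. None of this exists today -- the field names and version value are purely illustrative.

    # Sketch of a versioned, contractual notification; not what nova emits today.
    def make_instance_create_notification(instance_uuid, state):
        return {
            "event_type": "compute.instance.create.end",
            "payload_version": "1.0",  # bumped on any breaking payload change
            "payload": {"instance_uuid": instance_uuid, "state": state},
        }

    def consume(notification):
        # A consumer (heat, ceilometer, ...) pins to the major version it
        # understands instead of guessing at an undocumented payload.
        if notification["payload_version"].split(".")[0] != "1":
            raise ValueError("unsupported notification payload version")
        return notification["payload"]["instance_uuid"]

    print(consume(make_instance_create_notification("fake-uuid", "active")))
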
> This is also more of a pragmatic organic approach to figuring out the interfaces we need to lock down. When one projects breaks depending on an interface in another project, that should trigger this kind of contract growth, which hopefully formally turns into a document later for a stable interface.

This approach sounds like a recipe for us playing a never-ending game of catch-up; when issues like this have arisen in the past, we have rarely gotten around to creating a stable interface (see notifications). I would rather push on using stable APIs (and creating them as needed) instead.

> >> Let's imagine an example of how this works in the real world.
> >>
> >> * The heat-slow job is deleted.
> >>
> >> * The heat team creates a specific functional job which tests some of their deeper function in Heat, all the tests live in Heat, and because of these the tests can include white/grey box testing of the DB and queues while things are progressing.
> >>
> >> * Nova lands a change which neither Tempest or our configs exercise, but breaks Heat.
> >>
> >> * The Heat project can now decide if it's more important to keep the test in place (preventing them from landing code), or to skip it to get back to work.
> >>
> >> * The Heat team then works on the right fix for Nova, or communicates with the Nova team on the issue at hand. The fix to Nova *also* should include tests which locks down that interface so that Nova won't break them again in the future (the ironic team did this with their test_ironic_contract patch). These tests could be unit tests, if they are testable that way, or functional tests in the Nova tree.
> >
> > The one thing I want to point out here is that ironic_contract test should be an exception, I don't feel that we want that to be the norm. It's not a good example for a few reasons, mostly around the fact that ironic tree depends on the purposefully unstable nova driver api as temporary measure until the ironic driver is merged into the nova tree. The contract api tests will go away once the driver is in the nova tree. It should not be necessary for something over the REST API, since the contact should be enforced through tempest. (even under this new model, I expect this to still be true)
> >
> > There was that comment which someone (I can't remember who) brought up at the Portland summit that tempest is acting like double book accounting for the api contract, and that has been something we've seen as extremely valuable historically. Which is why I don't want to see this aspect of tempest's role in the gate altered.
>
> I've been the holder of the double book accounting pov in the past. However, after the last six months of fragility, I just don't see how that's a sustainable point of view. The QA team remains somewhat constant size, and the number of interfaces and projects grows at a good clip.
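
For what it's worth, my mental model of the kind of in-tree interface test being discussed is roughly the sketch below, reusing the nova default quotas example from earlier in the thread. Everything in it is hypothetical -- the fake API class especially; a real version would live in nova's tree and exercise a running service through its functional test harness.

    # Hypothetical "interface contract" test: nova locking down behaviour that
    # other projects consume, so a refactor can't silently change it.
    import unittest

    EXPECTED_DEFAULT_QUOTA_KEYS = {"instances", "cores", "ram"}  # illustrative subset

    class FakeQuotaAPI(object):
        """Stand-in for a call against a running nova API."""
        def get_default_quotas(self, tenant_id):
            return {"instances": 10, "cores": 20, "ram": 51200}

    class TestDefaultQuotaContract(unittest.TestCase):
        def test_default_quota_keys_are_stable(self):
            quotas = FakeQuotaAPI().get_default_quotas("some-tenant")
            # Other projects key off these names; renaming or dropping one is
            # an interface break and should fail here, in nova's own tree.
            self.assertTrue(EXPECTED_DEFAULT_QUOTA_KEYS.issubset(quotas))

    if __name__ == "__main__":
        unittest.main()
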
> > Although, all I think we actually need is an api definition for testing in an external repo, just to prevent inadvertent changes. (whether that gets used in tempest or not) So another alternative I see here is something that I've started to outline in [4] to address the potential for code duplication and effort in the new functional test suites. If all the project specific functional tests are using clients from an external functional testing library repo then this concern goes away.
>
> Actually, I don't think these wold be using external clients. This is in tree testing.
>
> This will definitely be an experiment to get the API testing closer to the source. That being said, Swift really has done this fine for a long time, and I think we need to revisit the premise that projects can't be trusted.
>
> > Now, if something like this example were to be exposed because of a coverage gap I think it's fair game to have a specific test in nova's functional test suite. But, I also think there should be an external audit of that API somewhere too. Ideally I think what I'd like to see is probably a write-once test graduation procedure for moving appropriate things into tempest (or somewhere else) from the project specific functional tests. Basically like what we discussed during Maru's summit session on Neutron functional testing in Atlanta.
>
> Right, and I think basically we shouldn't graduate most of those tests. They are neutron tests, in the neutron tree. A few key ones we decide should be run outside that context.
>
> > For the other, more social, goal of this step in fostering communication between the projects and not using QA and/or Infra as a middle man I fully support. I agree that we probably have too proxying going on between projects using QA and/or infra instead of necessarily talking directly.
>
> Our current model leans far too much on the idea of the only time we ever try to test things for real is when we throw all 1 million lines of source code into one pot and stir. It really shouldn't be surprising how many bugs shake out there. And this is the wrong layer to debug from, so I firmly believe we need to change this back to something we can actually manage to shake the bugs out with. Because right now we're finding them, but our infrastructure isn't optimized for fixing them, and we need to change that.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev