On Thu, Jul 24, 2014 at 3:54 PM, Sean Dague <s...@dague.net> wrote:

> On 07/24/2014 05:57 PM, Matthew Treinish wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> >> OpenStack has a substantial CI system that is core to its development process. The goals of the system are to facilitate merging good code, prevent regressions, and ensure that there is at least one configuration of upstream OpenStack that we know works as a whole. The "project gating" technique that we use is effective at preventing many kinds of regressions from landing, however more subtle, non-deterministic bugs can still get through, and these are the bugs that are currently plaguing developers with seemingly random test failures.
> >>
> >> Most of these bugs are not failures of the test system; they are real bugs. Many of them have even been in OpenStack for a long time, but are only becoming visible now due to improvements in our tests. That's not much help to developers whose patches are being hit with negative test results from unrelated failures. We need to find a way to address the non-deterministic bugs that are lurking in OpenStack without making it easier for new bugs to creep in.
> >>
> >> The CI system and project infrastructure are not static. They have evolved with the project to get to where they are today, and the challenge now is to continue to evolve them to address the problems we're seeing now. The QA and Infrastructure teams recently hosted a sprint where we discussed some of these issues in depth. This post from Sean Dague goes into a bit of the background: [1]. The rest of this email outlines the medium and long-term changes we would like to make to address these problems.
> >>
> >> [1] https://dague.net/2014/07/22/openstack-failures/
> >>
> >> ==Things we're already doing==
> >>
> >> The elastic-recheck tool[2] is used to identify "random" failures in test runs. It tries to match failures to known bugs using signatures created from log messages. It helps developers prioritize bugs by how frequently they manifest as test failures. It also collects information on unclassified errors -- we can see how many (and which) test runs failed for an unknown reason and our overall progress on finding fingerprints for random failures.
> >>
> >> [2] http://status.openstack.org/elastic-recheck/
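
(Aside, for anyone who hasn't looked at elastic-recheck: the real tool keeps its fingerprints as Elasticsearch queries in per-bug YAML files, but the core matching idea is roughly the sketch below. The bug number and log strings are invented.)

    # Illustrative only -- not elastic-recheck's actual code. Each known bug
    # gets a "signature": strings that its failures leave in the job logs.
    KNOWN_BUG_SIGNATURES = {
        "1234567": ["Timed out waiting for a reply", "nova.compute.manager"],
    }

    def classify_failure(console_log):
        """Return the bug whose signature matches this failed run, if any."""
        for bug, patterns in KNOWN_BUG_SIGNATURES.items():
            if all(p in console_log for p in patterns):
                return bug
        return None  # unclassified: counts against our fingerprint coverage

    print(classify_failure("... Timed out waiting for a reply ... nova.compute.manager ..."))
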
> >> We added a feature to Zuul that lets us manually "promote" changes to the top of the Gate pipeline. When the QA team identifies a change that fixes a bug that is affecting overall gate stability, we can move that change to the top of the queue so that it may merge more quickly.
> >>
> >> We added the clean check facility in reaction to the January gate break down. While it does mean that any individual patch might see more tests run on it, it's now largely kept the gate queue at a countable number of hours, instead of regularly growing to more than a work day in length. It also means that a developer can Approve a code merge before tests have returned, and not ruin it for everyone else if there turned out to be a bug that the tests could catch.
> >>
> >> ==Future changes==
> >>
> >> ===Communication===
> >>
> >> We used to be better at communicating about the CI system. As it and the project grew, we incrementally added to our institutional knowledge, but we haven't been good about maintaining that information in a form that new or existing contributors can consume to understand what's going on and why.
> >>
> >> We have started on a major effort in that direction that we call the "infra-manual" project -- it's designed to be a comprehensive "user manual" for the project infrastructure, including the CI process. Even before that project is complete, we will write a document that summarizes the CI system and ensure it is included in new developer documentation and linked to from test results.
> >>
> >> There are also a number of ways for people to get involved in the CI system, whether focused on Infrastructure or QA, but it is not always clear how to do so. We will improve our documentation to highlight how to contribute.
> >>
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, which piles up over time. Our systems currently treat all changes as equally risky and important to the health of the system, which makes landing code changes to fix key bugs slow when we're at a high reset rate. We've got a manual process of promoting changes today to get around this, but that's actually quite costly in people time, and takes getting all the right people together at once to promote changes. You can see a number of the changes we promoted during the gate storm in June [3], and it was no small number of fixes to get us back to a reasonably passing gate. We think that optimizing this system will help us land fixes to critical bugs faster.
> >>
> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> >>
> >> The basic idea is to use the data from elastic recheck to identify that a patch is fixing a critical gate related bug. When one of these is found in the queues it will be given higher priority, including bubbling up to the top of the gate queue automatically. The manual promote process should no longer be needed, and instead bugs fixing elastic recheck tracked issues will be promoted automatically.
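
(To make the intended ordering concrete, here is a rough sketch -- this is not Zuul's actual code, just the policy as I understand it, and the bug and change numbers are invented.)

    # A change that fixes a bug elastic-recheck is tracking sorts ahead of
    # everything else in the gate queue; other changes keep their order.
    TRACKED_GATE_BUGS = {"1234567", "1301234"}  # would be fed from elastic-recheck

    def prioritize(gate_queue):
        # Each entry: {"change": "<number>,<patchset>", "closes_bug": "<bug>" or None}
        return sorted(gate_queue,
                      key=lambda c: c.get("closes_bug") not in TRACKED_GATE_BUGS)

    queue = [{"change": "101,1", "closes_bug": None},
             {"change": "102,3", "closes_bug": "1234567"}]
    print(prioritize(queue))  # the gate-bug fix bubbles to the top
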
> >> At the same time we'll also promote review on critical gate bugs through making them visible in a number of different channels (like on elastic recheck pages, review day, and in the gerrit dashboards). The idea here again is to make the reviews that fix key bugs pop to the top of everyone's views.
> >>
> >> ===Testing more tactically===
> >>
> >> One of the challenges that exists today is that we've got basically 2 levels of testing in most of OpenStack: unit tests, and running a whole OpenStack cloud. Over time we've focused on adding more and more configurations and tests to the latter, but as we've seen, when things fail in a whole OpenStack cloud, getting to the root cause is often quite hard. So hard in fact that most people throw up their hands and just run 'recheck'. If a test run fails, and no one looks at why, does it provide any value?
> >>
> >> We need to get to a balance where we are testing that OpenStack works as a whole in some configuration, but as we've seen, even our best and brightest can't seem to make OpenStack reliably boot a compute that has working networking 100% the time if we happen to be running more than 1 API request at once.
> >>
> >> Getting there is a multi party process:
> >>
> >> * Reduce the gating configurations down to some gold standard configuration(s). This will be a small number of configurations that we all agree that everything will gate on. This means things like postgresql, cells, different environments will all get dropped from the gate as we know it.
> >>
> >> * Put the burden for a bunch of these tests back on the projects as "functional" tests. Basically a custom devstack environment that a project can create with a set of services that they minimally need to do their job. These functional tests will live in the project tree, not in Tempest, so can be atomically landed as part of the project normal development process.
> >>
> >> * For all non gold standard configurations, we'll dedicate a part of our infrastructure to running them in a continuous background loop, as well as making these configs available as experimental jobs. The idea here is that we'll actually be able to provide more configurations that are operating in a more traditional CI (post merge) context. People that are interested in keeping these bits functional can monitor those jobs and help with fixes when needed. The experimental jobs mean that if developers are concerned about the effect of a particular change on one of these configs, it's easy to request a pre-merge test run. In the near term we might imagine this would allow for things like ceph, mongodb, docker, and possibly very new libvirt to be validated in some way upstream.
> >>
> >> * Provide some kind of easy to view dashboards of these jobs, as well as a policy that if some job is failing for > some period of time, it's removed from the system. We want to provide whatever feedback we can to engaged parties, but people do need to realize that engagement is key. The biggest part of putting tests into OpenStack isn't landing the tests, but dealing with their failures.
> >>
> >> * Encourage projects to specifically land interface tests in other projects when they depend on certain behavior.
> >
> > So I think we (or least I do) need clarification around this item. My question is which interfaces are we depending on that need these specific types of tests? Projects shouldn't be depending on another project's unstable interfaces. If specific behavior is required for a cross-project interaction it should be part of defined stable API, hopefully the REST API, and then that behavior should be enforced for everyone not just the cross-project interaction.
> >
> > If I'm interpreting this correctly the what is actually needed here is to actually ensure that there is test coverage somewhere for the APIs that should already be tested where there is a cross-project dependency. This is actually the same thing we see all the time because there is a lack of test coverage on certain APIs that are being used. (the nova default quotas example comes to mind) I just think calling this a special class of test is a bit misleading. Since it shouldn't actually differ than any other API test. Or am I missing something?
>
> Projects are consuming the behavior of other projects far beyond just the formal REST APIs. Notifications is another great instance of that.

I think the fact that notifications aren't versioned or really 'contractual', but are being used as such, is a huge issue.
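
To make that concrete, the kind of thing I would like to see is sketched below: an explicitly versioned payload that consumers can pin to. None of this exists today -- the field names and version value are purely illustrative.

    # Sketch of a versioned, contractual notification; not what nova emits today.
    def make_instance_create_notification(instance_uuid, state):
        return {
            "event_type": "compute.instance.create.end",
            "payload_version": "1.0",  # bumped on any breaking payload change
            "payload": {"instance_uuid": instance_uuid, "state": state},
        }

    def consume(notification):
        # A consumer (heat, ceilometer, ...) pins to the major version it
        # understands instead of guessing at an undocumented payload.
        if notification["payload_version"].split(".")[0] != "1":
            raise ValueError("unsupported notification payload version")
        return notification["payload"]["instance_uuid"]

    print(consume(make_instance_create_notification("fake-uuid", "active")))
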
> This is also more of a pragmatic organic approach to figuring out the interfaces we need to lock down. When one projects breaks depending on an interface in another project, that should trigger this kind of contract growth, which hopefully formally turns into a document later for a stable interface.

This approach sounds like a recipe for us playing a never-ending game of catch-up; when issues like this have arisen in the past, we have rarely gotten around to creating a stable interface (see notifications). I would rather push on using stable APIs (and creating them as needed) instead.

> >> Let's imagine an example of how this works in the real world.
> >>
> >> * The heat-slow job is deleted.
> >>
> >> * The heat team creates a specific functional job which tests some of their deeper function in Heat, all the tests live in Heat, and because of these the tests can include white/grey box testing of the DB and queues while things are progressing.
> >>
> >> * Nova lands a change which neither Tempest or our configs exercise, but breaks Heat.
> >>
> >> * The Heat project can now decide if it's more important to keep the test in place (preventing them from landing code), or to skip it to get back to work.
> >>
> >> * The Heat team then works on the right fix for Nova, or communicates with the Nova team on the issue at hand. The fix to Nova *also* should include tests which locks down that interface so that Nova won't break them again in the future (the ironic team did this with their test_ironic_contract patch). These tests could be unit tests, if they are testable that way, or functional tests in the Nova tree.
> >
> > The one thing I want to point out here is that ironic_contract test should be an exception, I don't feel that we want that to be the norm. It's not a good example for a few reasons, mostly around the fact that ironic tree depends on the purposefully unstable nova driver api as temporary measure until the ironic driver is merged into the nova tree. The contract api tests will go away once the driver is in the nova tree. It should not be necessary for something over the REST API, since the contact should be enforced through tempest. (even under this new model, I expect this to still be true)
> >
> > There was that comment which someone (I can't remember who) brought up at the Portland summit that tempest is acting like double book accounting for the api contract, and that has been something we've seen as extremely valuable historically. Which is why I don't want to see this aspect of tempest's role in the gate altered.
>
> I've been the holder of the double book accounting pov in the past. However, after the last six months of fragility, I just don't see how that's a sustainable point of view. The QA team remains somewhat constant size, and the number of interfaces and projects grows at a good clip.
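
For what it's worth, my mental model of the kind of in-tree interface test being discussed is roughly the sketch below, reusing the nova default quotas example from earlier in the thread. Everything in it is hypothetical -- the fake API class especially; a real version would live in nova's tree and exercise a running service through its functional test harness.

    # Hypothetical "interface contract" test: nova locking down behaviour that
    # other projects consume, so a refactor can't silently change it.
    import unittest

    EXPECTED_DEFAULT_QUOTA_KEYS = {"instances", "cores", "ram"}  # illustrative subset

    class FakeQuotaAPI(object):
        """Stand-in for a call against a running nova API."""
        def get_default_quotas(self, tenant_id):
            return {"instances": 10, "cores": 20, "ram": 51200}

    class TestDefaultQuotaContract(unittest.TestCase):
        def test_default_quota_keys_are_stable(self):
            quotas = FakeQuotaAPI().get_default_quotas("some-tenant")
            # Other projects key off these names; renaming or dropping one is
            # an interface break and should fail here, in nova's own tree.
            self.assertTrue(EXPECTED_DEFAULT_QUOTA_KEYS.issubset(quotas))

    if __name__ == "__main__":
        unittest.main()
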
> > Although, all I think we actually need is an api definition for testing in an external repo, just to prevent inadvertent changes. (whether that gets used in tempest or not) So another alternative I see here is something that I've started to outline in [4] to address the potential for code duplication and effort in the new functional test suites. If all the project specific functional tests are using clients from an external functional testing library repo then this concern goes away.
>
> Actually, I don't think these wold be using external clients. This is in tree testing.
>
> This will definitely be an experiment to get the API testing closer to the source. That being said, Swift really has done this fine for a long time, and I think we need to revisit the premise that projects can't be trusted.
>
> > Now, if something like this example were to be exposed because of a coverage gap I think it's fair game to have a specific test in nova's functional test suite. But, I also think there should be an external audit of that API somewhere too. Ideally I think what I'd like to see is probably a write-once test graduation procedure for moving appropriate things into tempest (or somewhere else) from the project specific functional tests. Basically like what we discussed during Maru's summit session on Neutron functional testing in Atlanta.
>
> Right, and I think basically we shouldn't graduate most of those tests. They are neutron tests, in the neutron tree. A few key ones we decide should be run outside that context.
>
> > For the other, more social, goal of this step in fostering communication between the projects and not using QA and/or Infra as a middle man I fully support. I agree that we probably have too proxying going on between projects using QA and/or infra instead of necessarily talking directly.
>
> Our current model leans far too much on the idea of the only time we ever try to test things for real is when we throw all 1 million lines of source code into one pot and stir. It really shouldn't be surprising how many bugs shake out there. And this is the wrong layer to debug from, so I firmly believe we need to change this back to something we can actually manage to shake the bugs out with. Because right now we're finding them, but our infrastructure isn't optimized for fixing them, and we need to change that.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev