> I think we need to actually step back a little and figure out where we are, how we got here, and what the future of validation might need to look like in OpenStack, because I think there have been some communication gaps. (Also, for people I've had vigorous conversations with about this before: realize my positions have changed somewhat, especially on separation of concerns.)
>
> (Also note, this is all mental stream right now, so I will not pretend that it's an entirely coherent view of the world; my hope in getting things down is that we can come up with that coherent view of the world together.)
>
> == Basic History ==
>
> In the Essex time frame Tempest was 70 tests. It was basically a barely adequate sniff test for integration for OpenStack. So much so that our first 3rd party CI system, SmokeStack, used its own test suite, which legitimately found completely different bugs than Tempest. Not surprising, given that Tempest was a really small set of integration tests.
>
> As we got to Grizzly, Tempest had grown to 1300 tests, somewhat organically. People were throwing a mix of tests into the fold: some using Tempest's client, some using official clients, some trying to hit the database doing white box testing. It had become kind of a mess and a Rorschach test. We had some really weird design summit sessions because many people had only looked at a piece of Tempest, and assumed the rest was like it.
>
> So we spent some time defining scope. Tempest couldn't really be everything to everyone. It would be a few things:
> * API testing for public APIs with a contract
> * Some throughput integration scenarios to test some common flows (these were expected to be small in number)
> * 3rd party API testing (because it had existed previously)
>
> But importantly, Tempest isn't a generic functional test suite. Focus is important, because Tempest's mission always was highly aligned with what eventually became called DefCore: some way to validate compatibility between clouds. Be that clouds built from upstream (is the cloud of 5 patches ago compatible with the cloud right now?), clouds from different vendors, public clouds vs. private clouds, etc.
>
> == The Current Validation Environment ==
>
> Today most OpenStack projects have 2 levels of validation: unit tests & Tempest. That's sort of like saying your house has a basement and a roof. For sufficiently small values of house, this is fine. I don't think our house is sufficiently small any more.
>
> This has led to things like Neutron's unit tests, which actually bring up a full WSGI functional stack and test plugins through HTTP calls through the entire WSGI stack, replicated 17 times. It's the reason that Neutron unit tests take many GB of memory to run, and often run longer than Tempest runs. (Maru has been doing heroic work to fix much of this.)
>
> In the last year we made it *really* easy to get a devstack node of your own, configured any way you want, to do any project-level validation you like. Swift uses it to drive their own functional testing. Neutron is working on heading down this path.
>
> == New Challenges with New Projects ==
>
> When we started down this path all projects had user APIs, so all projects were something we could think about from a tenant usage perspective. Looking at both Ironic and Ceilometer, we really have projects that are admin-API only.
>
> == Contracts or lack thereof ==
>
> I think this is where we start to overlap with Eoghan's thread most.
> Branchless Tempest assumes that the tests in Tempest are governed by a stable contract: the behavior should only change based on API version, not on the day of the week. In the case that triggered this thread, what was really being tested was not an API, but the existence of a meter that only showed up in Juno.
>
> Ceilometer is another great instance of something that's often in a state of huge amounts of stack tracing, because it depends on some internal interface in a project which isn't a contract, or on notification formats, which are (largely) not versioned.
>
> Ironic has a Nova driver in its tree, which implements the Nova driver internals interface. That means they depend on something that's not a contract, and it gets broken a lot.
>
> == Depth of reach of a test suite ==
>
> Tempest can only reach so far into a stack given that its levers are basically public API calls. That's OK. But it means that things like testing a bunch of different databases in the gate (i.e. the postgresql job) are pretty ineffectual. Trying to exercise code 4 levels deep through API calls is like driving a rover on Mars: you can do it, but only very carefully.
>
> == Replication ==
>
> Because there is such a huge gap between unit tests and Tempest tests, replication of issues is often challenging. Due to the sheer volume of results, we can see races in the gate that don't show up for developers very easily. When you do 30k runs a week, a ton of data falls out of it.
>
> A good instance is the live snapshot bug. It was failing on about 3% of Tempest runs, which means that it had about a 10% chance of killing a patch on its own. So it's definitely real. It's real enough that if we enable that path, there are a ton of extra rechecks required by people. However, it's at a frequency where reproducing it on demand is hard, and reproducing it with enough signal to make it debuggable is also hard.
>
> == The Fail Pit ==
>
> All of which has somewhat led us to the fail pit, where keeping OpenStack in a state that it can actually pass Tempest consistently is a full-time job. It's actually more than a full-time job; it's a full-time program. If it were its own program, it would probably be larger than half the official programs in OpenStack.
>
> Also, when the Gate "program" is understaffed, it means that all the rest of the OpenStack programs (possibly excepting infra and TripleO, because they aren't in the integrated gate) are slowed down dramatically. That velocity loss has real community and people-power implications.
>
> This is especially true of people trying to get time, review, mentoring, or otherwise out of the QA team, as there is a natural overlap with the folks that actually want us to be able to merge code. So while the gate is under water, getting help on Tempest issues isn't going to happen at any really responsive rate.
>
> Also, all the folks that have been the workhorses here (myself, Joe Gordon, Matt Treinish, Matt Riedemann) are pretty burnt out on this. Every time we seem to nail one issue, 3 more crop up. Having no end in sight and spending all your time shoveling out other projects' bugs is not a happy place to be.
>
> == New Thinking about our validation layers ==
>
> I feel like an ideal world would be the following:
>
> 1. All projects have unit tests for their own internal testing, and these pass 100% of the time. (Note: most projects have races in their unit tests, so they don't pass 100% of the time, and they are low priority to fix.)
> 2. All projects have a functional devstack job with tests *in their own tree* that poke their project in interesting ways. This is akin to what Neutron is trying and what Swift is doing. These are *not* co-gating.
> 3. All non-public-API contracts are shored up by landing contract tests in projects. We did this recently with Ironic in Nova - https://github.com/openstack/nova/blob/master/nova/tests/virt/test_ironic_api_contracts.py (see the sketch just after this list).
> 4. All public API contracts are tested in Tempest (these are co-gating, and ensure a contract breakage in Keystone doesn't break Swift).
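(As an aside, a contract test along the lines of #3 might look roughly like the sketch below. FakeDriver is a stand-in class defined purely for illustration; a real test, like the Nova/Ironic one linked above, would import the actual driver and pin down its actual method signatures.)

    # Rough, self-contained sketch of a "contract test" for an unversioned
    # internal interface.  FakeDriver stands in for the real driver class.
    import inspect
    import unittest


    class FakeDriver(object):
        """Stand-in for the internal driver interface being pinned down."""

        def spawn(self, context, instance, image_meta, injected_files,
                  admin_password, network_info=None, block_device_info=None):
            pass

        def destroy(self, context, instance, network_info,
                    block_device_info=None):
            pass


    class DriverAPIContractTest(unittest.TestCase):
        """Fail loudly if the internal interface changes shape."""

        def _assert_args(self, func, expected_args):
            self.assertEqual(expected_args, inspect.getargspec(func).args)

        def test_spawn_signature(self):
            self._assert_args(
                FakeDriver.spawn,
                ['self', 'context', 'instance', 'image_meta',
                 'injected_files', 'admin_password', 'network_info',
                 'block_device_info'])

        def test_destroy_signature(self):
            self._assert_args(
                FakeDriver.destroy,
                ['self', 'context', 'instance', 'network_info',
                 'block_device_info'])


    if __name__ == '__main__':
        unittest.main()

(The point being that a change to the interface shows up as a one-line assertion failure in a fast unit test run, rather than as a stack trace several services away in a Tempest job.)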
> Out of these 4 levels, we currently have 2 (#1 and #4). In some projects we're making #1 cover 1 & 2, and we're making #4 cover 4, 3, and sometimes 2. The problem with this is that it's actually pretty wasteful, and when things fail, they fail so far away from the test that reproducing them is hard.
>
> I actually think that if we went down this path we could make Tempest smaller. For instance, negative API testing is something I'd say is really #2. While those tests don't take a ton of time, they do add a certain amount of complexity. It might also mean that admin tests, whose side effects are sometimes hard to understand without white/greybox interactions, might migrate into #2.
>
> I also think that #3 would help expose much more surgically what the cross-project pain points are, instead of proxying these subtle issues through Tempest. Tempest is probably a terrible tool to discover that notifications in Nova changed. The result is some weird failure in a Ceilometer test which says some instance didn't run when it was expected; then you have to dig through 5 different OpenStack logs to figure out that it was really a deep exception somewhere. If it was logged, which it often isn't. (I actually challenge anyone to figure out the reason for a Ceilometer failure from a Tempest test based on its current logging. :) )
>
> And ensuring specific functionality earlier in the stack, letting Nova beat up Nova the way they think they should in a functional test (or landing a Neutron functional test to ensure that it's doing the right thing), would make the Tempest runs which are co-gating a ton more predictable.
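(Another aside: the in-tree functional tests in #2 could start out as simple as the sketch below, run against a devstack node. The endpoint, the port, and the exact shape of the versions document are illustrative assumptions here, not any particular project's real contract.)

    # Minimal sketch of an in-tree functional test: poke the service's own
    # REST API directly on a devstack node, no Tempest involved.
    import os
    import unittest

    import requests


    class VersionsFunctionalTest(unittest.TestCase):
        """Poke the service's unauthenticated versions document."""

        def setUp(self):
            # Where the devstack-deployed service is listening; override
            # via the environment for other services/hosts.
            self.endpoint = os.environ.get('SERVICE_ENDPOINT',
                                           'http://127.0.0.1:8774/')

        def test_versions_document(self):
            resp = requests.get(self.endpoint, timeout=10)
            # Version listings typically answer 200, or 300 for a
            # multiple-choices style response.
            self.assertIn(resp.status_code, (200, 300))
            body = resp.json()
            self.assertIn('versions', body)
            for version in body['versions']:
                # Every advertised version should carry an id and a status.
                self.assertIn('id', version)
                self.assertIn('status', version)


    if __name__ == '__main__':
        unittest.main()

(From there a project can go as deep as it likes, greybox assertions included, without any of it needing to co-gate on the rest of OpenStack.)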
> == Back to Branchless Tempest ==
>
> I think the real issue that projects are running into with branchless Tempest is that they are coming forward with tests not in class #4, which fail because, while the same API existed 4 months ago as today, the semantics of the project have changed in a non-discoverable way. Which I'd say is bad; however, until we tried the radical idea of running the API test suite against all releases that declared they had the same API, we didn't see it. :)
>
> OK, that was a lot. Hopefully it was vaguely coherent. I want to preface this by saying that I don't consider it all fully formed, but it's a lot of what's been rattling around in my brain.

Thanks for the very detailed response, Sean.

There's a lot in there, some of it background, some of it more focussed on the what-next question. I'll need to take a bit of time to digest all of that, and also discuss it at the weekly Ceilometer meeting tomorrow. I'll circle back with a more complete response after that.

Cheers,
Eoghan

> -Sean
>
> On 07/09/2014 05:41 AM, Eoghan Glynn wrote:
> >
> > TL;DR: branchless Tempest shouldn't impact on backporting policy, yet makes it difficult to test new features not discoverable via APIs
> >
> > Folks,
> >
> > At the project/release status meeting yesterday[1], I raised the issue that featureful backports to stable are beginning to show up[2], purely to facilitate branchless Tempest. We had a useful exchange of views on IRC but ran out of time, so this thread is intended to capture and complete the discussion.
> >
> > The issues, as I see them, are:
> >
> > * Tempest is expected to do double duty as both the integration testing harness for upstream CI and as a tool for externally probing capabilities in public clouds
> >
> > * Tempest has an implicit bent towards pure API tests, yet not all interactions between OpenStack services that we want to test are mediated by APIs
> >
> > * We don't have another integration test harness other than Tempest that we could use to host tests that don't just make assertions about the correctness/presence of versioned APIs
> >
> > * We want to be able to add new features to Juno, or fix bugs of omission, in ways that aren't necessarily discoverable in the API, without backporting these patches to stable if we wouldn't have done so under the normal stable-maint policy[3]
> >
> > * Integrated projects are required[4] to provide Tempest coverage, so the rate of addition of tests to Tempest is unlikely to slow down anytime soon
> >
> > So the specific type of test that I have in mind would be common for Ceilometer, but also possibly for Ironic and others:
> >
> > 1. an end-user initiates some action via an API (e.g. calls the cinder snapshot API)
> >
> > 2. this initiates some actions behind the scenes (e.g. a volume is snapshotted and a notification emitted)
> >
> > 3. the test reasons over some expected side-effect (e.g. some metering data shows up in ceilometer)
> >
> > The branchless Tempest spec envisages that new features will be added and will need to be skipped when testing stable/previous, but IIUC requires that the presence of new behaviors be externally discoverable[5].
> >
> > One approach mooted for allowing these kinds of scenarios to be tested was to split off the pure-API aspects of Tempest so that it can be used for probing public-cloud capabilities as well as upstream CI, and then build project-specific mini-Tempests to test integration with other projects.
> >
> > Personally, I'm not a fan of that approach, as it would require a lot of QA expertise in each project, lead to inefficient use of CI nodepool resources to run all the mini-Tempests, and probably lead to a divergent hotchpotch of per-project approaches.
> >
> > Another idea would be to keep all tests in Tempest, while also micro-versioning the services such that tests can be skipped on the basis of whether a particular feature-adding commit is present.
> >
> > When this micro-versioning can't be discovered by the test (as in the public cloud capabilities probing case), those tests would be skipped anyway.
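(To make that concrete, the kind of test I'm describing, written with the skip-unless-discoverable behaviour that branchless Tempest would want, might look roughly like the sketch below. The client helper functions, the meter name, and the polling interval are placeholders for illustration; the client calls mirror the python-cinderclient/python-ceilometerclient APIs as I understand them, and the real test would wire in properly configured clients plus whatever discovery mechanism we settle on.)

    # Rough sketch of a cross-service side-effect test that skips itself
    # when the behaviour it depends on is not discoverable in the cloud
    # under test.
    import time
    import unittest


    def get_cinder_client():
        # Placeholder: a real test would build an authenticated cinder
        # client here from test configuration.
        raise NotImplementedError


    def get_ceilometer_client():
        # Placeholder: likewise for a ceilometer client.
        raise NotImplementedError


    class SnapshotMeteringScenarioTest(unittest.TestCase):

        def setUp(self):
            self.cinder = get_cinder_client()
            self.ceilometer = get_ceilometer_client()

        def test_snapshot_emits_metering_sample(self):
            # 1. Only run if the deployment advertises the meter we depend
            #    on; otherwise skip, so the same test can run against older
            #    clouds that declare the same API.
            known_meters = set(m.name for m in self.ceilometer.meters.list())
            if 'snapshot' not in known_meters:
                self.skipTest('snapshot meter not discoverable here')

            # 2. End-user-visible action: snapshot a volume via the public
            #    API.  (A real test would wait for the volume to become
            #    available before snapshotting it.)
            volume = self.cinder.volumes.create(size=1)
            self.addCleanup(self.cinder.volumes.delete, volume)
            snapshot = self.cinder.volume_snapshots.create(volume.id)
            self.addCleanup(self.cinder.volume_snapshots.delete, snapshot)

            # 3. Expected side effect: a metering sample for that resource
            #    eventually shows up in ceilometer.
            for _ in range(30):
                samples = self.ceilometer.samples.list(meter_name='snapshot')
                if any(s.resource_id == snapshot.id for s in samples):
                    return
                time.sleep(10)
            self.fail('no snapshot sample appeared within 5 minutes')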
> > The final, less palatable, approach that occurs to me would be to revert to branchful Tempest.
> >
> > Any other ideas, or preferences among the options laid out above?
> >
> > Cheers,
> > Eoghan
> >
> > [1] http://eavesdrop.openstack.org/meetings/project/2014/project.2014-07-08-21.03.html
> > [2] https://review.openstack.org/104863
> > [3] https://wiki.openstack.org/wiki/StableBranch#Appropriate_Fixes
> > [4] https://github.com/openstack/governance/blob/master/reference/incubation-integration-requirements.rst#qa-1
> > [5] https://github.com/openstack/qa-specs/blob/master/specs/implemented/branchless-tempest.rst#scenario-1-new-tests-for-new-features
>
> --
> Sean Dague
> http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev