On Thu, Jul 10, 2014 at 11:56 AM, Sean Dague <s...@dague.net> wrote:
> On 07/10/2014 09:48 AM, Matthew Treinish wrote:
>> On Wed, Jul 09, 2014 at 09:16:01AM -0400, Sean Dague wrote:
>>> I think we need to actually step back a little and figure out where we are, how we got here, and what the future of validation might need to look like in OpenStack, because I think there have been some communication gaps. (Also, for people I've had vigorous conversations with about this before, realize my positions have changed somewhat, especially on separation of concerns.)
>>>
>>> (Also note, this is all mental stream right now, so I will not pretend that it's an entirely coherent view of the world; my hope in getting things down is that we can come up with that coherent view of the world together.)
>>>
>>> == Basic History ==
>>>
>>> In the Essex time frame Tempest was 70 tests. It was basically a barely adequate sniff test for integration for OpenStack. So much so that our first 3rd party CI system, SmokeStack, used its own test suite, which legitimately found completely different bugs than Tempest. Not surprising, given that Tempest was a really small set of integration tests.
>>>
>>> As we got to Grizzly, Tempest had grown to 1300 tests, somewhat organically. People were throwing a mix of tests into the fold, some using Tempest's client, some using official clients, some trying to hit the database doing white box testing. It had become kind of a mess and a Rorschach test. We had some really weird design summit sessions because many people had only looked at a piece of Tempest and assumed the rest was like it.
>>>
>>> So we spent some time defining scope. Tempest couldn't really be everything to everyone. It would be a few things:
>>> * API testing for public APIs with a contract
>>> * Some throughput integration scenarios to test some common flows (these were expected to be small in number)
>>> * 3rd party API testing (because it had existed previously)
>>>
>>> But importantly, Tempest isn't a generic functional test suite. Focus is important, because Tempest's mission was always highly aligned with what eventually became called DefCore: some way to validate compatibility between clouds. Be that clouds built from upstream (is the cloud of 5 patches ago compatible with the cloud right now?), clouds from different vendors, public clouds vs. private clouds, etc.
>>>
>>> == The Current Validation Environment ==
>>>
>>> Today most OpenStack projects have 2 levels of validation: unit tests & Tempest. That's sort of like saying your house has a basement and a roof. For sufficiently small values of house, this is fine. I don't think our house is sufficiently small any more.
>>>
>>> This has caused things like Neutron's unit tests, which actually bring up a full wsgi functional stack and test plugins via http calls through the entire wsgi stack, replicated 17 times. It's the reason that Neutron unit tests take many GB of memory to run, and often run longer than Tempest runs. (Maru has been doing heroic work to fix much of this.)
>>>
>>> In the last year we made it *really* easy to get a devstack node of your own, configured any way you want, to do any project-level validation you like. Swift uses it to drive their own functional testing. Neutron is working on heading down this path.
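
Just to make the devstack-node approach concrete for anyone who hasn't tried it: a project-level functional test against one of those nodes can be tiny. The sketch below is purely illustrative (the endpoint, port, and assertions are devstack-ish defaults I'm assuming, not anything a project actually ships today):

    import unittest

    import requests

    # Assumption: a devstack node is already running locally with Keystone on
    # its default port; point this at whatever service you actually deploy.
    KEYSTONE_ENDPOINT = 'http://127.0.0.1:5000/v2.0/'


    class TestDeployedKeystone(unittest.TestCase):

        def test_version_document(self):
            # Poke the real, deployed service over HTTP the way a client
            # would, with none of the co-gating weight of a Tempest run.
            resp = requests.get(KEYSTONE_ENDPOINT)
            self.assertEqual(200, resp.status_code)
            self.assertIn('version', resp.json())


    if __name__ == '__main__':
        unittest.main()

The point being that the test talks to a real deployment over HTTP, but lives in the project's own tree and doesn't co-gate with anything else.
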
>>>
>>> == New Challenges with New Projects ==
>>>
>>> When we started down this path all projects had user APIs, so all projects were something we could think about from a tenant usage perspective. Looking at both Ironic and Ceilometer, we really have projects that are Admin API only.
>>>
>>> == Contracts or lack thereof ==
>>>
>>> I think this is where we start to overlap with Eoghan's thread most, because branchless Tempest assumes that the tests in Tempest are governed by a stable contract. The behavior should only change based on API version, not on the day of the week. In the case that triggered this, what was really being tested was not an API, but the existence of a meter that only showed up in Juno.
>>>
>>> Ceilometer is another great instance of something that's often in a state of huge amounts of stack tracing, because it depends on some internal interface in a project which isn't a contract. Or notification formats, which (largely) aren't versioned.
>>>
>>> Ironic has a Nova driver in its tree, which implements the internal Nova driver interface. That means they depend on something that's not a contract. It gets broken a lot.
>>>
>>> == Depth of reach of a test suite ==
>>>
>>> Tempest can only reach so far into a stack given that its levers are basically public API calls. That's OK. But it means that things like testing a bunch of different databases in the gate (e.g. the postgresql job) are pretty ineffectual. Trying to exercise code 4 levels deep through API calls is like driving a rover on Mars. You can do it, but only very carefully.
>>>
>>> == Replication ==
>>>
>>> Because there is such a huge gap between unit tests and Tempest tests, replication of issues is often challenging. We have the ability to see races in the gate, due to the sheer volume of results, that don't show up for developers very easily. When you do 30k runs a week, a ton of data falls out of it.
>>>
>>> A good instance is the live snapshot bug. It was failing on about 3% of Tempest runs, which means that it had about a 10% chance of killing a patch on its own. So it's definitely real. It's real enough that if we enable that path, there are a ton of extra rechecks required by people. However, it's at a frequency where reproducing on demand is hard. And reproducing with enough signal to make it debuggable is also hard.
>>>
>>> == The Fail Pit ==
>>>
>>> All of which has somewhat led us to the fail pit, where keeping OpenStack in a state where it can actually pass Tempest consistently is a full-time job. It's actually more than a full-time job, it's a full-time program. If it were its own program it would probably be larger than half the official programs in OpenStack.
>>>
>>> Also, when the Gate "program" is understaffed, it means that all the rest of the OpenStack programs (possibly excepting infra and tripleo, because they aren't in the integrated gate) are slowed down dramatically. That velocity loss has real community and people-power implications.
>>>
>>> This is especially true of people trying to get time, review, mentoring, or otherwise, out of the QA team. There is kind of a natural overlap with the folks that actually want us to be able to merge code, so while the Gate is under water, getting help on Tempest issues isn't going to happen at any really responsive rate.
>>>
>>> Also, all the folks that have been the workhorses here (myself, Joe Gordon, Matt Treinish, Matt Riedemann) are pretty burnt out on this. Every time we seem to nail one issue, 3 more crop up. Having no end in sight and spending all your time shoveling out other projects' bugs is not a happy place to be.
>>>
>>> == New Thinking about our validation layers ==
>>>
>>> I feel like an ideal world would be the following:
>>>
>>> 1. all projects have unit tests for their own internal testing, and these pass 100% of the time (note, most projects have races in their unit tests, and they don't pass 100% of the time, and they are low priority to fix).
>>> 2. all projects have a functional devstack job with tests *in their own tree* that poke their project in interesting ways. This is akin to what Neutron is trying and what Swift is doing. These are *not* co-gating.
>>
>> So I'm not sure that this should be a mandatory thing, but an opt-in. My real concern is the manpower: who is going to take the time to write all the test suites for all of the projects? I think it would be better to add that on demand as the extra testing is required. That being said, I definitely view doing this as a good thing and something to be encouraged, because Tempest won't be able to test everything.
>>
>> The other thing to consider is duplicated effort between projects. For example, look at the CLI tests in Tempest: the functional testing framework for testing CLI formatting was essentially the same between all the clients, which is why they're in Tempest. Under your proposal here, CLI tests should be moved back to the clients. But would that mean we have a bunch of copy-and-pasted versions of the CLI test framework between all the projects?
>>
>> I really want to avoid a situation where every project does the same basic testing differently just in a rush to spin up functional testing. I think coming up with a solution for a place with common test patterns and frameworks that can be maintained independently of all the projects and consumed for project-specific testing is something we should figure out first. (I'm not sure oslo would be the right place for this necessarily.)
>
> It would be simple enough to have a test framework library for that. Realistically that could even co-gate with these tests. I think copy/paste is completely solvable here. It would effectively be back-door libification of Tempest.
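
Right, and the shared piece doesn't have to be big. As a rough sketch (the class and helper names here are made up for illustration, not an existing library), it could be little more than a base class each python-*client repo subclasses for its CLI formatting checks:

    import subprocess
    import unittest


    class ClientCLITestBase(unittest.TestCase):
        """Hypothetical shared base class for client CLI formatting tests.

        Each client repo subclasses this and points it at its own executable
        instead of copy/pasting the framework.
        """

        cli = None  # e.g. 'nova' or 'glance', set by the subclass

        def run_cli(self, *args):
            # Run the client and return decoded stdout, failing loudly on a
            # non-zero exit code.
            return subprocess.check_output((self.cli,) + args).decode('utf-8')

        def assert_table(self, output, expected_headers):
            # Sanity-check the prettytable-style listing the clients emit:
            # a '+---+' border line followed by a header row.
            lines = output.splitlines()
            self.assertGreater(len(lines), 3)
            self.assertTrue(lines[0].startswith('+'))
            for header in expected_headers:
                self.assertIn(header, lines[1])
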
We could even add things to oslotest, the test framework library we already have. :-)

> I don't think the manpower problem is being well solved in the current model. And I think the difficulty in debugging failures comes from missing these lower levels of testing, which verify behavior in a more contained situation. That, I think, impacts how excited people are to work on this, and the level of skill needed to help.
>
>>> 3. all non-public API contracts are shored up by landing contract tests in projects. We did this recently with Ironic in Nova - https://github.com/openstack/nova/blob/master/nova/tests/virt/test_ironic_api_contracts.py.
>>
>> So I think that the contract unit tests work well specifically for the Ironic use case, but aren't a general solution. Mostly because the Nova driver API is an unstable interface and there is no reason for that to change. It's also a temporary thing, because eventually the driver will be moved into Nova and then the only cross-project interaction between Ironic and Nova will be over the stable REST APIs.
>>
>> I think in general we should try to avoid doing non-REST-API cross-project communication. So hopefully there won't be more of this class of thing, and if there is we can tackle it on a per-case basis. But even if it's a non-REST API, I don't think we should ever encourage or really allow any cross-project interactions over unstable interfaces.
>>
>> As a solution for notifications I'd rather see a separate notification white/grey (or any other monochrome shade) box test suite. If as a project we say that notifications have to be versioned for any change, we can then enforce that easily with an external test suite that contains the definitions for all the notifications. It then just makes a bunch of API calls and sits on RPC verifying the notification format (or something of that ilk).
>>
>> I agree that normally whitebox testing needs to be tightly coupled with the data models in the projects, but I feel like notifications are slightly different. Mostly because the basic format is the same between all the projects, to make consumption simpler. So instead of duplicating the work to validate the notifications in all the projects, it would be better to just implement it once. I also think Tempest being an external audit on the API has been invaluable, so enforcing that for notifications would have similar benefits.
>>
>> As an aside, I think it would probably be fair if this was maintained as part of Ceilometer or the Telemetry program, since that's really all notifications are used for (or at least AIUI). But it would still be a co-gating test suite for anything that emits notifications.
>>
>>> 4. all public API contracts are tested in Tempest (these are co-gating, and ensure a contract breakage in Keystone doesn't break Swift).
>>>
>>> Out of these 4 levels, we currently have 2 (1 and 4). In some projects we're making #1 cover 1 & 2. And we're making #4 cover 4, 3, and sometimes 2. And the problem with this is it's actually pretty wasteful, and when things fail, they fail so far away from the test that reproducing them is hard.
>>
>> I think the only real issue in your proposal is that the boundaries between all the test classifications aren't as well defined as they seem. I agree that having more intermediate classes of testing is definitely a good thing to do.
>> Especially since there is a great deal of hand waving on the details of what gets run in between Tempest and unit tests. But the issue, as I see it, is that without guidelines on what types of tests belong where, we'll end up with a bunch of duplicated work.
>>
>> It's the same problem we have all the time in Tempest, where we get a lot of patches that exceed the scope of Tempest, despite it being arguably clearly outlined in the developer docs. But the complexity is higher in this situation, because of having a bunch of different types of test suites available to add a new test to. I just think that before we adopt #2 as mandatory, it's important to have a better definition of the scope of the project-specific functional testing.
>>
>>> I actually think that if we went down this path we could make Tempest smaller. For instance, negative API testing is something I'd say is really #2. While these tests don't take a ton of time, they do add a certain amount of complexity. It might also mean that admin tests, whose side effects are sometimes hard to understand without white/greybox interactions, might migrate into #2.
>>
>> I think that negative testing is still part of Tempest in your proposal. I still feel that the negative space of an API is part of the contract, and should be externally validated. Within Tempest I think we need to revisit the negative space solution again, because I haven't seen much growth in the automatic test generation. We can also probably be way more targeted about what we're running, but I don't think punting on negative testing in Tempest is something we should do.
>>
>> I actually think that testing on the admin API is doubly important because of the inadvertent side effects that it can cause. I think attempting to map that out is useful. (I don't think we can assume that the admin API is being used in a vacuum.) I agree that in your proposal, tests for those weird interactions might be more fitting for #2 (to avoid more heisenbugs in Tempest, etc.), but I'm on the fence about that. Mostly because I still think an admin API should conform to the API guidelines and thus needs Tempest tests for that. I know you've expressed the opposite opinion about stability of the admin APIs, but I fail to see the distinction between an admin API and any other API when it comes to the stable API guarantees.
>>
>> For a real-world example, look at the default-quotas API, which was probably the most recent example of this (and I suspect why you mentioned it here :) ). The reason the test was added was because it was previously removed from Nova while Horizon depended on it, which is exactly the kind of thing we should be using Tempest for (even under your proposal, since it's a co-gating REST API issue). What's better about this example is that the test added had all the harms you outlined about weird cross-interactions between this extension and the other tests. I think when we weigh the complexity against the benefits of testing admin APIs in Tempest, there isn't a compelling reason to pull them out of Tempest. But as an alternative we should start attempting to get clever about scheduling tests to avoid some strange cross-interactions.
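
On the negative-space point, the kind of check I have in mind is just this shape. (Plain requests against a hypothetical local devstack Nova, not actual Tempest code; auth is hand-waved and every name below is a placeholder.)

    import unittest
    import uuid

    import requests

    # Assumptions for illustration only: a local devstack Nova on its default
    # port, and a real token/tenant obtained out of band.
    NOVA_ENDPOINT = 'http://127.0.0.1:8774/v2/%(tenant_id)s'
    TOKEN = 'changeme'
    TENANT_ID = 'changeme'


    class TestServersNegativeSpace(unittest.TestCase):

        def test_show_nonexistent_server_returns_404(self):
            # The 404 on a bogus id is as much a part of the contract as the
            # 200 on a real one; clients and Horizon rely on both.
            url = (NOVA_ENDPOINT % {'tenant_id': TENANT_ID}
                   + '/servers/' + str(uuid.uuid4()))
            resp = requests.get(url, headers={'X-Auth-Token': TOKEN})
            self.assertEqual(404, resp.status_code)

Whether a check like that lives in Tempest or in the project-level #2 suite is exactly the boundary question, but either way it stays cheap to run.
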
>>
>>> I also think that #3 would help expose much more surgically what the cross-project pain points are, instead of proxying efforts through Tempest for these subtle issues. Because Tempest is probably a terrible tool to discover that notifications in Nova changed. The result is some weird failure in a Ceilometer test which says some instance didn't run when it was expected; then you have to dig through 5 different OpenStack logs to figure out that it was really a deep exception somewhere. If it was logged, which it often isn't. (I actually challenge anyone to figure out the reason for a Ceilometer failure from a Tempest test based on its current logging. :) )
>>
>> I agree that we should be directly testing the cross-project integration points which aren't REST APIs. I just don't think that we should decrease the level of API testing in Tempest for something that consumes that integration point. I just feel that if we ignore the top level too much, we're going to expose more API bugs. I think the real path forward is validating both separately. Hopefully that'll let us catch bugs at each level independently.
>>
>>> And ensuring specific functionality earlier in the stack, letting Nova beat up Nova the way they think they should in a functional test (or landing a Neutron functional test to ensure that it's doing the right thing), would make the Tempest runs which are co-gating a ton more predictable.
>>>
>>> == Back to Branchless Tempest ==
>>>
>>> I think the real issue that projects are running into with branchless Tempest is that they are coming forward with tests not in class #4, which fail, because while the same API existed 4 months ago as today, the semantics of the project have changed in a non-discoverable way. Which I'd say was bad; however, until we tried the radical idea of running the API test suite against all releases that declared they had the same API, we didn't see it. :)
>>>
>>> Ok, that was a lot. Hopefully it was vaguely coherent. I want to preface that I don't consider this all fully formed, but it's a lot of what's been rattling around in my brain.
>>
>> So here are some of my initial thoughts. I still need to stew some more on some of the details, so certain things may be more of a knee-jerk reaction and I might still be missing certain intricacies. Also, a part of my response here is just me playing devil's advocate. I definitely think more testing is always better. I just want to make sure we're targeting the right things, because this proposal is pushing for a lot of extra work for everyone. I want to make sure that before we commit to something this large, it's the right direction.
>
> A big part of the current challenge is that what we are trying to answer is the following:
>
> "Does the proposed commit do what it believes it does, and does it avoid regressions of behavior we believe should be preserved?"
>
> Right now our system (all parts partially to blame) is doing a very poor job of answering that question, because we're now answering it incorrectly more often than we answer it correctly. That causes other issues, because people are recheck-grinding patches through that *actually* cause bugs, but with Jenkins crying wolf so often, people lose patience and don't care.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev