On 06/13/2014 07:31 AM, Sean Dague wrote:
On 06/13/2014 02:36 AM, Mark McLoughlin wrote:
On Thu, 2014-06-12 at 22:10 -0400, Dan Prince wrote:
On Thu, 2014-06-12 at 08:06 -0400, Sean Dague wrote:
We're definitely deep into capacity issues, so it's going to be time to
start making tougher decisions about things we decide aren't different
enough to bother testing on every commit.
In order to save resources, why not combine some of the jobs in different
ways? So, for example, instead of:

  check-tempest-dsvm-full
  check-tempest-dsvm-postgres-full

Couldn't we just drop the postgres-full job and run one of the Neutron
jobs with PostgreSQL instead? Or something similar. So long as at least
one of the jobs that runs most of Tempest uses PostgreSQL, I think we'd
be mostly fine. Not shooting for 100% coverage of everything with our
limited resource pool is fine; let's just do the best we can.

Ditto for gate jobs (not check).
I think that's what Clark was suggesting in:

https://etherpad.openstack.org/p/juno-test-maxtrices

Previously we've been testing PostgreSQL in the gate because it has a
stricter interpretation of SQL than MySQL. And when we didn't test
PostgreSQL, it regressed. I know because I chased it for about 4 weeks in
Grizzly.

However, Monty brought up a good point at the summit: MySQL has a strict
mode, and that should actually enforce the same strictness.

My proposal is that we land this change to devstack -
https://review.openstack.org/#/c/97442/ and backport it to past devstack
branches.

Then we drop the pg jobs, as the differences between the two
configurations should then be very minimal. All the *actual* failures
we've seen between the two have been entirely about this strict SQL mode
interpretation.
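
A minimal sketch of the behaviour difference in question, assuming a
throwaway local MySQL database, the PyMySQL driver, and made-up
credentials (illustrative only, not what the devstack change literally
does): with a permissive sql_mode MySQL silently truncates an over-long
value, while with STRICT_ALL_TABLES the same INSERT raises an error,
which is what PostgreSQL has always done.

  from sqlalchemy import create_engine, text
  from sqlalchemy.exc import DataError

  # Hypothetical DSN; the demo table is throwaway.
  engine = create_engine("mysql+pymysql://root:secret@localhost/scratch")

  with engine.connect() as conn:
      conn.execute(text("CREATE TABLE IF NOT EXISTS demo (name VARCHAR(4))"))

      # Permissive mode: the value is silently truncated to 'too '.
      conn.execute(text("SET SESSION sql_mode = ''"))
      conn.execute(text("INSERT INTO demo (name) VALUES ('too long')"))

      # Strict mode: the same statement is rejected outright.
      conn.execute(text("SET SESSION sql_mode = 'STRICT_ALL_TABLES'"))
      try:
          conn.execute(text("INSERT INTO demo (name) VALUES ('too long')"))
      except DataError as exc:
          print("strict mode rejected the insert: %s" % exc)

Presumably the devstack change sets something equivalent globally in the
server configuration rather than per session, so every Tempest run would
exercise strict mode.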

I suppose I would like to see us keep it in the mix. Running SmokeStack
for almost 3 years, I found many an issue dealing with PostgreSQL. I ran
it concurrently with many of the other jobs, and I too had limited
resources (much less than what we have in infra today).

Would MySQL strict SQL mode catch stuff like this (old bugs, but I think
still relevant to this topic):

  https://bugs.launchpad.net/nova/+bug/948066

  https://bugs.launchpad.net/nova/+bug/1003756


Having support for and testing against at least 2 databases helps keep
our SQL queries and migrations cleaner... and is generally a good
practice given we have abstractions which are meant to support this sort
of thing anyway (so by all means let us test them!).
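
Whether or not the two bugs above are of this kind (I have not re-checked
them), one well-known class of divergence that no MySQL sql_mode setting
papers over is implicit type coercion in comparisons. A hedged sketch,
with made-up table and column names and DSNs, of a join MySQL will
happily execute and PostgreSQL will reject with "operator does not exist:
integer = character varying":

  from sqlalchemy import (Column, Integer, MetaData, String, Table,
                          create_engine, select)

  metadata = MetaData()
  instances = Table("instances", metadata,
                    Column("id", Integer, primary_key=True))
  instance_metadata = Table("instance_metadata", metadata,
                            Column("instance_id", String(36)))

  # The join condition compares an Integer column to a String column.
  # MySQL implicitly casts and returns rows; PostgreSQL refuses.
  query = select(instances.c.id).select_from(
      instances.join(instance_metadata,
                     instances.c.id == instance_metadata.c.instance_id))

  for url in ("mysql+pymysql://root:secret@localhost/scratch",
              "postgresql+psycopg2://root:secret@localhost/scratch"):
      engine = create_engine(url)
      metadata.create_all(engine)      # create the two demo tables
      with engine.connect() as conn:
          conn.execute(query)          # MySQL: ok; PostgreSQL: error

At least as I understand it, the strict-mode change addresses data
truncation but not every way in which the two engines interpret SQL
differently.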

Also, having compacted the Nova migrations three times now, I found many
issues by testing on multiple databases (MySQL and PostgreSQL). I'm
quite certain our migrations would be worse off if we tested against only
a single database.
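
As a concrete illustration of the migration side (assuming throwaway
MySQL and PostgreSQL databases with invented credentials; this is not one
of the actual issues the Nova compactions hit, just the general class):
adding a NOT NULL column without a default to a table that already has
rows. A permissive MySQL quietly backfills an implicit default, while
PostgreSQL refuses.

  from sqlalchemy import create_engine, text

  DDL = "ALTER TABLE demo ADD COLUMN flavor VARCHAR(16) NOT NULL"

  for url in ("mysql+pymysql://root:secret@localhost/scratch",
              "postgresql+psycopg2://root:secret@localhost/scratch"):
      engine = create_engine(url)
      with engine.begin() as conn:
          # Set up a one-row table on each backend.
          conn.execute(text("DROP TABLE IF EXISTS demo"))
          conn.execute(text("CREATE TABLE demo (id INTEGER PRIMARY KEY)"))
          conn.execute(text("INSERT INTO demo (id) VALUES (1)"))
      try:
          with engine.begin() as conn:
              conn.execute(text(DDL))   # permissive MySQL accepts this
          print("%s accepted the migration" % url)
      except Exception as exc:          # PostgreSQL: column contains nulls
          print("%s rejected it: %s" % (url, exc))

The portable fix (add the column with a server default, or add it
nullable, backfill, then tighten the constraint) is exactly the kind of
thing a single permissive backend never forces you to write.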
Certainly sounds like this testing is far beyond the "might one day be
useful" level Sean talks about.
The migration compaction is a good point. And I'm happy to see there
were some bugs exposed as well.

Here is where I remain stuck....

We are now at a failure rate at which it takes 3 days (minimum) to land a
fix that decreases our failure rate at all.

The way we are currently working around this is by effectively building a
"manual zuul": coordinating smart humans to do an end run around our own
system. We've merged 18 fixes this way so far -
https://etherpad.openstack.org/p/gatetriage-june2014. Merging a fix this
way is at least an order of magnitude more expensive in people time
because of the analysis and coordination we need to go through to make
sure these are the right things to jump the queue.

That effort, over 8 days, has gotten us down to *only* a 24hr merge
delay. And there are no more smoking guns. What's left is a ton of
subtle things. I've got ~30 patches outstanding right now (a bunch of
them are things to clarify what's going on in the build runs, especially
in the failure scenarios). Every single one of them has been failed by
Jenkins at least once, and almost every one was failed by a different,
unique issue.

So I'd say at best we're 25% of the way towards solving this. That being
said, because of the deep queues, people are just recheck grinding (or
hitting the jackpot and landing something that then fails a lot after
landing). That leads to bugs like this:

https://bugs.launchpad.net/heat/+bug/1306029

That bug was seen early in the patch - https://review.openstack.org/#/c/97569/

Then it kind of destroyed us completely for a day -
http://status.openstack.org/elastic-recheck/ (it's the top graph).

And, predictably, a week into a long gate queue everyone is now grumpy.
The sniping between projects, and within projects, over assigning blame
starts to spike at about day 4 of these events. Everyone assumes someone
else is to blame for these things.

So there is real community impact when we get to these states.

....

So, I'm kind of burnt out trying to figure out how to get us out of this,
as I do take it personally when we as a project can't merge code. That's
a terrible state to be in.

Pleading to get more people to dive in is mostly not helping.

So my only thinking at this point is that we prune back our test jobs to
a small enough number of configurations that the fixed number of people
actually trying to debug this actually can.

If there are other ideas, that's great.
I have another idea, which many will not like, but here goes anyway. First, I don't think we can thank Sean enough for his determination and skill in trying to deal with this. The fact that he has hit the wall, and that others have not felt able to help, indicates a real need to change something. I do not believe our current methodology can scale, and indeed it already isn't scaling.

IMO we are suffering from race bugs combined with a "branch" model that focuses on integrating each patch into a 1.7 million line (or so it is said) code base as quickly as possible. We do enough testing in the gate that race bugs cause merges to fail often, but not enough testing on each commit to keep race bugs out. If we actually tried to test enough in the gate to keep races out, we would never merge anything, and we would not have the resources to do it anyway. Making it worse, a new race bug that slips through could have come from anywhere, which makes it very difficult for people to help figure it out.

There is a different way to do this. We could adopt the same methodology we have now around gating, but applied to each project on its own branch. These project branches would be integrated into master at some frequency or when some new feature in project X is needed by project Y. Projects would want to pull from the master branch often, but the push process would be less frequent and run a much larger battery of tests than we do now. Doing this would have the following advantages:

1. It would be much harder for a race bug to get in. Each commit would be tested many more times on its branch before being merged to master than at present, including tests specialized for that project. The qa/infra teams and others would continue to define acceptance at the master level.
2. If a race bug does get in, projects have at least some chance to avoid merging the bad code.
3. Each project can develop its own gating policy for its own branch, tailored to the issues and tradeoffs it has. This includes focusing their time on running their own tests. We would no longer run a complete battery of nova tests on every commit to swift.
4. If a project branch gets into the situation we are now in:
     a) it does not impact the ability of other projects to merge code
     b) it is highly likely the bad code is actually in that project, so it is known who should help fix it
     c) those trying to fix it will be domain experts in the area that is failing
5. Distributing the gating load and policy to projects makes the whole system much more scalable as we add new projects.

Of course there are some drawbacks:

1. It will take longer, sometimes much longer, for any individual commit to make it to master. Of course, if a super-serious issue made it to master and had to be fixed immediately, it could be committed to master directly.
2. Branch management at the project level would be required. Projects would have to decide gating criteria, decide the timing of pulls, and coordinate integration to master with other projects.
3. There may be some technical limitations with git/gerrit/whatever that I don't understand but which would make this difficult.
4. It makes the whole thing more complicated from a process standpoint.

I have used this model in previous large software projects and it worked quite well. It may also be somewhat similar to what the Linux kernel does in some ways. This is not an actual proposal, since many details would have to be worked out, but I believe adopting a methodology like this would actually make a big dent in both the resource limitation issues and the race to the bottom that is causing the current pain. And it would do so in a way that scales even as we grow to a number of projects beyond the ability of any one person or team to have a handle on.

 -David

But 'you aren't allowed to do less' isn't really sustainable. That just
leads to people giving up on helping.

        -Sean



_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
