On 01/11/2014 11:38 AM, Sean Dague wrote:
>> 3) (still testing) https://review.openstack.org/#/c/65805/
>>
>> Right now when tempest runs in the devstack-gate jobs, it runs with
>> concurrency=4 (run 4 tests at once). Unfortunately, it appears that
>> this maxes out the deployment and results in timeouts (usually network
>> related).
>>
>> This patch changes tempest concurrency to 2 instead of 4. The initial
>> results are quite promising. The tests have been passing reliably so
>> far, but we're going to continue to recheck this for a while longer for
>> more data.
>>
>> One very interesting observation on this came from Jim where he said "A
>> quick glance suggests 1.2x -- 1.4x change in runtime." If the
>> deployment were *not* being maxed out, we would expect this change to
>> result in much closer to a 2x runtime increase.
>
> We could also address this by locally turning up timeouts on operations
> that are timing out. Which would let those things take the time they need.
>
> Before dropping the concurrency I'd really like to make sure we can
> point to specific fails that we think will go away. There was a lot of
> speculation around nova-network, however the nova-network timeout errors
> only pop up on elastic search on large-ops jobs, not normal tempest
> jobs. Definitely making OpenStack more idle will make more tests pass.
> The Neutron team has experienced that.
>
> It would be a ton better if we could actually feed back a 503 with a
> retry time (which I realize is a ton of work).
>
> Because if we decide we're now always pinned to only 2way, we have to
> start doing some major rethinking on our test strategy, as we'll be way
> outside the soft 45min time budget we've been trying to operate on. We'd
> actually been planning on going up to 8way, but were waiting for some
> issues to get fixed before we did that. It would sort of immediately put
> a moratorium on new tests. If that's what we need to do, that's what we
> need to do, but we should talk it through.
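To make the runtime argument quoted above concrete, here is a rough sketch of the arithmetic. It assumes the total amount of test work is fixed and that, without contention, runtime scales inversely with concurrency. The function name and that scaling model are illustrative assumptions of mine, not anything from tempest or devstack-gate.

```python
def per_worker_efficiency(old_workers, new_workers, observed_slowdown):
    """Estimate how fast each worker ran at the old concurrency, relative
    to a worker at the new concurrency.

    Assumes (hypothetically) a fixed amount of total work and that runtime
    would scale as 1/workers if the deployment were not saturated.
    """
    if old_workers <= 0 or new_workers <= 0 or observed_slowdown <= 0:
        raise ValueError("all arguments must be positive")
    # Slowdown we would expect with no contention at all (4 -> 2 gives 2.0).
    ideal_slowdown = old_workers / new_workers
    # A ratio below 1.0 means workers at the old concurrency were being
    # held back, i.e. the deployment was saturated.
    return observed_slowdown / ideal_slowdown


# The "1.2x -- 1.4x" range quoted above, going from 4 workers to 2.
for slowdown in (1.2, 1.4):
    eff = per_worker_efficiency(4, 2, slowdown)
    print(f"{slowdown}x slowdown -> per-worker efficiency {eff:.0%}")
```

Under those assumptions, the quoted range works out to each of the 4 workers running at roughly 60% to 70% of the speed of a worker in the 2-way run, which is consistent with the deployment being maxed out at concurrency=4.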

I can try to write up some detailed analysis on a few failures next week
to help justify it, but FWIW, when I was looking at this last week, I
felt like making this change was going to fix a lot more than the
nova-network timeout errors.

If we can already tell this is going to improve reliability, both when
using nova-network and neutron, then I think that should be enough to
justify it. Taking longer seems acceptable if that comes with a more
acceptable pass rate.

Right now I'd like to see us set concurrency=2 while we work on the more
difficult performance improvements to both neutron and nova-network, and
we can turn it back up later on once we're able to demonstrate that it
passes reliably without failures with a root cause of test load being
too high.

>> 5) https://review.openstack.org/#/c/65989/
>>
>> This patch isn't a candidate for merging, but was written to test the
>> theory that by updating nova-network to use conductor instead of direct
>> database access, nova-network will be able to do work in parallel better
>> than it does today, just as we have observed with nova-compute.
>>
>> Dan's initial test results from this are **very** promising. Initial
>> testing showed a 20% speedup in runtime and a 33% decrease in CPU
>> consumption by nova-network.
>>
>> Doing this properly will not be quick, but I'm hopeful that we can
>> complete it by the Icehouse release. We will need to convert
>> nova-network to use Nova's object model. Much of this work is starting
>> to catch nova-network up on work that we've been doing in the rest of
>> the tree but have passed on doing for nova-network due to nova-network
>> being in a freeze.
>
> I'm a huge +1 on fixing this in nova-network.

Of course. This is just a bit of a longer term effort.

-- 
Russell Bryant
