On 08/22/2013 10:37 PM, Dolph Mathews wrote:
>
> On Thu, Aug 22, 2013 at 7:48 PM, James E. Blair
> <jebl...@openstack.org> wrote:
>
> > Monty Taylor <mord...@inaugust.com> writes:
> >
> > > The infra team has done a lot of work in prep for our favorite
> > > time of year, and we've actually landed several upgrades to the
> > > gate without which we'd be in particularly bad shape right now.
> > > (I'll let Jim write about some of them later when he's not
> > > battling the current operational issues - they're pretty
> > > spectacular) As with many scaling issues, some of these upgrades
> > > have resulted in moving the point of pain further along the
> > > stack. We're working on solutions to the current pain points.
> > > (Or, I should say they are, because I'm on a plane headed to
> > > Burning Man and not useful for much other than writing emails.)
> >
> > Hi!
> >
> > The good news is that a lot of the operational problems over the past
> > few days have been corrected, we are now pretty close to the noise floor
> > of infrastructure issues in the gate, and over the next few days we'll
> > work to get rid of the remaining bugs.
> >
> > As I'm sure everyone knows, we've seen a huge growth in the project, the
> > number of changes, and the number of tests we run. That is both
> > wonderful, and a little terrifying! But we haven't been idle: we have
> > made some significant improvements and innovations to the project
> > infrastructure to deal with our growing load, especially during these
> > peak times.
> >
> > About a year ago, we realized that the growing number of jobs run (and
> > number of test machines on which we run those jobs) was going to cause
> > scaling issues with Jenkins. So with the help of Khai Do, we created
> > the gearman-plugin[1] for Jenkins, and then we modified Zuul to use it.
> > That means that Zuul isn't directly tied to Jenkins anymore, and can
> > distribute the jobs it needs to run to anything that can run them via
> > Gearman.
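For anyone trying to picture the decoupling described in that last paragraph, here is a rough sketch of the general pattern. It is only an illustration under loud assumptions: it is a toy in-process broker built on the Python standard library, not the Gearman protocol, not the gearman-plugin, and not how Zuul is actually written. The names Broker, worker, run_demo, "build:unit-tests", "master-a" and "master-b" are all made up for the example.

```python
import queue
import threading


class Broker:
    """Toy stand-in for a job server: one FIFO queue per job name.

    Hypothetical illustration only; a real Gearman server speaks a
    network protocol and this class does not.
    """

    def __init__(self):
        self._queues = {}
        self._lock = threading.Lock()

    def _queue_for(self, job_name):
        # Create the per-job queue on first use, under a lock.
        with self._lock:
            return self._queues.setdefault(job_name, queue.Queue())

    def submit(self, job_name, payload):
        """Scheduler side: enqueue a job, get back a one-slot result queue."""
        result = queue.Queue(maxsize=1)
        self._queue_for(job_name).put((payload, result))
        return result

    def grab(self, job_name, timeout=0.1):
        """Worker side: return (payload, result) or None if nothing is queued."""
        try:
            return self._queue_for(job_name).get(timeout=timeout)
        except queue.Empty:
            return None


def worker(broker, worker_name, job_name, handler, stop):
    """Serve job_name until told to stop; the scheduler never picks a worker."""
    while not stop.is_set():
        item = broker.grab(job_name)
        if item is None:
            continue
        payload, result = item
        result.put((worker_name, handler(payload)))


def run_demo(changes):
    """Submit one job per change and let two interchangeable workers run them."""
    broker = Broker()
    stop = threading.Event()
    threads = [
        threading.Thread(
            target=worker,
            args=(broker, name, "build:unit-tests",
                  lambda change: f"tested {change}", stop),
        )
        for name in ("master-a", "master-b")
    ]
    for t in threads:
        t.start()
    # The scheduler only knows the job name, not which worker will run it.
    pending = [broker.submit("build:unit-tests", c) for c in changes]
    results = [p.get(timeout=5) for p in pending]
    stop.set()
    for t in threads:
        t.join()
    return results
```

Calling run_demo([101, 102, 103]) returns one (worker name, result) pair per change, in submission order. Which of the two workers handled each change is not deterministic, which is the point: the submitter depends only on the job name, so more workers can be added without touching it.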
> >
> > A few weeks ago we took advantage of that by adding two new Jenkins
> > masters to our system, giving us one of the first (if not the first)
> > multi-master Jenkins systems. Since then, all of the test jobs have
> > been run on nodes attached to either jenkins01.openstack.org or
> > jenkins02.openstack.org (which you may have seen linked to from the Zuul
> > status page). That has given us the ability to upgrade Jenkins and its
> > plugins with no interruption due to the active-active nature of the
> > system. And we can add hundreds of test nodes to each of these systems
> > and continue to scale them horizontally as our load increases.
> >
> > With Jenkins now able to scale, the next bottleneck was the number of
> > test nodes. Until recently, we had a handful of special Jenkins jobs
> > which would launch and destroy the single-use nodes that are used for
> > devstack tests. We were seeing issues with Jenkins running those jobs,
> > as well as their ability to keep up with demand. So we started the
> > Nodepool project[2] to create a daemon that could keep up with the
> > demand for test nodes, be much more responsive, and eliminate some of
> > the occasional errors that we would see in the old Rube-Goldberg system
> > we had for managing nodes.
> >
> > In anticipation of the rush of patches for the feature freeze, we rolled
> > that out over the weekend so it was ready to go Monday. And it worked!
> >
> > In fact, it's extremely responsive. It immediately utilized our entire
> > capacity to supply test nodes. Which was great, except that a lot of
> > our tests are configured to use the git repos from Gerrit, which is why
> > Gerrit was very slow early in the week. Fortunately, Elizabeth Krumbach
> > Joseph has been working on setting up a new Git server. That alone is
> > pretty exciting, and she's going to send an announcement about it soon.
> > Since it was ready to go, we moved the test load from Gerrit to the new
> > git server, which has made Gerrit much more responsive again.
> > Unfortunately, the new git server still wasn't quite able to keep up
> > with the test load, so Clark Boylan, Elizabeth and I have spent some
> > time tuning it as well as load-balancing it across several hosts.
> >
> > That is now in place, and the new system seems able to cope with the
> > load from the current rush of patches.
> >
> > We're still seeing an occasional issue where a job is reported as LOST
> > because Jenkins is apparently unaware that it can't talk to the test
> > node. We have some workarounds in progress that we hope to have in
> > place soon.
> >
> > Our goal is to have the most robust and accurate test system possible,
> > that can run all of the tests we can think to throw at it. I think the
> > improvements we've made recently are going to help tremendously and I'm
> > pretty excited! As always, if you'd like to pitch in, stop by
> > #openstack-infra on Freenode and see what we're up to.
>
>
> Wow, nice work! Thank you, infra!

+1000

I am continually amazed by the work you guys do. It has been a key
factor in our ability to move so fast. Thanks for everything!

-- 
Russell Bryant

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev