Monty Taylor <mord...@inaugust.com> writes:

> The infra team has done a lot of work in prep for our favorite time of
> year, and we've actually landed several upgrades to the gate without
> which we'd be in particularly bad shape right now. (I'll let Jim write
> about some of them later when he's not battling the current operational
> issues - they're pretty spectacular) As with many scaling issues, some
> of these upgrades have resulted in moving the point of pain further
> along the stack. We're working on solutions to the current pain points.
> (Or, I should say they are, because I'm on a plane headed to Burning Man
> and not useful for much other than writing emails.)
Hi!

The good news is that a lot of the operational problems over the past
few days have been corrected. We are now pretty close to the noise
floor of infrastructure issues in the gate, and over the next few days
we'll work to get rid of the remaining bugs.

As I'm sure everyone knows, we've seen huge growth in the project, in
the number of changes, and in the number of tests we run. That is both
wonderful and a little terrifying! But we haven't been idle: we have
made some significant improvements and innovations to the project
infrastructure to deal with our growing load, especially during these
peak times.

About a year ago, we realized that the growing number of jobs run (and
the number of test machines on which we run those jobs) was going to
cause scaling issues with Jenkins. So with the help of Khai Do, we
created the gearman-plugin[1] for Jenkins, and then we modified Zuul to
use it. That means that Zuul isn't directly tied to Jenkins anymore and
can distribute the jobs it needs to run to anything that can run them
via Gearman (a rough sketch of the pattern appears at the end of this
note).

A few weeks ago we took advantage of that by adding two new Jenkins
masters to our system, giving us one of the first (if not the first)
multi-master Jenkins systems. Since then, all of the test jobs have
been run on nodes attached to either jenkins01.openstack.org or
jenkins02.openstack.org (which you may have seen linked from the Zuul
status page). That has given us the ability to upgrade Jenkins and its
plugins with no interruption, thanks to the active-active nature of the
system. And we can add hundreds of test nodes to each of these systems
and continue to scale them horizontally as our load increases.

With Jenkins now able to scale, the next bottleneck was the number of
test nodes. Until recently, we had a handful of special Jenkins jobs
which would launch and destroy the single-use nodes that are used for
devstack tests. We were seeing issues with Jenkins running those jobs,
as well as with their ability to keep up with demand. So we started the
Nodepool project[2] to create a daemon that could keep up with the
demand for test nodes, be much more responsive, and eliminate some of
the occasional errors that we would see in the old Rube Goldberg system
we had for managing nodes (a toy version of the loop is sketched below
as well).

In anticipation of the rush of patches for the feature freeze, we
rolled that out over the weekend so it was ready to go Monday. And it
worked! In fact, it's extremely responsive. It immediately utilized our
entire capacity to supply test nodes. Which was great, except that a
lot of our tests are configured to use the git repos from Gerrit, which
is why Gerrit was very slow early in the week.

Fortunately, Elizabeth Krumbach Joseph has been working on setting up a
new git server. That alone is pretty exciting, and she's going to send
an announcement about it soon. Since it was ready to go, we moved the
test load from Gerrit to the new git server, which has made Gerrit much
more responsive again.

Unfortunately, the new git server still wasn't quite able to keep up
with the test load, so Clark Boylan, Elizabeth, and I have spent some
time tuning it as well as load-balancing it across several hosts (the
last sketch below shows the general shape of that). That is now in
place, and the new system seems able to cope with the load from the
current rush of patches.

We're still seeing an occasional issue where a job is reported as LOST
because Jenkins is apparently unaware that it can't talk to the test
node. We have some workarounds in progress that we hope to have in
place soon.
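For the curious, here is a minimal sketch of the Gearman pattern
described above, using Python and the python-gearman library. The
server address, job name, and payload are invented for illustration;
Zuul's real function names and parameters differ. The worker half is
roughly what the gearman-plugin provides on each Jenkins master, and
the client half is roughly Zuul's role:

    import gearman

    # Worker process: roughly what the gearman-plugin does on a Jenkins
    # master. It registers the jobs that master can run and waits for work.
    def run_build(gearman_worker, gearman_job):
        # gearman_job.data carries the build parameters; a real worker
        # would run the build here and report the outcome.
        print('running %s with %s' % (gearman_job.task, gearman_job.data))
        return 'SUCCESS'

    def worker_process():
        worker = gearman.GearmanWorker(['zuul.example.org:4730'])
        worker.register_task('build:example-job', run_build)
        worker.work()  # blocks, accepting jobs as the server hands them out

    # Client process: roughly Zuul's role. It doesn't know or care which
    # Jenkins master (or other worker) picks the job up.
    def client_process():
        client = gearman.GearmanClient(['zuul.example.org:4730'])
        request = client.submit_job('build:example-job',
                                    '{"ZUUL_CHANGE": "12345"}')
        print(request.result)

Since any number of workers can register the same function, adding
another Jenkins master is just a matter of starting another worker;
that's what makes the active-active setup possible.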
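Here's a toy version of the idea behind Nodepool as well. This is not
Nodepool's actual code, and launch_node and delete_node are
hypothetical stand-ins for cloud API calls; the point is simply the
shape of the loop: keep a target number of booted, ready nodes on hand,
and because every node is single-use, replenish and reap continuously:

    import time

    TARGET_READY = 10  # booted, ready-to-test nodes to keep on hand

    ready_nodes = []   # nodes waiting to be handed to a Jenkins master
    used_nodes = []    # single-use nodes that have finished a job

    def launch_node():
        # Hypothetical stand-in for a cloud API call that boots a fresh
        # node from a prepared image.
        return {'name': 'node-%d' % int(time.time() * 1000000)}

    def delete_node(node):
        # Hypothetical stand-in for the corresponding cloud delete call.
        pass

    while True:
        # Replenish: jobs consume ready nodes (in real life, handing one
        # to Jenkins would pop it from ready_nodes, and a finished job
        # would push it onto used_nodes), so launch until we're back at
        # the target.
        while len(ready_nodes) < TARGET_READY:
            ready_nodes.append(launch_node())
        # Reap: a node that has run one job is never reused.
        while used_nodes:
            delete_node(used_nodes.pop())
        time.sleep(10)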
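Finally, the general shape of spreading the git test load across hosts.
The real setup puts a dedicated load balancer in front of the git
servers; this is just a toy round-robin picker with a cheap TCP health
check, and the backend hostnames are invented:

    import itertools
    import socket

    # Hypothetical backend pool; in production one address fans out to
    # several hosts like these.
    BACKENDS = ['git01.example.org', 'git02.example.org', 'git03.example.org']
    GIT_PORT = 9418  # the git:// protocol port

    def healthy(host):
        # Cheap check: can we open a TCP connection to the git daemon?
        try:
            socket.create_connection((host, GIT_PORT), timeout=2).close()
            return True
        except OSError:
            return False

    rotation = itertools.cycle(BACKENDS)

    def pick_backend():
        # Round-robin over the pool, skipping hosts that fail the check.
        for _ in range(len(BACKENDS)):
            host = next(rotation)
            if healthy(host):
                return host
        raise RuntimeError('no healthy git backends')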
Our goal is to have the most robust and accurate test system possible,
one that can run all of the tests we can think to throw at it. I think
the improvements we've made recently are going to help tremendously,
and I'm pretty excited! As always, if you'd like to pitch in, stop by
#openstack-infra on Freenode and see what we're up to.

-Jim

[1] http://git.openstack.org/cgit/openstack-infra/gearman-plugin/
[2] http://git.openstack.org/cgit/openstack-infra/nodepool/