Hi all! It's been a rough week from a CI and dev infrastructure perspective. I wanted to let people know what happened, what we did about it, and where it's going.
First of all, we had four different things happen this week (when it rains, it pours):

- Rackspace Cloud changed the compute service name
- github https connections started hanging
- Ubuntu Oneiric updated their kernel
- Rackspace Cloud started handing us servers without networking

I'll cover them one by one, but first, for those of you who don't know: we spin up cloud servers on demand on an account that Rackspace has provided. Normally, this is a great thing. Sometimes, being a cloud, it's shaky, and we do a number of things to guard against that, including pre-creating a pool of servers. Sometimes the world conspires against us and all of that is for naught.

As part of a longer-term solution, we have an engineer working on completing a plugin for Jenkins that will handle all of the provisioning BEFORE test time - so that if we can't spin up nodes to run tests on, we'll simply queue up tests rather than cause failures. I mention that because you are all an audience of cloud engineers, and I don't want you to think we're not working on the real solution. However, that's still probably three or four weeks out from being finished, so in the meantime we have to do this.

Now, for the details:

1) Rackspace Cloud changed the compute service name

The Cloud Servers API changed its service name this week from cloudServers to cloudServersLegacy. This caused libcloud, which is the basis of the scripts we use to provision nodes for devstack integration tests, to fail, which meant that the job that spins up our pool of available servers wasn't able to replenish the pool. Once we identified the problem (with the help of the libcloud folks), we put in a local patch that uses both names, until Rackspace rolled back the service name change (there's a rough sketch of that idea below, after item 2). But there were several hours in there where we simply couldn't spin up servers, and that was the basis of a large portion of yesterday's problems.

2) github https connections started hanging

We had a few intermittent github outages this week. Normally this shouldn't be too much of a problem, but (lucky us) we uncovered a bug in the URL Change Trigger plugin for Jenkins that we were using: it wasn't setting a TCP connect timeout, so if the remote end never answered, the connection attempt would hang indefinitely. Still not a huge deal, right? WELL - we use that plugin as part of a scheduled job, which runs inside the Cron thread inside Jenkins... so the TCP hang caused that thread to jam, which caused ALL jobs that run off a scheduled timer to just stop running. This is the reason for the exhaustion of devstack nodes on Tuesday, Wednesday and Thursday. Once we figured out what was going on, we patched the problem and submitted the fix upstream; they made a new release, which we upgraded to yesterday, so we should not suffer from this problem again. Longer term, we're finishing work on a patch to the gerrit trigger plugin so that we can stop polling github for post-merge changes and instead just respond to merge events in gerrit. (Ping me if you want to know why we need to write a patch for that.)
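As an aside, to make that timeout failure mode concrete: the plugin itself is Java, but the idea translates to any language. The snippet below is a minimal, hypothetical Python illustration (not our actual patch) - a connection attempt with no timeout can block its thread indefinitely, while an explicit connect timeout turns a hung connection into an exception the caller can handle.

    import socket

    # Minimal illustration, not the Jenkins plugin code: probe a host with an
    # explicit connect timeout so a hung TCP connection surfaces as an
    # exception instead of blocking the calling thread forever.
    def can_connect(host, port=443, timeout=10):
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            return True
        except (socket.timeout, socket.error):
            return False

    if __name__ == "__main__":
        # Without the timeout argument, create_connection() falls back to the
        # default blocking behaviour and could hang here indefinitely.
        print(can_connect("github.com"))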
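And circling back to item 1: the shape of the workaround there is simply to accept either service name when resolving the compute endpoint from the provider's service catalog. Here's a hypothetical Python sketch of that idea - the catalog structure and field names are assumptions for illustration, not the actual libcloud patch:

    # Hypothetical sketch of the dual-name lookup; the catalog structure and
    # field names here are assumptions, not the real Rackspace/libcloud data.
    ACCEPTED_SERVICE_NAMES = ("cloudServers", "cloudServersLegacy")

    def find_compute_endpoint(service_catalog):
        """Return the first public compute endpoint matching either name."""
        for service in service_catalog:
            if service.get("name") in ACCEPTED_SERVICE_NAMES:
                return service["endpoints"][0]["publicURL"]
        raise LookupError("no compute service named %s or %s"
                          % ACCEPTED_SERVICE_NAMES)

    # Example: both old- and new-style catalogs resolve the same way.
    catalog = [{"name": "cloudServersLegacy",
                "endpoints": [{"publicURL": "https://servers.example/v1.0"}]}]
    print(find_compute_endpoint(catalog))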
3) Ubuntu Oneiric updated their kernel

We're still working out the why of the breakage here. We update the base image we use for launching devstack nodes nightly so that spin-up time is lower, but due to intermittent cloud issues that hadn't been working properly for a few days. Last night it started working again and the base image updated. Unfortunately, an update to Ubuntu itself left us without some headers that we need for iscsi to work properly.

This borked up stable/diablo branch testing for nova pretty hard. We've fixed it moving forward by explicitly installing the package that provides the headers... the question of why it worked before the update is still under investigation. Longer term we're going to construct these test nodes in a different way, and we've discussed applying gating logic to them, so that we don't add a new node base as a usable node base until it's passed the trunk tests. (I'd personally like to do the same when updating many of our dependencies, but there is some structure we need to chat about there, probably at the next ODS.)

As if that wasn't enough fun for one week:

4) Rackspace Cloud started handing us servers without networking

Certainly not pointing fingers here - again, we're not really sure what's up with this one, but we started getting servers without working networking. The fix is ostensibly simple: test that the node we've spun up can actually take an ssh connection before we add it to the pool of available nodes. Again, once we're running with the jclouds plugin, Jenkins will just keep trying to make nodes until it can ssh in to one, so this problem will also cease to be.

Anyhow - sorry for the hiccups this week. We're trying to balance dealing with the fires as they happen with solving their root causes, so sometimes there's a bit more of a lag before a fix than we'd like. Here's hoping that next week doesn't bring us quite as much fun. Have a great weekend!

Monty