On 04/05/2012 01:22 AM, Justin Santa Barbara wrote:
> I've got Compute functionality working with the OpenStack Jenkins
> plugin, so it can launch nova instances as on-demand slaves now, run
> builds on them, and archive the results into swift. I'd like to open
> GitHub issues to track your requirements, but I have a few questions.
I shall do my best to elaborate...

>> We need disposable machines that are only used for one test, which
>> means spinning up and terminating hundreds of machines per day.
>
> Sounds like we want a function to terminate the machine after the job
> has run.
> https://github.com/platformlayer/openstack-jenkins/issues/1

Yes. That seems sensible.

>> We need to use machines from multiple providers simultaneously so that
>> we're resilient against errors with one provider.
>
> Label expressions should work here; you would apply a full set of axis
> labels to each machine ("rax oneiric python26") but then you would
> filter based only on the required axes ("oneiric python26"). Are labels
> sufficient for this?

Labels are sufficient for tying the jobs to the specific resource description. I think the idea here is that we definitely want to be able to configure multiple cloud providers, and for each provider (in some manner) be able to configure what a machine labeled "oneiric" would look like - likely as a combination of image, flavor and setup script. After that - honestly - as long as we can actually get an "oneiric" labeled machine from _someone_ when we ask for it, we're good.

>> We need to pull nodes from a pool of machines that have been spun up
>> ahead of time for speed.
>
> This sounds like a custom NodeProvisioner implementation. The current
> implementation is optimized around minimizing CPU hours, by doing load
> prediction. You have a different criterion, based on minimizing launch
> latency. It looks like it should be relatively easy to implement a new
> algorithm, although perhaps a bit tricky to figure out how to plug it in.
>
> https://github.com/platformlayer/openstack-jenkins/issues/2

Yeah - the average time to spin up a node and get it configured _when_it_works_ is between 5 and 10 minutes. devstack takes around that amount of time as well, so if we had to wait for a node to spin up for every run, we'd be doubling the time it takes to test a change. Then there's the fact that clouds fail to give us a working node ALL THE TIME, so waiting on retries and the like (even though handling that at Jenkins node-provisioning time would be technically correct) could lead to a terrible build queue!

>> We need to be able to select from different kinds of images for
>> certain tests.
>
> Are labels sufficient for this?

Yes. Configuring the characteristics of an image and assigning a label to those characteristics will definitely let us associate tests with the right running environment.

>> Machines need to be checked for basic functionality before being added
>> to the pool (we frequently get nodes without a functioning network).
>
> I believe Jenkins does this anyway; a node which doesn't have networking
> won't be able to get the agent. And you can run your own scripts after
> the slave boots up ("apt-get install openjdk", for example). Those
> scripts can basically do any checks you want. Is that enough?

Yes - just pointing out that it's a case we have to deal with at the moment, so it needs to be handled.
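For concreteness, here's roughly the shape of what our current scripts do for the pre-booted pool and the post-boot sanity check. This is an illustrative sketch, not our actual devstack-gate code - it assumes a python-novaclient-style API, and names like POOL_SIZE and check_node are made up for the example:

#!/usr/bin/env python
# Illustrative sketch only -- not the real devstack-gate scripts.
# Assumes a python-novaclient-style API; credentials, POOL_SIZE and
# check_node() are placeholders for this example.

import socket
import time

from novaclient.v1_1 import client

POOL_SIZE = 10   # how many ready nodes we try to keep on hand
SSH_PORT = 22

nova = client.Client('jenkins', 'secret', 'openstack-ci',
                     'https://example.com:5000/v2.0')  # placeholder creds


def check_node(ip, timeout=60):
    """Basic sanity check: can we reach sshd on the node at all?

    We frequently get nodes with no working network; anything that
    fails this check never makes it into the ready pool."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            sock = socket.create_connection((ip, SSH_PORT), timeout=5)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            time.sleep(5)
    return False


def top_up_pool(image, flavor, ready_nodes):
    """Boot replacement nodes ahead of time, so a test run never has to
    wait the 5-10 minutes a fresh boot (plus devstack) takes."""
    while len(ready_nodes) < POOL_SIZE:
        server = nova.servers.create('pool-node-%d' % int(time.time()),
                                     image, flavor)
        # Wait for ACTIVE, then sanity-check before trusting the node.
        while server.status not in ('ACTIVE', 'ERROR'):
            time.sleep(10)
            server = nova.servers.get(server.id)
        ips = [ip for addrs in server.networks.values() for ip in addrs]
        if server.status == 'ACTIVE' and ips and check_node(ips[0]):
            ready_nodes.append(server)
        else:
            server.delete()  # bad node: throw it away and try again

The important bits are just that a node only counts as "ready" once it has passed the check, and that anything failing the check gets deleted and replaced rather than handed to Jenkins.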
>> They need to be started from snapshots with cached data on them to
>> avoid false negatives from network problems.
>
> Can you explain this a bit more? This is to protect against the apt
> repositories / python sources / github repos being down? Would an http
> proxy be enough?

Yes. apt repositories, pypi and github are CONSTANTLY down, so we do a lot of work to pre-cache network-fetched resources onto a machine so that running the tests almost never involves a network fetch. (We've learned over the last year or so that any time a test system wants to fetch network resources, the number of false negatives due to github or pypi going away is unworkably high.) It's possible that an http proxy _might_ help with that - but the approach we've been taking so far is to have one process that spins up a node, does all the network fetching into local resources, and then snapshots that into an image which becomes the basis for subsequent node creation. The base image is updated nightly so that the amount of network fetching that has to happen at node instantiation time is minimized.

jclouds itself (rather than the plugin) has a caching feature which does auto-image creation based on node creation criteria. If you combine the characteristics of a node (image, flavor, init script, RAM, volumes, etc.) with a TTL, then the first time a node meeting those criteria is requested, jclouds will create one from scratch, but at the end of the user-data script run it will take an image snapshot which it can then use for subsequent creation of nodes matching the same description. When we combine that with the idea of a pool of spun-up nodes (also either currently implemented or to be implemented inside jclouds itself, since that capability has been requested by a bunch of the current jclouds userbase), we get the pooling and image optimization we're looking for (and currently do in the python scripts of devstack-gate) pretty transparently.

>> We need to keep them around after failures to diagnose problems, and
>> we need to delete those after a certain amount of time.
>
> From the github docs, it sounds like you don't get access anyway because
> of the providers' policies. Would it not therefore be better to take a
> ZIP or disk snapshot after a failed test, and then shut down the machine
> as normal?

Sometimes looking at the actual running state is nice. We currently keep failed nodes around for a bit and have the ability to manually inject a dev's keys onto the box on a one-off basis. We've used that a couple of times to get devs to help track down particularly odd or onerous problems. The policy decision is something I think we can (eventually) get - I just want to make sure we have the physical ability.

That being said - we've _also_ considered that a disk or machine snapshot might be a nice thing. If we get a provider which allows us to upload publicly accessible glance images, then we could take an image snapshot of the failed machine, upload it to glance, and tell the dev "here's the image id of your failed machine; spin one up on your own account if you want to troubleshoot."

> Also...
>
> You currently auto-update your images, which is cool
> (devstack-update-vm-image).

Thanks! We'd be _so_ dead if we didn't do that...
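To make the pre-cache-and-snapshot cycle concrete, the nightly refresh is roughly shaped like the sketch below. Again, this is illustrative rather than our real scripts - it assumes a python-novaclient-style API, and the image names, credentials and the elided caching step are placeholders:

#!/usr/bin/env python
# Illustrative sketch of the nightly image refresh -- not our real
# scripts. Assumes a python-novaclient-style API; image names,
# credentials and the elided caching step are placeholders.

import time

from novaclient.v1_1 import client

nova = client.Client('jenkins', 'secret', 'openstack-ci',
                     'https://example.com:5000/v2.0')  # placeholder creds

BASE_IMAGE_NAME = 'oneiric'  # the image node templates launch from
CANDIDATE_NAME = 'oneiric-%s' % time.strftime('%Y%m%d')


def wait_for(predicate, timeout=1800, interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise RuntimeError('timed out')


def refresh_base_image(flavor):
    """Boot a node, pre-cache apt/pypi/git content onto it, snapshot it.

    The point of the snapshot is that test runs almost never have to
    touch the network, so a pypi or github outage can't fail a build."""
    image = nova.images.find(name=BASE_IMAGE_NAME)
    server = nova.servers.create('image-update', image, flavor)
    wait_for(lambda: nova.servers.get(server.id).status == 'ACTIVE')

    # (ssh in here and run the caching step: apt-get downloads of the
    # packages we need, pip downloads, git clones of the repos under
    # test, etc. -- elided in this sketch.)

    # Snapshot the warmed-up node into a dated candidate image.
    image_id = nova.servers.create_image(server.id, CANDIDATE_NAME)
    wait_for(lambda: nova.images.get(image_id).status == 'ACTIVE')
    server.delete()

    # Whatever launches test nodes can now pick up the newest image
    # matching 'oneiric-*' rather than a hard-coded id; if anything
    # above failed, yesterday's image is still there untouched.
    return image_id

The useful property is that a refresh which fails partway leaves the previous day's image untouched, so we're never left with nothing to boot from.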
> Do you think this is something a plugin
> should do, or do you think this is better done through scripts and a
> matrix job? I'm leaning towards keeping it in scripts. The one thing I
> think we definitely need here is some sort of 'best match' image
> launching, rather than hard-coding to a particular ID, so that the cloud
> plugin will always pick up the newest image.
>
> https://github.com/platformlayer/openstack-jenkins/issues/3

Well - as I mentioned before, our current plan for getting rid of those scripts is based on jclouds auto-imaging of NodeTemplate criteria. Hard-coding the ID is definitely a tricky thing to think about.

Before I spoke with Adrian about his auto-caching stuff, my thought here was that the plugin should just generally have the ability to cut an image as a post-build step. If you have that, then you could have a matrix job which requests a machine with a label describing the base image, say "oneiric-base", and that job would have a post-build step of "snapshot to an image named oneiric". A separate job would then actually run the tests using the normal oneiric label, on a machine spun up from the created image.

The gotchas to handle there are failures in image creation... you don't want a failed overwrite of the oneiric image to leave you with nothing. Handling it sensibly across multiple providers will also be interesting. (Do you have a special job label for base oneiric on each provider - like rax-oneiric - and then a matrix job that runs the image-update job on rax-oneiric, hp-oneiric and trystack-oneiric, with the post-build step just being "snapshot to image id oneiric", which uploads to the provider it was called from? I guess that would work...)

Does that make sense?

Thanks!
Monty

_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp