Hi All,

Yesterday the final patch merged to run CI jobs on RH2, and last night we merged the patch to tripleo-ci to support RH2 jobs. So we now have a new job (gate-tripleo-ci-centos-7-ovb-ha) running on all tripleo patch reviews. This job runs pacemaker HA with a 3-node controller cluster and a single compute node. It's basically the same as our current HA job, but without net-iso.
Looking at pass rates this morning:

1. The jobs are failing on stable branches[1]
   o I've submitted patches to the mitaka and liberty branches to fix this (see the bug)
2. The pass rate does seem to be a little lower than the RH1 HA job
   o I'll look into this today, but overall the pass rate should be good enough for when RH1 is taken offline

The main difference between jobs running on rh2 and those on rh1 is that the CI slave IS the undercloud (we've eliminated the need for an extra undercloud node), which saves resources. We also no longer build an instack qcow2 image, which saves us a little time.

To make this work, early in the CI process we call out to a geard broker and pass it the instance ID of the undercloud. The broker creates a heat stack (using the OVB heat templates) with a number of nodes on a provisioning network, then attaches an interface on this provisioning network to the undercloud[2]. Ironic can then talk (over ipmi) to a bmc node to power the nodes on and PXE boot them. At the end of the job the stack is deleted.
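For anyone curious what that broker call could look like from the CI-slave side, here's a rough sketch using the gear Python library, with the broker side summarised in comments. The broker hostname, gearman job name, argument keys and heat parameters below are placeholders I've made up for illustration; the real logic lives in the te-broker create-env script[2].

    # Rough illustrative sketch only -- hostnames, the gearman job name and the
    # heat parameters are placeholders; see the te-broker script in [2] for
    # what actually runs.
    import json
    import time

    import gear

    client = gear.Client()
    client.addServer('te-broker.example.com')   # hypothetical broker address
    client.waitForServer()

    # Hand the broker the instance ID of this CI slave, which will act as
    # the undercloud once the OVB environment is attached to it.
    job = gear.Job('create-env',
                   json.dumps({'undercloud_instance_id': 'INSTANCE-UUID'}))
    client.submitJob(job)

    # Wait for the broker to report that the OVB environment is ready.
    while not job.complete:
        time.sleep(10)

    # On the broker side the work is roughly: create a heat stack from the
    # OVB templates (baremetal "nodes" plus a bmc on a provisioning network),
    # then attach the undercloud instance to that provisioning network, e.g.
    #
    #   heat.stacks.create(stack_name='ovb-env-%s' % env_id,
    #                      template=ovb_template,
    #                      parameters={'node_count': 4, ...})
    #   nova.servers.interface_attach(undercloud_id, None, provision_net_id, None)
    #
    # Ironic on the undercloud can then reach the bmc over ipmi to power the
    # nodes on and PXE boot them, and the stack is deleted at the end of the job.

Because the environment is just a heat stack, its size is a stack parameter rather than a fixed set of testenv nodes, which is what makes the periodic scale testing mentioned below cheap to set up.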
What's next?

o On Tuesday evening next, rh1 will be taken offline, so I'll be submitting a patch to remove all of the RH1 jobs; until we bring it back up we will only have a single tripleo-ci job.
o The RH1 rack will be available to us again on Thursday, at which point we have a choice:
  1. Bring rh1 back up as is and return everything to the status quo
  2. Redeploy rh1 with OVB and move away from the legacy system permanently
  If the OVB based jobs prove to be reliable, I think option 2 is worth thinking about. It wasn't the original plan, but it would allow us to move away from a legacy system that is getting harder to support as time goes on.
o RH2 was loaned to us to allow this to happen, so once we pick one of the options above and complete the deployment of RH1 we'll have to give it back.

The OVB based cloud opens up a couple of interesting options that we can explore if we stick with OVB:

1. Periodic scale test
   o With OVB it's possible to select the number of nodes we place on the provisioning network. For example, while testing rh2 I was able to deploy an overcloud with 80 compute nodes (we could do up to 120 on rh2, and even more on rh1). Doing this nightly when CI load is low would be an extremely valuable test to run and gather data on.
2. Dev quota to reproduce CI
   o On OVB it's now a lot easier to give somebody some quota to reproduce exactly what CI is using, in order to reproduce problems etc. This was possible on rh1, but it required a cloud admin to manually take testenvs away from CI (it was manual and messy, so we didn't do it much).

The move doesn't come without its costs:

1. tripleo-quickstart
   o Part of the tripleo-quickstart install is to first download a prebuilt undercloud image that we were building in our periodic job. Because the undercloud is now the CI slave, we no longer build an instack.qcow2 image. For the near future we can host the most recent one on RH2 (the IP will change, so this needs to change in tripleo-quickstart; better still, a DNS entry could be used so the switch-over would be smoother in future), but if we make the move to jobs of this type permanent we'll no longer be generating this image for quickstart, so we'll have to come up with an alternative. We could generate one in the periodic job, but I'm not sure how we could test it easily.
2. Moving the current-tripleo pin
   o I haven't yet put in place anything needed for our periodic job to move the current-tripleo pin, so until we get this done (and decide what to do about item 1 above) we're stuck on whatever pin we happen to be on on Tuesday when rh1 is taken offline. The pin moved last night to a repository from 2016-06-29, so we are at least reasonably up to date. If it looks like the rh1 deployment is going to take an excessive amount of time, we'll need to make this a priority.
3. The ability to telnet to CI slaves to get the console of running CI jobs doesn't work on RH2 jobs. This is because it uses the same port number (8088) that we use in tripleo for ironic to serve its iPXE images over http, so I've had to kill the console serving process until we solve this. If we want to fix it we'll have to explore changing the port number in either tripleo or infra.

I was putting together a screencast of how rh2 was deployed (with RDO mitaka), but after several hours of editing the screencasts into something usable, the software I was using (openshot) refused to generate what I had put together; in fact it crashed a lot. So if anybody has good suggestions for software I could use, I'll try again.

If I've missed anything please feel free to ask,
thanks,
Derek.

[1] - https://bugs.launchpad.net/tripleo/+bug/1598089
[2] - http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/scripts/te-broker/create-env