TL;DR: we would like to change the way HA is tested upstream to avoid being hit by avoidable bugs that the CI process should catch.
Long version: Today, upstream HA testing consists only of verifying that a three-controller setup comes up correctly and can spawn an instance. That's something, but it's far from enough, since we continuously see "day two" bugs.

We started covering this more than a year ago in internal CI, and today also on rdocloud, using a project named tripleo-quickstart-utils [1]. Despite its name, the project is not limited to tripleo-quickstart; it covers three principal roles:

1 - stonith-config: a playbook that can be used to automate the creation of fencing devices in the overcloud;

2 - instance-ha: a playbook that automates the seventeen manual steps needed to configure instance HA in the overcloud, tests them via Rally, and verifies that instance HA works;

3 - validate-ha: a playbook that runs a series of disruptive actions in the overcloud and verifies that it always behaves correctly, by deploying a heat template that involves all the overcloud components.

To make this usable upstream, we need to decide where to put this code. Here are some choices:

1 - tripleo-validations: the most logical place to put this, at least going by the name, would be tripleo-validations. I've talked with some of the folks working on it, and it came out that tripleo-validations is not meant for disruptive tests, so integrating this stuff would be out of scope.

2 - tripleo-quickstart-extras: apart from the fact that this is not something meant just for quickstart (the project supports Infrared and "plain" environments as well), even though we initially started there, in the end nobody was looking at the patches, since nobody was able to verify them. The result was a series of reviews stuck forever. So moving back to extras would be a step backward.

3 - Dedicated project (tripleo-ha-utils or just tripleo-utils): as with tripleo-upgrades or tripleo-validations, it would be ideal to have all of this grouped together and usable as a standalone thing.
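To give a feel for what a disruptive check in the validate-ha spirit looks like, here is a minimal Ansible sketch. This is purely illustrative and not taken from tripleo-quickstart-utils: the host group, resource name, timings, and the recovery condition are all hypothetical.

```yaml
# Hypothetical sketch only -- not code from tripleo-quickstart-utils.
# Disrupt a clustered resource, then verify the cluster recovers.
- hosts: overcloud_controllers
  become: true
  tasks:
    - name: Ban a core resource on one controller (disruptive action)
      command: pcs resource ban galera-bundle {{ inventory_hostname }}
      run_once: true

    - name: Give the cluster time to react
      pause:
        seconds: 60

    - name: Remove the ban so the resource can recover
      command: pcs resource clear galera-bundle
      run_once: true

    - name: Verify all resources come back (condition is illustrative)
      command: pcs resource
      register: pcs_out
      until: "'Stopped' not in pcs_out.stdout"
      retries: 10
      delay: 30
      run_once: true
```

The point is simply that each role is a plain playbook: the disruptive action, the wait, and the verification are ordinary tasks, so they can run from quickstart, Infrared, or a plain inventory alike.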
Any kind of test can be integrated inside the playbooks. Today we're using the bash framework to interact with the cluster, Rally to test instance HA, and Ansible itself to simulate full power outage scenarios.

There's been a lot of talk about this during the last PTG [2], and unfortunately I'll not be part of the next one, but I would like to see things moving on this side.

Everything I wrote is of course up for discussion; that's precisely the point of this mail. Thanks to all who'll give advice, suggestions, and thoughts about all this stuff.

[1] https://github.com/redhat-openstack/tripleo-quickstart-utils
[2] https://etherpad.openstack.org/p/qa-queens-ptg-destructive-testing

--
Raoul Scarazzini
[email protected]

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
