Hi,

Testing and maintaining a green status for upgrade jobs within the 3h time limit has proven very difficult, to say the least.
The net result is that nothing in the CI even touches the upgrade code. So during the Denver PTG it was decided to give up on running a full upgrade job within the 3h time limit and to focus instead on two complementary approaches that at least touch the upgrade code:

1. run a standalone upgrade: this tests the ansible upgrade playbook;
2. run an N->N upgrade: this tests the upgrade python code.

And here they are, still not merged but seen working:

- tripleo-ci-centos-7-standalone-upgrade:
  https://review.openstack.org/#/c/604706/
- tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades:
  https://review.openstack.org/#/c/607848/9

The first is good to merge (but others could disagree), the second could be as well (but I tend to disagree :))

The first leverages the standalone deployment and executes a standalone upgrade just after it.

The limitation is that it only tests non-HA services (sorry pidone, cannot test HA in standalone) and only the upgrade_tasks (i.e. not any workflow related to the upgrade cli).

The main benefits here are:

- ~2h to run the upgrade, still a bit long but well below the 3h time limit;
- we trigger a yum upgrade so that we can catch problems there as well;
- we test the standalone upgrade, which is good in itself;
- composable roles are available (as in standalone/all-in-one deployment), so you can make a specific upgrade test for your project if it fits into the standalone constraints.

For this last point, if standalone-specific roles eventually go into project testing (nova, neutron ...), those projects would also have a way to test their upgrade tasks. This would be the best case scenario.

Now, for the second point, the N->N upgrade. Its "limitation" is that ... well, it doesn't run a yum upgrade at all. We start from master and run the upgrade to master.

Its main benefits are:

- it takes ~2h20 to run, so well under the 3h time limit;
- tripleoclient upgrade code is run, which is one thing that the standalone upgrade cannot do;
- it also tends to exercise the idempotency of all the tasks, as it runs them on an already "upgraded" node;
- as an added bonus, it could gate the tripleo-upgrade role as well, as it definitely loads all of the role's tasks[1].

For those who stayed with me to this point, I'm throwing in another CI test that has already proved useful (it caught errors): the ansible-lint test. After a standalone deployment we just run ansible-lint on all the generated playbooks[2].

It produces standalone_ansible_lint.log[3] in the working directory. It only takes a couple of minutes to install ansible-lint and run it. It definitely gates against typos and the like. It touches hard-to-reach code as well; for instance the fast_forward tasks are linted. There are still no pidone tasks in there, but the test could easily be added to a job that has HA tasks generated.

Note that by default ansible-lint barks, as the generated playbooks hit several lint problems, so only the checks for syntax errors and misnamed tasks or parameters are currently activated. But all the lint problems are logged in the above file and can be fixed later on, at which point we could activate full lint gating.

Thanks for reading this long email; any comments, shouts of victory, cries of despair and reviews are welcome.

[1] but this still has to be investigated.
[2] testing review https://review.openstack.org/#/c/604756/ and main code https://review.openstack.org/#/c/604757/
[3] sample output http://paste.openstack.org/show/731960/

--
Sofer Athlan-Guyot
chem on #freenode
Upgrade DFG.