On 05/08/2017 06:45 AM, Marios Andreou wrote:
> Hi folks, after some discussion locally with colleagues about improving
> the upgrades experience, one of the items that came up was pre-upgrade
> and update validations. I took an AI to look at the current status of
> tripleo-validations [0] and posted a simple WIP [1] intended to be run
> before an undercloud update/upgrade and which just checks service
> status. It was pointed out by shardy that for such checks it is better
> to instead continue to use the per-service manifests where possible
> like [2] for example where we check status before N..O major upgrade.
>
> There may still be some undercloud specific validations that we can land
> into the tripleo-validations repo (thinking about things like the
> neutron networks/ports, validating the current nova nodes state etc?).
>
> So do folks have any thoughts about this subject - for example the kinds
> of things we should be checking - Steve said he had some reviews in
> progress for collecting the overcloud ansible puppet/docker config into
> an ansible playbook that the operator can invoke for upgrade of the
> 'manual' nodes (for example compute in the N..O workflow) - the point
> being that we can add more per-service ansible validation tasks into the
> service manifests for execution when the play is run by the operator -
> but I'll let Steve point at and talk about those.

We had a similar discussion regarding controller node replacement
because starting that process with the overcloud in an inconsistent
state tends to end badly. Unfortunately those docs are only available
downstream at this time, but the basics were:
- Verify that the stack is in a *_COMPLETE state (this may seem obvious,
  but we've had people try to do these major processes while the stack is
  in a broken state)
- Verify undercloud disk space. For node replacement we recommended a
  minimum of 10 GB free.
- Verify that all pacemaker services are up.
- Check Galera and Rabbit clusters and verify all nodes are up.
- For node replacement we also disabled stonith. That might be a good
  idea during upgrades as well in case some services take a while to come
  back up. You really don't want a node getting killed during the process.
- General undercloud service checks (nova, ironic, etc.)
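
To make a few of the checks above concrete, here is a rough Python sketch, not
anything that exists in tripleo-validations today. All function names are
hypothetical, "10 GB" is assumed to mean 10 GiB, and the stack status and node
lists are assumed to be gathered by the caller (for example from the heat and
cluster tooling) and passed in as plain strings.

```python
import shutil

# Assumption: "10 GB" is read as 10 GiB; adjust if decimal gigabytes were meant.
MIN_FREE_BYTES = 10 * 1024 ** 3


def stack_is_complete(stack_status):
    """Return True for statuses such as CREATE_COMPLETE or UPDATE_COMPLETE."""
    return stack_status.endswith("_COMPLETE")


def has_enough_disk(path="/", min_free=MIN_FREE_BYTES):
    """Return True when the filesystem holding `path` has >= `min_free` bytes free."""
    return shutil.disk_usage(path).free >= min_free


def missing_nodes(expected, online):
    """Return the expected cluster members that are not reported online, sorted."""
    return sorted(set(expected) - set(online))


def preflight(stack_status, expected_nodes, online_nodes,
              path="/", min_free=MIN_FREE_BYTES):
    """Collect human-readable failures; an empty list means every check passed."""
    failures = []
    if not stack_is_complete(stack_status):
        failures.append("stack status is %s, expected *_COMPLETE" % stack_status)
    if not has_enough_disk(path, min_free):
        failures.append("less than %d bytes free on %s" % (min_free, path))
    down = missing_nodes(expected_nodes, online_nodes)
    if down:
        failures.append("cluster nodes not online: %s" % ", ".join(down))
    return failures
```

Returning a list of failures rather than stopping at the first one lets the
operator see everything that needs fixing before starting the upgrade. The
pacemaker, stonith and general service checks are left out because they depend
on the output of the cluster tools, which this sketch does not try to parse.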
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)