On 30/11/15 23:21, Steven Hardy wrote:
On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
I'm tasked to implement a command that shows error messages when a
deployment has failed. I have a vague memory of having seen scripts
that do something like this, if that exists, can somebody point me in
teh right direction?
I wrote a super simple script and put it in a blog post a while back:

http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html

All it does is find the failed SoftwareDeployment resources, then do heat
deployment-show on the resource, so you can see the stderr associated with
the failure.

Having tripleoclient do that by default would be useful.

Any opinions on what that should do, specifically? Traverse failed
resources to find error messages, I assume. Anything else?
Yeah, but I think for this to be useful, we need to go a bit deeper than
just showing the resource error - there are a number of typical failure
modes, and I end up repeating the same steps to debug every time.

1. SoftwareDeployment failed (mentioned above).  Every time, you need to
see the name of the SoftwareDeployment which failed, figure out if it
failed on one or all of the servers, then look at the stderr for clues.

2. A server failed to build (OS::Nova::Server resource is FAILED), here we
need to check both nova and ironic, looking first to see if ironic has the
node(s) in the wrong state for scheduling (e.g nova gave us a no valid
host error), and then if they are OK in ironic, do nova show on the failed
host to see the reason nova gives us for it failing to go ACTIVE.

3. A stack timeout happened.  IIRC when this happens, we currently fail
with an obscure keystone related backtrace due to the token expiring.  We
should instead catch this error and show the heat stack status_reason,
which should say clearly the stack timed out.

If we could just make these three cases really clear and easy to debug, I
think things would be much better (IME the above are a high proportion of
all failures), but I'm sure folks can come up with other ideas to add to
the list.

I'm actually drafting a spec which includes a command which does this. I hope to submit it soon, but here is the current state of that command's description:

Diagnosing resources in a FAILED state
--------------------------------------

One command will be implemented:
- openstack overcloud failed list

This will print a yaml tree showing the hierarchy of nested stacks until it
gets to the actual failed resource, then it will show information regarding the
failure. For most resource types this information will be the status_reason,
but for software-deployment resources the deploy_stdout, deploy_stderr and
deploy_status code will be printed.

In addition to this stand-alone command, this output will also be printed when
an ``openstack overcloud deploy`` or ``openstack overcloud update`` command
results in a stack in a FAILED state.


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to