In our environment, when a boot attempt on a baremetal node fails, we
put that node into maintenance mode, which prevents Nova from scheduling
to it a second time. Automation then comes along and files repair
tickets for the bad hardware. The node becomes available for scheduling
again only after a human or other automation fixes it and removes the
"maintenance" state.
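
To make that flow concrete, here is a rough sketch of the logic as a small
in-memory model. It is only an illustration: the Node and Inventory classes
and all of their method names are made up for this example and are not
Nova or Ironic APIs, and the "ticket" is just a string in a list.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a baremetal node record."""
    name: str
    maintenance: bool = False
    maintenance_reason: str = ""


@dataclass
class Inventory:
    """Hypothetical in-memory model of the maintenance workflow."""
    nodes: dict = field(default_factory=dict)
    tickets: list = field(default_factory=list)
    _ticketed: set = field(default_factory=set)

    def add(self, name):
        self.nodes[name] = Node(name)

    def schedulable(self):
        # Nodes in maintenance are hidden from scheduling.
        return sorted(n.name for n in self.nodes.values() if not n.maintenance)

    def on_boot_failure(self, name, reason):
        # A failed boot attempt puts the node into maintenance.
        node = self.nodes[name]
        node.maintenance = True
        node.maintenance_reason = reason

    def file_tickets(self):
        # Separate automation pass: one ticket per node in maintenance.
        for node in self.nodes.values():
            if node.maintenance and node.name not in self._ticketed:
                self.tickets.append(f"{node.name}: {node.maintenance_reason}")
                self._ticketed.add(node.name)

    def repair(self, name):
        # A human or other automation fixes the node and clears the state.
        node = self.nodes[name]
        node.maintenance = False
        node.maintenance_reason = ""
        self._ticketed.discard(name)


inv = Inventory()
inv.add("node-1")
inv.add("node-2")
inv.on_boot_failure("node-1", "boot attempt failed")
inv.file_tickets()
print(inv.schedulable())
inv.repair("node-1")
print(inv.schedulable())
```

The point of the design is that the retry never has to be told "avoid
host 1": the failed node simply is not a candidate until someone clears
the maintenance state.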

On Mon, May 22, 2017 at 1:25 PM, Eric Fried <openst...@fried.cc> wrote:

> Hey folks, sorry if this is a jejune question, but:
>
> In a no-reschedules-by-nova world, if a deploy fails on host 1, how does
> the orchestrator (whatever that may be) ask nova to deploy in such a way
> that it'll still try to find a good host, but *avoid* host 1?  If host 1
> was an attractive candidate the first time around, wouldn't it be likely
> to remain high on the list the second time?
>
> I'd also like to second the thought that the monolithic "instance in
> error state" gives the orchestrator no hint as to whether the deploy
> failed because of something the orchestrator did (remedy may be to
> redrive with different inputs, but no need to exclude the original
> target host) versus because something went wrong on the compute host
> (remedy would be to retry on a different host with the same inputs).
> Kind of analogous to the difference between HTTP 4xx and 5xx error
> classes.  (Perhaps implying a design whereby the nova API responds to
> the deploy request with different error codes accordingly.)
>
> Thanks,
> efried
> .
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
