For baremetal in our environment, when a boot attempt fails we put that node into maintenance mode, which prevents Nova from scheduling onto it a second time. Automation then files repair tickets for the bad hardware. Only when a human (or other automation) fixes the node and clears the maintenance state does it become available for scheduling again.
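In case it's useful, the "quarantine the node" step looks roughly like the sketch below. This is a minimal illustration, not our actual tooling: it assumes python-ironicclient with keystone credentials in the usual OS_* environment variables, the exact auth kwargs accepted by get_client() vary between client releases, and quarantine_failed_node / release_repaired_node / file_repair_ticket are illustrative names.

import os

from ironicclient import client as ir_client


def _ironic_client():
    # Auth kwargs accepted by get_client() differ a bit between
    # python-ironicclient releases; adjust for your version.
    return ir_client.get_client(
        1,
        os_username=os.environ['OS_USERNAME'],
        os_password=os.environ['OS_PASSWORD'],
        os_auth_url=os.environ['OS_AUTH_URL'],
        os_project_name=os.environ['OS_PROJECT_NAME'],
    )


def file_repair_ticket(node_uuid, reason):
    # Placeholder for whatever ticketing system your automation talks to.
    print('TODO: open hardware ticket for %s: %s' % (node_uuid, reason))


def quarantine_failed_node(node_uuid, error_summary):
    ironic = _ironic_client()
    # Maintenance mode keeps the node out of scheduling until the flag
    # is cleared.
    ironic.node.set_maintenance(
        node_uuid, True, maint_reason='boot failed: %s' % error_summary)
    file_repair_ticket(node_uuid, error_summary)


def release_repaired_node(node_uuid):
    # Clearing maintenance makes the node schedulable again.
    _ironic_client().node.set_maintenance(node_uuid, False)

On the command line the equivalent is "openstack baremetal node maintenance set <node> --reason ..." and "openstack baremetal node maintenance unset <node>".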
On Mon, May 22, 2017 at 1:25 PM, Eric Fried <openst...@fried.cc> wrote:
> Hey folks, sorry if this is a jejune question, but:
>
> In a no-reschedules-by-nova world, if a deploy fails on host 1, how does
> the orchestrator (whatever that may be) ask nova to deploy in such a way
> that it'll still try to find a good host, but *avoid* host 1? If host 1
> was an attractive candidate the first time around, wouldn't it be likely
> to remain high on the list the second time?
>
> I'd also like to second the thought that the monolithic "instance in
> error state" gives the orchestrator no hint as to whether the deploy
> failed because of something the orchestrator did (remedy may be to
> redrive with different inputs, but no need to exclude the original
> target host) versus because something went wrong on the compute host
> (remedy would be to retry on a different host with the same inputs).
> Kind of analogous to the difference between HTTP 4xx and 5xx error
> classes. (Perhaps implying a design whereby the nova API responds to
> the deploy request with different error codes accordingly.)
>
> Thanks,
> efried
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators