Hey Tan, some comments inline.

On 5/31/16 1:25 AM, Tan, Lin wrote:
Hi,

Recently, I am working on a spec[1] in order to recover nodes which get stuck 
in deploying state, so I really expect some feedback from you guys.

Ironic nodes can be stuck in 
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is 
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node 
resource is under control of another conductor.

To be more clear, let's narrow the scope and focus on the deploying state 
first. Currently, people do have several choices to clear the reserved lock:
1. restart the dead conductor
2. wait up to 2 or 3 minutes and _check_deploying_states() will clear the lock.
3. The operator touches the DB to manually recover these nodes.
I actually like option #3 being optionally integrated into a tool to clear nodes stuck in *ing state. If specified, it would clear the lock on the deploy as it moved it from DEPLOYING -> DEPLOYFAILED. Obviously, for cleaning this could be dangerous, and should be documented as so -- imagine clearing a lock mid-firmware flash and having a power action taken to brick the node.

Given this is tooling intended to handle many cases, I think it's better to give the operator the choice to take more dramatic action if they wish.


Thanks,
Jay Faulkner
Option two looks very promising but there are some weakness:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live 
conductors.
2.3 It won't work if the node is in maintenance. (only a corner case).

Definitely we should improve the option 2, but there are could be more issues I 
didn't know in a more complicated environment.
So my question is do we still need a new command to recover these node easier 
without accessing DB, like this PoC [2]:
   ironic-noderecover --node_uuids=UUID1,UUID2  
--config-file=/etc/ironic/ironic.conf

Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to