Hi Jay, Dmitry and all,

I have submitted two patches to try to recover nodes which are stuck in the deploying state:
1. Fix the issue of the ironic-conductor being brought up with a different hostname: https://review.openstack.org/325026
2. Clear the lock of nodes in maintenance states: https://review.openstack.org/#/c/324269/

If the above solutions are promising, then we don't need a new tool to recover nodes in the deploying state.

B.R.

Tan

-----Original Message-----
From: Jay Faulkner [mailto:j...@jvf.cc]
Sent: Thursday, June 2, 2016 7:45 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>>
>>> Recently, I have been working on a spec [1] to recover nodes which
>>> get stuck in the deploying state, so I would really appreciate some
>>> feedback from you all.
>>>
>>> Ironic nodes can get stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>>> node is reserved by a dead conductor (the exclusive lock was not released).
>>> Any further requests will be denied by ironic because it thinks the
>>> node resource is under the control of another conductor.
>>>
>>> To be more clear, let's narrow the scope and focus on the deploying
>>> state first. Currently, people have several choices to clear the
>>> reserved lock:
>>> 1. Restart the dead conductor.
>>> 2. Wait up to 2 or 3 minutes until _check_deploying_states() clears the lock.
>>> 3. The operator touches the DB to manually recover these nodes.
>>>
>>> Option two looks very promising, but there are some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on
>>> live conductors.
>>> 2.3 It won't work if the node is in maintenance. (Only a corner case.)
>>
>> We can and should fix all three cases.
>
> 2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().
>
> The method claims to do exactly what you suggest in 2.1 and 2.2 -- it
> gathers a list of Nodes reserved by *any* offline conductor and tries to
> release the lock.
> However, it will always fail to update them, because
> objects.Node.release() raises a NodeLocked exception when called on a Node
> locked by a different conductor.
>
> Here's the relevant code path:
>
> ironic/conductor/manager.py:
> 1259     def _check_deploying_status(self, context):
>          ...
> 1269         offline_conductors = self.dbapi.get_offline_conductors()
>          ...
> 1273         node_iter = self.iter_nodes(
> 1274             fields=['id', 'reservation'],
> 1275             filters={'provision_state': states.DEPLOYING,
> 1276                      'maintenance': False,
> 1277                      'reserved_by_any_of': offline_conductors})
>          ...
> 1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285             try:
> 1286                 objects.Node.release(context, conductor_hostname, node_id)
>          ...
> 1292             except exception.NodeLocked:
> 1293                 LOG.warning(...)
> 1297                 continue
>
> As far as 2.3, I think we should change the query string at the start
> of this method so that it includes nodes in maintenance mode. I think
> it's both safe and reasonable (and, frankly, what an operator will
> expect) that a node which is in maintenance mode, and in DEPLOYING
> state, whose conductor is offline, should have that reservation cleared and
> be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I have any nodes in a *ING state, and they are put into maintenance, it should force a failure.
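To make that concrete, here is a rough, untested sketch of what the periodic check could look like with the maintenance filter dropped, based on the excerpt quoted above; the DEPLOYFAILED transition is left as a comment because the exact mechanism is still open for discussion:

    # Sketch only (untested): extend _check_deploying_status() so that nodes
    # reserved by an offline conductor are handled even when in maintenance.
    offline_conductors = self.dbapi.get_offline_conductors()
    node_iter = self.iter_nodes(
        fields=['id', 'reservation'],
        filters={'provision_state': states.DEPLOYING,
                 # no 'maintenance': False filter here, so nodes in
                 # maintenance are included as well
                 'reserved_by_any_of': offline_conductors})
    for node_uuid, driver, node_id, conductor_hostname in node_iter:
        try:
            objects.Node.release(context, conductor_hostname, node_id)
            # A follow-up step would move the node to DEPLOYFAILED so the
            # operator can troubleshoot it; how best to do that is open.
        except exception.NodeLocked:
            LOG.warning("Failed to release stale lock on node %s", node_uuid)
            continue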
This is potentially a more API-friendly way of cleaning up nodes in bad states -- an operator would need to put the node into maintenance, and once it enters the *FAIL state, troubleshoot why it failed, remove maintenance, and return it to production. I obviously strongly desire an "override command" as an operator, but I really think this could handle a large percentage of the use cases that made me desire it in the first place.

> --devananda
>
>>> Definitely we should improve option 2, but there could be
>>> more issues I don't know about in a more complicated environment.
>>> So my question is: do we still need a new command to recover these
>>> nodes more easily without accessing the DB, like this PoC [2]:
>>> ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf
>>
>> I'm -1 to anything silently removing the lock until I see a clear use
>> case which is impossible to improve within Ironic itself. Such a utility may
>> and will be abused.
>>
>> I'm fine with anything that does not forcibly remove the lock by default.

I agree such a utility could be abused. I don't think that's a good argument for not writing it for operators. However, I agree that any utility we write that could or would modify a lock should not do so by default, and should warn before doing so (a rough sketch of that behaviour is at the end of this mail), but there are cases where getting a lock cleared is desirable and necessary. A good example of this would be an ironic-conductor failing while a node is locked, and being brought up with a different hostname. Today, there's no way to get that lock off that node again. Even if you force operators to replace a conductor with one with an identical hostname, any locked nodes would remain locked during the time this replacement was occurring.

Thanks,
Jay Faulkner

>>> Best Regards,
>>>
>>> Tan
>>>
>>>
>>> [1] https://review.openstack.org/#/c/319812
>>> [2] https://review.openstack.org/#/c/311273/
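As mentioned above, here is a minimal sketch of the "warn first, don't force by default" behaviour such a recovery command could have. This is purely illustrative and untested, not the actual PoC from [2]; the flags mirror the example invocation, the --force flag is an assumption, and the lock-clearing helper is deliberately omitted:

    # Illustrative sketch only of a guarded recovery command.
    import argparse
    import sys


    def main():
        parser = argparse.ArgumentParser(prog='ironic-noderecover')
        parser.add_argument('--node_uuids', required=True,
                            help='Comma-separated list of node UUIDs to recover')
        parser.add_argument('--config-file', default='/etc/ironic/ironic.conf')
        parser.add_argument('--force', action='store_true',
                            help='Actually clear reservations; off by default')
        args = parser.parse_args()

        for uuid in args.node_uuids.split(','):
            if not args.force:
                # Default behaviour: only report what would be done.
                print('Node %s holds a stale reservation; re-run with --force '
                      'to clear it.' % uuid)
                continue
            sys.stdout.write('Clearing reservation for node %s...\n' % uuid)
            # A hypothetical helper that clears the reservation via the DB or
            # API would be called here; it is intentionally left out.


    if __name__ == '__main__':
        main()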