On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>> Hi,
>>
>> Recently I have been working on a spec [1] to recover nodes which get stuck
>> in the deploying state, so I would really appreciate some feedback from you
>> guys.
>>
>> Ironic nodes can be stuck in
>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
>> reserved by a dead conductor (the exclusive lock was not released).
>> Any further requests will be denied by Ironic because it thinks the node
>> resource is under the control of another conductor.
>>
>> To be more clear, let's narrow the scope and focus on the deploying state
>> first. Currently, people have several choices to clear the reserved lock:
>> 1. Restart the dead conductor.
>> 2. Wait up to 2 or 3 minutes for _check_deploying_status() to clear the
>>    lock.
>> 3. Have the operator touch the DB to manually recover these nodes.
>>
>> Option two looks very promising, but it has some weaknesses:
>> 2.1 It won't work if the dead conductor was renamed or deleted.
>> 2.2 It won't work if the node's specific driver is not enabled on any live
>>     conductor.
>> 2.3 It won't work if the node is in maintenance (only a corner case).
>
> We can and should fix all three cases.
2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().
The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers
a list of Nodes reserved by *any* offline conductor and tries to release the
lock. However, it will always fail to update them, because
objects.Node.release() raises a NodeLocked exception when called on a Node
locked by a different conductor. Here's the relevant code path:

ironic/conductor/manager.py:

1259     def _check_deploying_status(self, context):
...
1269         offline_conductors = self.dbapi.get_offline_conductors()
...
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
...
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
...
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue

As far as 2.3 goes, I think we should change the query at the start of this
method so that it also includes nodes in maintenance mode. I think it is both
safe and reasonable (and, frankly, what an operator will expect) that a node
which is in maintenance mode and in the DEPLOYING state, whose conductor is
offline, should have that reservation cleared and be set to the DEPLOYFAILED
state. A rough sketch of that filter change is at the end of this mail.

--devananda

>>
>> Definitely we should improve option 2, but there could be more issues
>> I don't know about in a more complicated environment.
>> So my question is: do we still need a new command to recover these nodes
>> more easily without accessing the DB, like this PoC [2]?
>> ironic-noderecover --node_uuids=UUID1,UUID2
>> --config-file=/etc/ironic/ironic.conf
>
> I'm -1 to anything silently removing the lock until I see a clear use case
> which is impossible to improve within Ironic itself. Such a utility may and
> will be abused.
>
> I'm fine with anything that does not forcibly remove the lock by default.
>
>>
>> Best Regards,
>>
>> Tan
>>
>>
>> [1] https://review.openstack.org/#/c/319812
>> [2] https://review.openstack.org/#/c/311273/
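
P.S. Here is a rough, untested sketch of the 2.3 change I have in mind. It
simply paraphrases the _check_deploying_status() body quoted above with the
'maintenance': False filter dropped; the elided parts of the real method are
assumed unchanged, and the warning message text is made up for illustration --
this is not a patch.

    def _check_deploying_status(self, context):
        offline_conductors = self.dbapi.get_offline_conductors()
        # ... (unchanged) ...
        node_iter = self.iter_nodes(
            fields=['id', 'reservation'],
            # No 'maintenance': False filter here, so nodes in maintenance
            # mode (case 2.3) are picked up as well.
            filters={'provision_state': states.DEPLOYING,
                     'reserved_by_any_of': offline_conductors})
        # ... (unchanged) ...
        for node_uuid, driver, node_id, conductor_hostname in node_iter:
            try:
                objects.Node.release(context, conductor_hostname, node_id)
                # ... then set the node to DEPLOYFAILED, as the existing
                # code does for non-maintenance nodes ...
            except exception.NodeLocked:
                LOG.warning('Node %s is locked by another conductor; '
                            'unable to release its reservation.', node_uuid)
                continue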