Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
On 05/31/2016 10:25 AM, Tan, Lin wrote:
Hi,

Recently, I have been working on a spec [1] to recover nodes which get stuck
in the deploying state, so I would really appreciate some feedback from you.

Ironic nodes can get stuck in
deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the node is
reserved by a dead conductor (the exclusive lock was not released).
Any further requests will be denied by ironic because it thinks the node
resource is under the control of another conductor.

To be clearer, let's narrow the scope and focus on the deploying state
first. Currently, operators have several options to clear the reserved lock:
1. restart the dead conductor
2. wait 2 or 3 minutes until _check_deploying_status() clears the lock.
3. The operator touches the DB to manually recover these nodes.
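
For reference, option 3 usually amounts to clearing the reservation column by
hand, something like the following (a rough sketch assuming the standard nodes
table in the default SQL backend; the exact statement depends on the
deployment):

    UPDATE nodes SET reservation = NULL
     WHERE uuid = '<node-uuid>' AND provision_state = 'deploying';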

Option two looks very promising, but there are some weaknesses:
2.1 It won't work if the dead conductor was renamed or deleted.
2.2 It won't work if the node's specific driver was not enabled on live
conductors.
2.3 It won't work if the node is in maintenance. (only a corner case).
We can and should fix all three cases.
2.1 and 2.2 appear to stem from a bug in the behavior of
_check_deploying_status().

The method claims to do exactly what you suggest in 2.1 and 2.2 -- it gathers a
list of Nodes reserved by *any* offline conductor and tries to release the lock.
However, it will always fail to update them, because objects.Node.release()
raises a NodeLocked exception when called on a Node locked by a different 
conductor.

Here's the relevant code path:

ironic/conductor/manager.py:
1259     def _check_deploying_status(self, context):
...
1269         offline_conductors = self.dbapi.get_offline_conductors()
...
1273         node_iter = self.iter_nodes(
1274             fields=['id', 'reservation'],
1275             filters={'provision_state': states.DEPLOYING,
1276                      'maintenance': False,
1277                      'reserved_by_any_of': offline_conductors})
...
1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
1285             try:
1286                 objects.Node.release(context, conductor_hostname, node_id)
...
1292             except exception.NodeLocked:
1293                 LOG.warning(...)
1297                 continue
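
To illustrate the behavior described above, here is a small standalone sketch
of that release check (illustrative only -- the names mirror Ironic's objects,
but this is not the actual Ironic code):

# Standalone illustration of the release check that produces the
# NodeLocked behaviour described above. Node and NodeLocked here are
# stand-ins, not Ironic's implementations.
class NodeLocked(Exception):
    pass

class Node(object):
    def __init__(self, uuid, reservation=None):
        self.uuid = uuid
        self.reservation = reservation

    def release(self, tag):
        # Only the conductor named in the reservation may clear it.
        if self.reservation is not None and self.reservation != tag:
            raise NodeLocked('node %s is locked by %s'
                             % (self.uuid, self.reservation))
        self.reservation = None

node = Node('1be26c0b-03f2-4d2e-ae87-c02d7f33c123',
            reservation='dead-conductor-1')
# Releasing with a different hostname refuses and raises NodeLocked:
#   node.release('live-conductor-2')
# Releasing with the hostname that holds the reservation succeeds:
node.release('dead-conductor-1')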


As far as 2.3, I think we should change the query string at the start of this
method so that it includes nodes in maintenance mode. I think it's both safe and
reasonable (and, frankly, what an operator will expect) that a node which is in
maintenance mode, and in DEPLOYING state, whose conductor is offline, should
have that reservation cleared and be set to DEPLOYFAILED state.
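
Roughly, that could be as small as dropping the maintenance key from the
filters in _check_deploying_status() -- a sketch of the resulting call,
assuming nothing else relies on that filter:

        node_iter = self.iter_nodes(
            fields=['id', 'reservation'],
            filters={'provision_state': states.DEPLOYING,
                     'reserved_by_any_of': offline_conductors})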

This is an excellent idea -- and I'm going to extend it further. If I have any nodes in a *ING state, and they are put into maintenance, it should force a failure. This is potentially a more API-friendly way of cleaning up nodes in bad states -- an operator would put the node into maintenance, and once it enters the *FAIL state, troubleshoot why it failed, take it out of maintenance, and return it to production.

I obviously strongly desire an "override command" as an operator, but I really think this could handle a large percentage of the use cases that made me desire it in the first place.

--devananda

We should definitely improve option 2, but there could be more issues
I don't know about in a more complicated environment.
So my question is: do we still need a new command to recover these nodes
more easily without touching the DB, like this PoC [2]:
   ironic-noderecover --node_uuids=UUID1,UUID2 --config-file=/etc/ironic/ironic.conf
I'm -1 to anything silently removing the lock until I see a clear use case that
is impossible to address within Ironic itself. Such a utility may and will be
abused.

I'm fine with anything that does not forcibly remove the lock by default.
I agree such a utility could be abused, but I don't think that's a good argument against writing it for operators. Any utility we write that could or would modify a lock should not do so by default, and should warn before doing so; still, there are cases where getting a lock cleared is desirable and necessary.

A good example of this would be an ironic-conductor failing while a node is locked, and being brought up with a different hostname. Today, there's no way to get that lock off that node again.

Even if you force operators to replace a conductor with one that has an identical hostname, any locked nodes would remain locked while that replacement was taking place.
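
To make "warn, don't force by default" concrete, here's a rough sketch of how
such a tool could behave (purely hypothetical -- this is not the PoC in [2],
and every name below is made up for illustration):

# Hypothetical sketch of a safe-by-default lock recovery flow.
# None of these names come from Ironic; they only illustrate the
# behaviour described above: warn first, only clear with --force.
import argparse

def recover_nodes(node_uuids, force=False):
    for uuid in node_uuids:
        holder = 'dead-conductor-1'  # placeholder for a DB lookup
        if not force:
            print('WARNING: node %s is reserved by %s; '
                  're-run with --force to clear the lock' % (uuid, holder))
            continue
        print('Clearing reservation on %s (was held by %s)' % (uuid, holder))
        # ... the actual DB update would happen here ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--node_uuids', required=True,
                        help='comma-separated list of node UUIDs')
    parser.add_argument('--force', action='store_true',
                        help='actually clear the reservations')
    args = parser.parse_args()
    recover_nodes(args.node_uuids.split(','), force=args.force)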

Thanks,
Jay Faulkner
Best Regards,

Tan


[1] https://review.openstack.org/#/c/319812
[2] https://review.openstack.org/#/c/311273/

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

