Hi Jay, Dmitry and all,

I have submitted two patches to try to recover nodes which are stuck in the deploying state:
1. Fix the issue of the ironic-conductor being brought up with a different hostname: https://review.openstack.org/325026
2. Clear the lock of nodes in maintenance states: https://review.openstack.org/#/c/324269/

If the above solutions are promising, then we don't need a new tool to recover nodes in the deploying state.

B.R.

Tan

-----Original Message-----
From: Jay Faulkner [mailto:j...@jvf.cc]
Sent: Thursday, June 2, 2016 7:45 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [ironic] Tooling for recovering nodes

Some comments inline.

On 5/31/16 12:26 PM, Devananda van der Veen wrote:
> On 05/31/2016 01:35 AM, Dmitry Tantsur wrote:
>> On 05/31/2016 10:25 AM, Tan, Lin wrote:
>>> Hi,
>>>
>>> Recently, I have been working on a spec [1] to recover nodes which
>>> get stuck in the deploying state, so I would really appreciate some
>>> feedback from you all.
>>>
>>> Ironic nodes can get stuck in
>>> deploying/deploywait/cleaning/cleanwait/inspecting/deleting if the
>>> node is reserved by a dead conductor (the exclusive lock was not released).
>>> Any further requests will be denied by ironic because it thinks the
>>> node resource is under the control of another conductor.
>>>
>>> To be more clear, let's narrow the scope and focus on the deploying
>>> state first. Currently, people have several choices to clear the
>>> reserved lock:
>>> 1. Restart the dead conductor.
>>> 2. Wait up to 2 or 3 minutes until _check_deploying_states() clears the lock.
>>> 3. The operator touches the DB to manually recover these nodes.
>>>
>>> Option two looks very promising, but there are some weaknesses:
>>> 2.1 It won't work if the dead conductor was renamed or deleted.
>>> 2.2 It won't work if the node's specific driver was not enabled on
>>> live conductors.
>>> 2.3 It won't work if the node is in maintenance. (Only a corner case.)
>>
>> We can and should fix all three cases.
>
> 2.1 and 2.2 appear to be a bug in the behavior of _check_deploying_status().
>
> The method claims to do exactly what you suggest in 2.1 and 2.2 -- it
> gathers a list of Nodes reserved by *any* offline conductor and tries to
> release the lock.
> However, it will always fail to update them, because
> objects.Node.release() raises a NodeLocked exception when called on a Node
> locked by a different conductor.
>
> Here's the relevant code path:
>
> ironic/conductor/manager.py:
> 1259     def _check_deploying_status(self, context):
>          ...
> 1269         offline_conductors = self.dbapi.get_offline_conductors()
>          ...
> 1273         node_iter = self.iter_nodes(
> 1274             fields=['id', 'reservation'],
> 1275             filters={'provision_state': states.DEPLOYING,
> 1276                      'maintenance': False,
> 1277                      'reserved_by_any_of': offline_conductors})
>          ...
> 1281         for node_uuid, driver, node_id, conductor_hostname in node_iter:
> 1285             try:
> 1286                 objects.Node.release(context, conductor_hostname, node_id)
>          ...
> 1292             except exception.NodeLocked:
> 1293                 LOG.warning(...)
> 1297                 continue
>
> As far as 2.3, I think we should change the query string at the start
> of this method so that it includes nodes in maintenance mode. I think
> it's both safe and reasonable (and, frankly, what an operator will
> expect) that a node which is in maintenance mode, and in DEPLOYING
> state, whose conductor is offline, should have that reservation cleared and
> be set to DEPLOYFAILED state.

This is an excellent idea -- and I'm going to extend it further. If I have any nodes in a *ING state, and they are put into maintenance, it should force a failure.
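To make that concrete, here is a rough, untested sketch of what the periodic check could look like with the maintenance filter dropped, based on the excerpt quoted above; the DEPLOYFAILED transition is left as a comment because the exact mechanism is still open for discussion:

    # Sketch only (untested): extend _check_deploying_status() so that nodes
    # reserved by an offline conductor are handled even when in maintenance.
    offline_conductors = self.dbapi.get_offline_conductors()
    node_iter = self.iter_nodes(
        fields=['id', 'reservation'],
        filters={'provision_state': states.DEPLOYING,
                 # no 'maintenance': False filter here, so nodes in
                 # maintenance are included as well
                 'reserved_by_any_of': offline_conductors})
    for node_uuid, driver, node_id, conductor_hostname in node_iter:
        try:
            objects.Node.release(context, conductor_hostname, node_id)
            # A follow-up step would move the node to DEPLOYFAILED so the
            # operator can troubleshoot it; how best to do that is open.
        except exception.NodeLocked:
            LOG.warning("Failed to release stale lock on node %s", node_uuid)
            continue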
This is potentially a more API-friendly way of cleaning up nodes in bad states -- an operator would need to put the node into maintenance, and once it enters the *FAIL state, troubleshoot why it failed, remove maintenance, and return it to production. I obviously strongly desire an "override command" as an operator, but I really think this could handle a large percentage of the use cases that made me desire it in the first place.

> --devananda
>
>>> Definitely we should improve option 2, but there could be
>>> more issues I don't know about in a more complicated environment.
>>> So my question is: do we still need a new command to recover these
>>> nodes more easily without accessing the DB, like this PoC [2]:
>>> ironic-noderecover --node_uuids=UUID1,UUID2
>>> --config-file=/etc/ironic/ironic.conf
>>
>> I'm -1 to anything silently removing the lock until I see a clear use
>> case which is impossible to improve within Ironic itself. Such a utility may
>> and will be abused.
>>
>> I'm fine with anything that does not forcibly remove the lock by default.

I agree such a utility could be abused. I don't think that's a good argument for not writing it for operators. However, I agree that any utility we write that could or would modify a lock should not do so by default, and should warn before doing so (a rough sketch of that behaviour is at the end of this mail), but there are cases where getting a lock cleared is desirable and necessary. A good example of this would be an ironic-conductor failing while a node is locked, and being brought up with a different hostname. Today, there's no way to get that lock off that node again. Even if you force operators to replace a conductor with one with an identical hostname, any locked nodes would remain locked during the time this replacement was occurring.

Thanks,
Jay Faulkner

>>> Best Regards,
>>>
>>> Tan
>>>
>>>
>>> [1] https://review.openstack.org/#/c/319812
>>> [2] https://review.openstack.org/#/c/311273/
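As mentioned above, here is a minimal sketch of the "warn first, don't force by default" behaviour such a recovery command could have. This is purely illustrative and untested, not the actual PoC from [2]; the flags mirror the example invocation, the --force flag is an assumption, and the lock-clearing helper is deliberately omitted:

    # Illustrative sketch only of a guarded recovery command.
    import argparse
    import sys


    def main():
        parser = argparse.ArgumentParser(prog='ironic-noderecover')
        parser.add_argument('--node_uuids', required=True,
                            help='Comma-separated list of node UUIDs to recover')
        parser.add_argument('--config-file', default='/etc/ironic/ironic.conf')
        parser.add_argument('--force', action='store_true',
                            help='Actually clear reservations; off by default')
        args = parser.parse_args()

        for uuid in args.node_uuids.split(','):
            if not args.force:
                # Default behaviour: only report what would be done.
                print('Node %s holds a stale reservation; re-run with --force '
                      'to clear it.' % uuid)
                continue
            sys.stdout.write('Clearing reservation for node %s...\n' % uuid)
            # A hypothetical helper that clears the reservation via the DB or
            # API would be called here; it is intentionally left out.


    if __name__ == '__main__':
        main()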