Public bug reported:

The ironic virt driver does some crazy things when the ironic API goes down - it returns [] from get_available_nodes(). When the resource tracker sees this, it immediately attempts to delete all of the compute node records and resource providers for said nodes.
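
For illustration, here is a minimal sketch of that failure mode. It is not nova code: FakeDriver, FakeTracker, IronicUnavailable and run_outage are hypothetical names made up for this example, and the tracker's handling of the exception is an assumption about how a caller could react, not a description of current behaviour.

```python
class IronicUnavailable(Exception):
    """Hypothetical error standing in for a failed ironic API call."""


class FakeDriver:
    """Toy stand-in for a virt driver; not the real nova ironic driver."""

    def __init__(self, nodes, swallow_errors):
        self.nodes = list(nodes)
        self.api_up = True
        self.swallow_errors = swallow_errors

    def get_available_nodes(self):
        if self.api_up:
            return list(self.nodes)
        if self.swallow_errors:
            # Behaviour described in this report: hide the outage and
            # claim that there are no nodes at all.
            return []
        # Alternative: surface the outage to the caller.
        raise IronicUnavailable("ironic-api is not reachable")


class FakeTracker:
    """Toy stand-in for the resource tracker's periodic update."""

    def __init__(self, driver):
        self.driver = driver
        self.compute_nodes = set()

    def update_available_resource(self):
        try:
            nodes = set(self.driver.get_available_nodes())
        except IronicUnavailable:
            # Assumed handling: node list is unknown, so keep the
            # existing records untouched and retry on the next run.
            return
        # Any record missing from the driver's list is treated as
        # orphaned and dropped.
        self.compute_nodes = nodes


def run_outage(swallow_errors):
    driver = FakeDriver(["node-1", "node-2"], swallow_errors)
    tracker = FakeTracker(driver)
    tracker.update_available_resource()  # normal run, records created
    driver.api_up = False                # ironic-api goes down
    tracker.update_available_resource()  # periodic task during the outage
    return sorted(tracker.compute_nodes)


print(run_outage(swallow_errors=True))   # []
print(run_outage(swallow_errors=False))  # ['node-1', 'node-2']
```

With the empty-list behaviour the toy tracker cannot tell "ironic is down" apart from "every node was removed", so it drops all records; with an exception it can at least skip the update.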

If placement is also down at this time, the resource providers will not be properly deleted. When ironic-api and placement-api return, nova will see nodes, create compute_node records for them, and try to create new resource providers (as they are new compute_node records). This will fail with a name conflict, and the nodes will be unusable.

This is easy to fix by raising an exception in get_available_nodes(), instead of lying to the resource tracker and returning []. However, this causes nova-compute to fail to start if ironic-api is not available.

This may be fine, but it should have a larger discussion. We've added these hacks over the years for some reason; we should look at the bigger picture and decide how we want to handle these cases.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1750450

Title:
  ironic: n-cpu fails to recover after losing connection to ironic-api and placement-api

Status in OpenStack Compute (nova):
  New