[Yahoo-eng-team] [Bug 1654102] Re: Ironic: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' - during select_destinations()

OpenStack Infra Fri, 06 Jan 2017 16:57:07 -0800

Reviewed:  https://review.openstack.org/416822
Committed: 
https://git.openstack.org/cgit/openstack/nova/commit/?id=2f1245a56c238646a056a20caa543f2a254b83d7
Submitter: Jenkins
Branch:    master


commit 2f1245a56c238646a056a20caa543f2a254b83d7
Author: Matt Riedemann <mrie...@us.ibm.com>
Date:   Wed Jan 4 22:06:23 2017 -0500

    Fix TypeError in _update_from_compute_node race
    
    A compute node record is created in the database from the
    _init_compute_node method in the resource tracker in the
    nova-compute service. It is created without the updated_at
    or free_disk_gb fields being set.
    
    Shortly after it's created, update_resource_stats() is called
    on the scheduler report client which calls compute_node.save()
    which calls the update_compute_node() method in the DB API which
    *always* updates the updated_at field even if nothing else
    was changed.
    
    So at this point, we have a compute node with updated_at set
    but not free_disk_gb. The free_disk_gb field gets set by the
    _update_usage_from_instances() method in the resource tracker
    and then that value is later saved off when the compute node
    is updated in the database on a second call to the
    update_resource_stats() method in the scheduler report client.
    At that point the free_disk_gb field is set on the compute
    node record.
    
    There is a race in between the compute node create and initial
    update but before free_disk_gb is set where the scheduler host
    manager can be attempting to update the HostState object from
    the same compute node when selecting host destinations during
    a server build. If that picks up the compute node before its
    free_disk_gb field is set it will result in a TypeError when
    trying to multiply None * 1024.
    
    Change 36a0ba9c8141b445f2c6bfc093fde4cc98d229b2 was an earlier
    attempt at fixing this bug and shortened the race window but
    didn't handle the problem where updated_at is set but free_disk_gb
    is not yet.
    
    This change builds on that by simply checking for the thing
    the scheduler host manager actually cares about, which is the
    free_disk_gb field.
    
    Closes-Bug: #1654102
    Closes-Bug: #1610679
    
    Change-Id: I37b75dabb3ea7ec2d5678550d9aff30b1a0c29e6


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1654102

Title:
  Ironic: TypeError: unsupported operand type(s) for *: 'NoneType' and
  'int' - during select_destinations()

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) newton series:
  Confirmed

Bug description:
  This looks like a duplicate of bug 1610679 but that's for the non-
  Ironic scheduler host manager, this is for the Ironic host manager.

  Seen here:

  http://logs.openstack.org/20/407220/2/check/gate-tempest-dsvm-ironic-
  ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-
  nv/0779f94/logs/screen-n-sch.txt.gz#_2017-01-04_19_05_16_398

  2017-01-04 19:05:16.398 20709 DEBUG oslo_concurrency.lockutils 
[req-6ce318a2-53e5-4b80-84a5-f896ab48d627 
tempest-BaremetalMultitenancy-1820943769 
tempest-BaremetalMultitenancy-1820943769] Lock 
"(u'ubuntu-xenial-2-node-rax-ord-6478210', 
u'21da4933-a128-45f6-a765-7e6bc071e0f3')" released by 
"nova.scheduler.host_manager._locked_update" :: held 0.002s inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
  2017-01-04 19:05:16.398 20709 DEBUG oslo_concurrency.lockutils 
[req-6ce318a2-53e5-4b80-84a5-f896ab48d627 
tempest-BaremetalMultitenancy-1820943769 
tempest-BaremetalMultitenancy-1820943769] Lock 
"(u'ubuntu-xenial-2-node-rax-ord-6478210-359401', 
u'00d62acd-6f3b-4cb8-9668-12517c84b3b9')" acquired by 
"nova.scheduler.host_manager._locked_update" :: waited 0.000s inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270
  2017-01-04 19:05:16.398 20709 DEBUG nova.scheduler.host_manager 
[req-6ce318a2-53e5-4b80-84a5-f896ab48d627 
tempest-BaremetalMultitenancy-1820943769 
tempest-BaremetalMultitenancy-1820943769] Update host state from compute node: 
ComputeNode(cpu_allocation_ratio=16.0,cpu_info='',created_at=2017-01-04T19:02:54Z,current_workload=None,deleted=False,deleted_at=None,disk_allocation_ratio=1.0,disk_available_least=10,free_disk_gb=None,free_ram_mb=None,host='ubuntu-xenial-2-node-rax-ord-6478210-359401',host_ip=10.210.37.31,hypervisor_hostname='00d62acd-6f3b-4cb8-9668-12517c84b3b9',hypervisor_type='ironic',hypervisor_version=1,id=8,local_gb=10,local_gb_used=0,memory_mb=384,memory_mb_used=0,metrics=None,numa_topology=None,pci_device_pools=PciDevicePoolList,ram_allocation_ratio=1.0,running_vms=None,service_id=None,stats={cpu_arch='x86_64'},supported_hv_specs=[HVSpec],updated_at=2017-01-04T19:04:51Z,uuid=521cf775-5a16-4111-bf26-c30fb6725716,vcpus=1,vcpus_used=0)
 _locked_update /opt/stack/new/nova/
 nova/scheduler/host_manager.py:168
  2017-01-04 19:05:16.400 20709 DEBUG oslo_concurrency.lockutils 
[req-6ce318a2-53e5-4b80-84a5-f896ab48d627 
tempest-BaremetalMultitenancy-1820943769 
tempest-BaremetalMultitenancy-1820943769] Lock 
"(u'ubuntu-xenial-2-node-rax-ord-6478210-359401', 
u'00d62acd-6f3b-4cb8-9668-12517c84b3b9')" released by 
"nova.scheduler.host_manager._locked_update" :: held 0.002s inner 
/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:282
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server 
[req-6ce318a2-53e5-4b80-84a5-f896ab48d627 
tempest-BaremetalMultitenancy-1820943769 
tempest-BaremetalMultitenancy-1820943769] Exception during message handling
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server Traceback (most 
recent call last):
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 
155, in _process_incoming
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     res = 
self.dispatcher.dispatch(message)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 
222, in dispatch
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     return 
self._do_dispatch(endpoint, method, ctxt, args)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 
192, in _do_dispatch
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     result = 
func(ctxt, **new_args)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 
218, in inner
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     return 
func(*args, **kwargs)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/manager.py", line 84, in select_destinations
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     dests = 
self.driver.select_destinations(ctxt, spec_obj)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 51, in 
select_destinations
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     
selected_hosts = self._schedule(context, spec_obj)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 96, in _schedule
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     hosts = 
self._get_all_host_states(elevated)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/filter_scheduler.py", line 134, in 
_get_all_host_states
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     return 
self.host_manager.get_all_host_states(context)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/host_manager.py", line 610, in 
get_all_host_states
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     
self._get_instance_info(context, compute))
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/host_manager.py", line 180, in update
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     return 
_locked_update(self, compute, service, aggregates, inst_dict)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 
271, in inner
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     return 
f(*args, **kwargs)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/host_manager.py", line 169, in 
_locked_update
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     
self._update_from_compute_node(compute)
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server   File 
"/opt/stack/new/nova/nova/scheduler/ironic_host_manager.py", line 46, in 
_update_from_compute_node
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server     
self.free_disk_mb = compute.free_disk_gb * 1024
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server TypeError: 
unsupported operand type(s) for *: 'NoneType' and 'int'
  2017-01-04 19:05:16.400 20709 ERROR oslo_messaging.rpc.server

  Right before the scheduler failure I see this in the n-cpu logs:

  http://logs.openstack.org/20/407220/2/check/gate-tempest-dsvm-ironic-
  ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-
  nv/0779f94/logs/screen-n-cpu.txt.gz#_2017-01-04_19_03_55_573

  2017-01-04 19:03:55.573 28458 INFO nova.compute.manager [req-a0b07a5e-
  2c5a-4550-a0b5-6d1fe2b1d9e0 - -] Deleting orphan compute node 1
  hypervisor host is 00d62acd-6f3b-4cb8-9668-12517c84b3b9, nodes are
  set([u'74396195-3bd0-4bc3-9f51-2f9d9ff32616',
  u'dd8dcc23-04f4-4718-a079-e1b77d70bad9', u'36827858-740c-4f1d-
  a0e9-17339f441338', u'86939305-5678-4796-a250-76295faf1890',
  u'0f813337-25e3-4c12-a2b0-cf2222f3cc7f',
  u'21da4933-a128-45f6-a765-7e6bc071e0f3'])

  I'm not really sure what's going on with the compute node record
  though, if it's getting deleted or what, because once the resource
  tracker creates the compute node record it should be setting
  free_disk_gb shortly thereafter and then updating the record in the
  database. There is definitely a window of time between when the
  compute node record is first created and when it's updated, but that
  doesn't appear to be the case here since the updated_at field is set:

  updated_at=2017-01-04T19:04:51Z

  But the failure is about 25 seconds after that.

  This also seems to be happening quite a bit since January 2nd 2017:

  
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_update_from_compute_node%5C%22%20AND%20message%3A%5C%22TypeError%3A%20unsupported%20operand%20type(s)%20for%20*%3A%20'NoneType'%20and%20'int'%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1654102/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1654102] Re: Ironic: TypeError: unsupported operand type(s) for *: 'NoneType' and 'int' - during select_destinations()

Reply via email to