** Changed in: nova
       Status: Fix Committed => Fix Released

** Changed in: nova
    Milestone: None => kilo-2

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1402574

Title:
  No fault-tolerance in nova-scheduler

Status in OpenStack Compute (Nova):
  Fix Released

Bug description:
  If a nova-scheduler service dies while processing a request (see
  below for how to reproduce this), the message is not rescheduled to
  another scheduler in an HA setup. Oslo messaging raises a timeout in
  the conductor:

  2014-12-11 07:49:53.565 ERROR nova.scheduler.driver [req-f866a584-ba67-42a8-aec7-5500b631708e admin admin] Exception during scheduler.run_instance
  Traceback (most recent call last):
    File "/opt/stack/nova/nova/conductor/manager.py", line 640, in build_instances
      request_spec, filter_properties)
    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 49, in select_destinations
      context, request_spec, filter_properties)
    File "/opt/stack/nova/nova/scheduler/client/__init__.py", line 35, in __run_method
      return getattr(self.instance, __name)(*args, **kwargs)
    File "/opt/stack/nova/nova/scheduler/client/query.py", line 34, in select_destinations
      context, request_spec, filter_properties)
    File "/opt/stack/nova/nova/scheduler/rpcapi.py", line 118, in select_destinations
      request_spec=request_spec, filter_properties=filter_properties)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
      retry=self.retry)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
      timeout=timeout, retry=retry)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 436, in send
      retry=retry)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 425, in _send
      result = self._waiter.wait(msg_id, timeout)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 315, in wait
      reply, ending = self._poll_connection(msg_id, timer)
    File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 264, in _poll_connection
      % msg_id)
  MessagingTimeout: Timed out waiting for a reply to message ID aec640c6da0f4cf383b5100ba2441331

  The proper behavior would be to retry at least once, even in a
  single-machine setup: the message will then be picked up by another
  server, or by the same one once it restarts.

  The Oslo messaging architecture does not allow the AMQP server to
  handle this, so message rescheduling has to be implemented in Nova,
  in the application logic.

  To reproduce the error, I added ipdb.set_trace() in
  nova/scheduler/filter_scheduler.py:287 before returning
  selected_hosts in the _schedule method.