Public bug reported:

tl;dr: The live migration monitoring thread writes migration stats to the database every 5 seconds. If the live migration runs for a sufficiently long time, the message queue can become overwhelmed, culminating in a MessagingTimeout raised from the RPC call.
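For context, the monitoring thread polls migration progress on a short tick and persists stats only on every Nth poll, which is how the ~5 second write cadence arises. The following is a minimal sketch of that throttling pattern; the function and callback names are illustrative, not nova's actual code, and the default intervals are assumptions based on the 5-second cadence described above:

```python
import time


def monitor_migration(get_progress, save_stats, is_running,
                      poll_interval=0.5, save_every_n_polls=10):
    """Poll migration progress and periodically persist stats (sketch).

    get_progress:       returns current progress info (hypothetical callable)
    save_stats:         writes stats to the database via conductor RPC
                        (hypothetical callable standing in for migration.save())
    is_running:         returns False once the migration has finished
    poll_interval:      seconds between progress checks (illustrative: 0.5s)
    save_every_n_polls: persist once per N polls (illustrative: 10 -> ~5s)
    """
    n = 0
    while is_running():
        info = get_progress()
        if n % save_every_n_polls == 0:
            # This is the step that triggers an RPC round trip to the
            # conductor (and a DB write) roughly every 5 seconds -- the
            # traffic that can pile up on the message queue over a long
            # migration.
            save_stats(info)
        n += 1
        time.sleep(poll_interval)
```

With a 0.5s poll and a save every 10th poll, a migration that runs for an hour issues on the order of 720 RPC calls from this loop alone.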
This issue was discovered during investigation of a downstream bug [1] where a live migration was dirtying memory faster than it could be transferred, and the live migration failed in nova-compute with a MessagingTimeout. For example:

  Live migration failed.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569
  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 425, in get
      return self._queues[msg_id].get(block=True, timeout=timeout)
    File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
      return waiter.wait()
    File "/usr/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
      return get_hub().switch()
    File "/usr/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
      return self.greenlet.switch()
  queue.Empty

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7466, in _do_live_migration
      block_migration, migrate_data)
    File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8858, in live_migration
      migrate_data)
    File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9382, in _live_migration
      finish_event, disk_paths)
    File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 9256, in _live_migration_monitor
      info, remaining)
    File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/migration.py", line 557, in save_stats
      migration.save()
    File "/usr/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
      ctxt, self, fn.__name__, args, kwargs)
    File "/usr/lib/python3.6/site-packages/nova/conductor/rpcapi.py", line 246, in object_action
      objmethod=objmethod, args=args, kwargs=kwargs)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
      transport_options=self.transport_options)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
      transport_options=transport_options)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 674, in send
      transport_options=transport_options)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 662, in _send
      call_monitor_timeout)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 551, in wait
      message = self.waiters.get(msg_id, timeout=timeout)
    File "/usr/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 429, in get
      'to message ID %s' % msg_id)
  oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 389a868aba184d1a9956337bca78a569

The bug reporter was able to reproduce the problem in a lab environment. After we tracked things down to the live migration monitoring thread's database writes as the likely culprit, he ran some tests [2][3] that made the database write interval less frequent (once every 30 seconds, for example), and the MessagingTimeout stopped occurring.

Note that the live migration still had other issues, because it was dirtying memory faster than the transfer and did not have live_migration_timeout_action set to force_complete. But with the slower database write interval there were no MessagingTimeout errors from the message queue, and the migration failed as expected because the live migration itself timed out, unable to get ahead of the memory dirtying.

My thought about how to address this is to make the interval for the database writes in the live migration monitoring thread configurable, so that operators can control and tune it to their deployment environments.
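One way to realize that proposal is to throttle the saves by wall-clock time against an operator-tunable value rather than a hard-coded poll count. The sketch below shows the throttling logic only; the 30-second value mirrors the interval tested in [2][3], and in nova the interval would come from a new oslo.config option rather than a constant (the class and option names here are hypothetical):

```python
import time


class StatsSaver:
    """Throttle migration-stat DB writes to at most one per interval.

    Hypothetical helper: in nova the interval would be read from a new
    config option (name to be decided), and save_fn would be the
    migration.save() call that goes over conductor RPC.
    """

    def __init__(self, save_fn, interval, clock=time.monotonic):
        self._save_fn = save_fn      # e.g. lambda info: migration.save()
        self._interval = interval    # operator-tunable seconds between writes
        self._clock = clock          # injectable for testing
        self._last_save = None

    def maybe_save(self, info):
        """Persist stats if the interval has elapsed; return True if saved."""
        now = self._clock()
        if self._last_save is None or now - self._last_save >= self._interval:
            self._save_fn(info)
            self._last_save = now
            return True
        return False
```

The monitoring loop would then call maybe_save() on every poll and let the saver decide when a write actually goes out, so the poll frequency (which drives progress checks and timeout logic) stays decoupled from the DB write frequency.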
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c23
[3] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c24

** Affects: nova
   Importance: Undecided
       Status: New

** Tags: compute libvirt live-migration

https://bugs.launchpad.net/bugs/2091138

Title: Rapid live migration monitor database writes can lead to MessagingTimeout

Status in OpenStack Compute (nova): New