Public bug reported:

Noticed this during a ceph job tempest run:
http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/console.html#_2016-11-30_10_23_30_682154

2016-11-30 10:23:30.682154 | tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server[id-77eba8e0-036e-4635-944b-f7a8f3b78dc9]
2016-11-30 10:23:30.682250 | ------------------------------------------------------------------------------------------------------------------------------------------
2016-11-30 10:23:30.682275 |
2016-11-30 10:23:30.682306 | Captured traceback-1:
2016-11-30 10:23:30.682348 | ~~~~~~~~~~~~~~~~~~~~~
2016-11-30 10:23:30.682397 | Traceback (most recent call last):
2016-11-30 10:23:30.682467 |   File "tempest/api/compute/servers/test_server_actions.py", line 58, in tearDown
2016-11-30 10:23:30.682522 |     self.server_check_teardown()
2016-11-30 10:23:30.682580 |   File "tempest/api/compute/base.py", line 166, in server_check_teardown
2016-11-30 10:23:30.682615 |     cls.server_id, 'ACTIVE')
2016-11-30 10:23:30.682697 |   File "tempest/common/waiters.py", line 95, in wait_for_server_status
2016-11-30 10:23:30.682751 |     raise lib_exc.TimeoutException(message)
2016-11-30 10:23:30.682802 | tempest.lib.exceptions.TimeoutException: Request timed out
2016-11-30 10:23:30.682941 | Details: (ServerActionsTestJSON:tearDown) Server 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 failed to reach ACTIVE status and task state "None" within the required time (196 s). Current status: SHELVED_OFFLOADED. Current task state: spawning.

Tracing the lifecycle of the instance through the n-cpu logs, there is a race:

1. shelve

http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_28_926

2016-11-30 09:51:28.926 11001 INFO nova.compute.manager [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Shelving

2. stopped

http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_43_938

2016-11-30 09:51:43.938 11001 INFO nova.virt.libvirt.driver [req-0d4c19ae-ca59-48e5-a9b7-aa19a1cbec4e tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Instance shutdown successfully after 13 seconds.

3. unshelve

http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_53_803

2016-11-30 09:51:53.803 11001 INFO nova.compute.manager [req-676d94d0-c8ec-495b-be9b-eb9c0ee29a3d tempest-ServerActionsTestJSON-107143809 tempest-ServerActionsTestJSON-107143809] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] Unshelving

4. stop

http://logs.openstack.org/08/404508/3/check/gate-tempest-dsvm-full-devstack-plugin-ceph-ubuntu-xenial/f2ef8c2/logs/screen-n-cpu.txt.gz#_2016-11-30_09_51_58_921

2016-11-30 09:51:58.921 11001 DEBUG nova.virt.driver [-] Emitting event <LifecycleEvent: 1480499503.92, 409d71a2-e0b3-4bdf-b0f9-8e23fee18550 => Stopped> emit_event /opt/stack/new/nova/nova/virt/driver.py:1440
2016-11-30 09:51:58.921 11001 INFO nova.compute.manager [-] [instance: 409d71a2-e0b3-4bdf-b0f9-8e23fee18550] VM Stopped (Lifecycle Event)

^ is 15 seconds after the instance was stopped during shelve, which is:

https://github.com/openstack/nova/blob/14.0.0/nova/virt/libvirt/host.py#L104
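That host.py link points at the delay the libvirt driver applies to STOPPED lifecycle events, so that a stop immediately followed by a start (i.e. a reboot) is not reported to the compute manager. A minimal sketch of that mechanism, assuming simplified names and using threading.Timer in place of Nova's eventlet greenthreads (the real logic is Host._event_emit_delayed in nova/virt/libvirt/host.py):

    import threading
    from collections import namedtuple

    # Hypothetical stand-ins for illustration; not Nova's actual classes.
    LifecycleEvent = namedtuple('LifecycleEvent', ['uuid', 'transition'])

    LIFECYCLE_DELAY = 15  # seconds; the delay referenced by the host.py link

    class DelayedEventEmitter(object):
        """Sketch of how the libvirt driver holds back STOPPED events."""

        def __init__(self, emit_event):
            self._emit_event = emit_event  # hands the event to the compute manager
            self._events_delayed = {}      # instance uuid -> pending timer

        def event_emit_delayed(self, event):
            # A newer event for the same instance cancels any pending
            # delayed STOPPED event (e.g. the STARTED half of a reboot).
            pending = self._events_delayed.pop(event.uuid, None)
            if pending is not None:
                pending.cancel()

            if event.transition == 'stopped':
                # Hold the STOPPED event back for 15 seconds in case a
                # STARTED event follows. This is the delay that makes the
                # stale event from step 2 fire in step 4, after unshelve.
                timer = threading.Timer(LIFECYCLE_DELAY,
                                        self._fire_delayed, args=(event,))
                self._events_delayed[event.uuid] = timer
                timer.start()
            else:
                self._emit_event(event)

        def _fire_delayed(self, event):
            self._events_delayed.pop(event.uuid, None)
            self._emit_event(event)

In this trace the unshelve completed inside that 15 second window, so when the timer fired the compute manager received a STOPPED event for an instance that was already running again.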
This is probably only a problem in the ceph job because that job uses the rbd shallow clone feature, which makes the snapshot taken during shelve offload super fast. It is so fast that by the time the initial lifecycle event from the shelve/stop operation in #2 is processed, we have already unshelved the instance and it is ACTIVE with no task_state set. The compute manager therefore doesn't think the instance is going through any task, and it stops the instance because the libvirt driver says it should be stopped (roughly the check in the sketch at the end of this report).

** Affects: nova
     Importance: Low
         Status: Triaged

** Tags: ceph gate race shelve

** Summary changed:

- unshelved offloaded instance is unexpectedly stopped
+ unshelved offloaded instance is unexpectedly stopped during ceph job

https://bugs.launchpad.net/bugs/1646212
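For reference, the decision that actually stops the instance is the compute manager's power-state sync mentioned above. A minimal sketch of that check, with hypothetical simplified names and states (the real logic lives in ComputeManager.handle_lifecycle_event and _sync_instance_power_state in nova/compute/manager.py):

    # Hypothetical simplified names and states for illustration.
    VM_STATE_ACTIVE = 'active'
    POWER_STATE_SHUTDOWN = 'shutdown'

    def sync_instance_power_state(instance, event_power_state, stop_api):
        """Sketch of the compute manager's reaction to a lifecycle event.

        instance:          object with vm_state and task_state attributes
        event_power_state: power state implied by the (possibly stale) event
        stop_api:          callable standing in for the compute stop API
        """
        if instance.task_state is not None:
            # An operation (shelve, unshelve, reboot, ...) is in flight,
            # so the event is ignored until the task completes.
            return

        if (instance.vm_state == VM_STATE_ACTIVE
                and event_power_state == POWER_STATE_SHUTDOWN):
            # The hypervisor reported the instance as stopped while the
            # database says ACTIVE with no task running, so Nova assumes
            # the guest shut itself down and calls the stop API. With the
            # stale event above this fires right after unshelve finishes,
            # which is the unexpected stop in this bug.
            stop_api(instance)

The race window is exactly the gap between unshelve clearing the task_state and the delayed STOPPED event arriving.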