I'm happy the main root cause (the deletion of the source disks) is fixed. To be clear, you can configure Nova to resume guests' state when the compute service restarts with the flag documented here:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot
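For reference, enabling it is a one-line change in nova.conf (the option lives in the [DEFAULT] section and defaults to false):

    [DEFAULT]
    # Start guests that were running on this host before the
    # compute service restarted (e.g. after a host reboot).
    resume_guests_state_on_host_boot = true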
Closing the bug.

** Changed in: nova
       Status: New => Won't Fix

https://bugs.launchpad.net/bugs/1738297

Title:
  Nova Destroys Local Disks for Instance with Broken iSCSI Connection
  to Cinder Volume Upon Resume from Suspend

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Background: Libvirt + KVM cloud running Newton (the relevant code
  appears unchanged on master). Earlier this week we had some issues
  with a Cinder storage server (it uses LVM + iSCSI). The tgt service
  was consuming 100% CPU (after running for several months) and the
  compute nodes lost their iSCSI connections. I had to restart tgt, the
  cinder-volume service, and a number of compute hosts and instances.

  Today, a user tried resuming their instance, which had been suspended
  before the aforementioned trouble. (Note: this instance has root and
  ephemeral disks stored locally, plus a third disk on shared Cinder
  storage.) Per the logs linked below, the iSCSI connection from the
  compute host to the Cinder storage server was broken/missing, and
  because of this, Nova apparently "cleaned up" the instance, including
  *destroying its disk files*. The instance is now in an error state.

  nova-compute.log: http://paste.openstack.org/show/628991/
  /var/log/syslog: http://paste.openstack.org/show/628992/

  Based on the log messages ("Deleting instance files" and "Deletion of
  /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del
  complete"), it appears that we ended up in the function
  `delete_instance_files`:
  https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801

  A traceback wasn't logged for this, but I'm guessing we got there via
  the `cleanup` function:
  https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032

  One of `cleanup`'s parameters, `destroy_disks`, defaults to True, so
  I'm guessing it was called with defaults, or the caller did not
  override it. (Someone, please correct me if the available data
  suggest otherwise!) Nobody requested a delete action, so this appears
  to be Nova deciding to destroy an instance's local disks after
  encountering an otherwise-unhandled exception related to the iSCSI
  device being unavailable. I will try to reproduce and will update the
  bug if successful.

  For us, losing an instance's data is a problem: our users
  (scientists) often store unique data on instances that are configured
  by hand. If an instance cannot be resumed, I would much rather Nova
  leave the instance's disks intact for investigation and data recovery
  instead of throwing everything out. For deployments whose instances
  may contain important data, could this behavior be made configurable?
  Perhaps "destroy_disks_on_failed_resume = False" in nova.conf? (A
  rough sketch of such a guard appears at the end of this message.)

  Thank you!

  Chris Martin

  (P.S. This is actually a Cinder question, but someone here may know:
  is there something that can or should be done to re-initialize iSCSI
  connections between compute nodes and a Cinder storage server after a
  recovered failure of the iSCSI target service on the storage server?)
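To make the proposal above concrete, here is a rough, self-contained
sketch (not Nova's actual code) of how such a guard could gate the
destructive path in the libvirt driver's cleanup. The option
destroy_disks_on_failed_resume and the resume_failed flag are
hypothetical; delete_instance_files and the overall shape of cleanup
are paraphrased from the driver code linked above.

    # Illustrative sketch only -- not Nova's actual code. The option
    # 'destroy_disks_on_failed_resume' and the 'resume_failed' flag are
    # hypothetical; delete_instance_files() and the general shape of
    # cleanup() are paraphrased from the linked libvirt driver code.
    import logging

    LOG = logging.getLogger(__name__)

    class FakeConf:
        # Proposed operator knob; preserving disks is the safe default.
        destroy_disks_on_failed_resume = False

    CONF = FakeConf()

    class LibvirtDriverSketch:
        def cleanup(self, instance, destroy_disks=True,
                    resume_failed=False):
            self._destroy_domain(instance)  # tear down the guest
            if destroy_disks:
                if (resume_failed
                        and not CONF.destroy_disks_on_failed_resume):
                    # Resume hit an unhandled error (e.g. the iSCSI
                    # target is unreachable): keep the local disks for
                    # investigation and data recovery instead of
                    # deleting them.
                    LOG.warning('Preserving local disks of instance %s '
                                'after a failed resume', instance['uuid'])
                    return
                self.delete_instance_files(instance)  # destructive path

        def _destroy_domain(self, instance):
            pass  # stub: stop the libvirt domain

        def delete_instance_files(self, instance):
            pass  # stub: remove /var/lib/nova/instances/<uuid>

With something like this in place, the resume error handler would call
cleanup(..., resume_failed=True), and operators who prefer today's
behavior could simply set the option to true in nova.conf.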