I'm happy that the main root cause (the deletion of the source disks) is fixed.

To be clear, you can configure Nova to resume guest state on compute
service restarts with the flag resume_guests_state_on_host_boot:
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.resume_guests_state_on_host_boot
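
For example, a minimal nova.conf snippet on the compute nodes (the
option lives in the [DEFAULT] section per the docs linked above and
defaults to false):

    [DEFAULT]
    resume_guests_state_on_host_boot = true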

Closing the bug.


** Changed in: nova
       Status: New => Won't Fix

-- 
https://bugs.launchpad.net/bugs/1738297

Title:
  Nova Destroys Local Disks for Instance with Broken iSCSI Connection to
  Cinder Volume Upon Resume from Suspend

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Background: Libvirt + KVM cloud running Newton (but the relevant code
  appears the same on master). Earlier this week we had some issues with
  a Cinder storage server (it uses LVM + iSCSI). The tgt service was
  consuming 100% CPU (after running for several months) and the compute
  nodes lost their iSCSI connections. I had to restart tgt, the
  cinder-volume service, and a number of compute hosts and instances.

  Today, a user tried resuming their instance, which had been suspended
  before the aforementioned trouble. (Note: this instance has its root
  and ephemeral disks stored locally, and a third disk on shared Cinder
  storage.) It appears (per the logs linked below) that the iSCSI
  connection from the compute host to the Cinder storage server was
  broken/missing, and because of this, Nova apparently "cleaned up" the
  instance, including *destroying its disk files*. The instance is now
  in an error state.

  nova-compute.log: http://paste.openstack.org/show/628991/
  /var/log/syslog: http://paste.openstack.org/show/628992/

  We're still running Newton, but the code appears the same on master.
  Based on the log messages ("Deleting instance files" and "Deletion of
  /var/lib/nova/instances/68058b22-e17f-42f7-80ff-aeb06cbc82cb_del
  complete"), it appears that we ended up in this function,
  `delete_instance_files`:
  https://github.com/openstack/nova/blob/stable/newton/nova/virt/libvirt/driver.py#L7745-L7801
  A trace wasn't logged for this, but I'm guessing we got here from the
  `cleanup` function:
  https://github.com/openstack/nova/blob/a0e4f627f0be48db65c23f4f180d4bc6dd68cc83/nova/virt/libvirt/driver.py#L933-L1032
  One of `cleanup`'s arguments is `destroy_disks=True`, so I'm guessing
  it was called with the default, i.e. the argument was not overridden.
  (Someone, please correct me if the available data suggest otherwise!)
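
  To illustrate the path I suspect, here is a heavily simplified,
  paraphrased sketch (not the actual Nova code -- see the links above
  for the real implementations; names and arguments are abbreviated):

      # Rough sketch of the suspected code path, not real Nova source.
      def cleanup(context, instance, network_info,
                  block_device_info=None, destroy_disks=True):
          # ... disconnect volumes, unplug VIFs, etc. (elided) ...
          if destroy_disks:
              # With destroy_disks left at its default of True, the
              # instance's local files under
              # /var/lib/nova/instances/<uuid> are removed.
              delete_instance_files(instance)

      def delete_instance_files(instance):
          # Renames the instance directory to <uuid>_del and recursively
          # deletes it, logging "Deleting instance files" and
          # "Deletion of ... complete" -- the messages we saw.
          ...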

  Nobody requested a Delete action, so this appears to be Nova deciding
  to destroy an instance's local disks after encountering an otherwise-
  unhandled exception related to the iSCSI device being unavailable. I
  will try to reproduce and update the bug if successful.

  For us, losing an instance's data is a Problem -- our users
  (scientists) often store unique data on instances that are configured
  by hand. If an instance cannot be resumed, I would much rather Nova
  leave the instance's disks intact for investigation / data recovery,
  instead of throwing everything out. For deployments whose instances
  may contain important data, could this behavior be made configurable?
  Perhaps "destroy_disks_on_failed_resume = False" in nova.conf?

  Thank you!

  Chris Martin

  (P.S. This is really a Cinder question, but someone here may know: is
  there anything that can or should be done to re-initialize the iSCSI
  connections between compute nodes and a Cinder storage server after
  recovering from a failure of the iSCSI target service on the storage
  server?)

