Dear colleagues,

Has anyone upgraded to 4.11.3 yet? This version includes a patch that should help avoid this problem: https://github.com/apache/cloudstack/pull/3194. It would be great to know whether it has helped you.
Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup
> (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot).
> It allows a snapshot to be created at one moment and backed up at another. It
> would be amazing if it also offered the snapshot deletion procedure (I mean
> deletion from the primary storage) as a separate operation. Then I could check
> that I/O activity is low before _deleting_ a snapshot from the primary
> storage, not only before _creating_ it; that could be a nice workaround.
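> Roughly what I have in mind is something like the following (just a rough
> sketch, assuming the sysstat package is installed; VOLUME_DEV and THRESHOLD
> are made-up names for the block device backing the primary storage and for
> the utilisation limit):
>
> VOLUME_DEV=sdb   # hypothetical: device backing the primary storage on this host
> THRESHOLD=10     # hypothetical: the %util value we are willing to tolerate
>
> # wait until the device has been mostly idle for a 5-second sample
> while true; do
>     UTIL=$(iostat -dx "$VOLUME_DEV" 5 2 | awk -v dev="$VOLUME_DEV" '$1 == dev {util=$NF} END {print int(util)}')
>     [ "$UTIL" -lt "$THRESHOLD" ] && break
>     echo "$VOLUME_DEV is at ${UTIL}% utilisation, waiting..."
>     sleep 30
> done
>
> # ...and only at this point trigger the deletion of the snapshot from the primary storage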
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan <cloudstack-...@protonmail.com> wrote:
>
>> By the way, Red Hat also recommended suspending a VM before deleting a
>> snapshot: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it
>> here:
>>
>>> 1. Pause the VM
>>> 2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>    of the running VM, not with an external qemu-img process. virsh may or
>>>    may not provide an interface for this.
>>> 3. You can resume the VM now
>>> 4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>> 5. Pause the VM again
>>> 6. 'delvm' in the qemu monitor
>>> 7. Resume the VM
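>>
>> If anyone would like to try this by hand, the same sequence could be scripted
>> roughly as follows (an untested sketch; VM, IMG and SNAP are just placeholders
>> for the domain name, the path to the qcow2 volume on the primary storage and
>> the snapshot tag):
>>
>> VM=i-2-123-VM                    # hypothetical domain name
>> IMG=/mnt/primary/volume.qcow2    # hypothetical path to the volume
>> SNAP=$(date +%F-%H%M%S)          # snapshot tag
>>
>> virsh suspend "$VM"                                                    # 1. pause the VM
>> virsh qemu-monitor-command "$VM" --hmp "savevm $SNAP"                  # 2. internal snapshot via the qemu monitor
>> virsh resume "$VM"                                                     # 3. resume the VM
>> qemu-img convert -f qcow2 -O qcow2 -s "$SNAP" "$IMG" "$IMG-snapshot"   # 4. extract the snapshot's contents
>> virsh suspend "$VM"                                                    # 5. pause the VM again
>> virsh qemu-monitor-command "$VM" --hmp "delvm $SNAP"                   # 6. delete the internal snapshot
>> virsh resume "$VM"                                                     # 7. resume the VM
>>
>> Step 4 is, as far as I understand, the part that would correspond to copying
>> the snapshot to the secondary storage in ACS terms.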
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, 4 February 2019 07:36, cloudstack-fan <cloudstack-...@protonmail.com> wrote:
>>
>>> I'd also like to add another detail, if no one minds.
>>>
>>> Sometimes one can run into this issue without shutting down a VM. The
>>> disaster might occur right after a snapshot has been copied to the secondary
>>> storage and deleted from the VM's image on the primary storage. I've seen it
>>> a couple of times, when it happened to VMs that were being monitored. The
>>> monitoring suite showed that these VMs were working fine right up until the
>>> final phase (apart from a short pause during the snapshot creation stage).
>>>
>>> I've also noticed that a VM is always suspended while a snapshot is being
>>> created, and `virsh list` shows it in the "paused" state; but while a
>>> snapshot is being deleted from the image, the same command always shows the
>>> "running" state, although the VM doesn't respond to anything during the
>>> snapshot deletion phase.
>>>
>>> It seems to be a bug in KVM/QEMU itself, I think. Proxmox users face the same
>>> issue (see
>>> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>>> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/
>>> and other similar threads), but it would also be great to have some
>>> workaround in ACS. Maybe, just as you proposed, it would be wise to suspend
>>> the VM before snapshot deletion and resume it afterwards. It would give ACS
>>> a serious advantage over other orchestration systems. :-)
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <kudryavtsev...@bw-sw.com> wrote:
>>>
>>>> Yes, the image gets corrupted only after the VM shutdown.
>>>>
>>>> Fri, 1 Feb 2019, 15:01 Sean Lair sl...@ippathways.com:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using NFS storage. It is actually native NFS mounts on a NetApp
>>>>> storage system. We haven't seen those log entries, but we also don't
>>>>> always know when a VM gets corrupted... When we finally get a call that a
>>>>> VM is having issues, we've found that it was corrupted a while ago.
>>>>>
>>>>> -----Original Message-----
>>>>> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
>>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>>> To: us...@cloudstack.apache.org
>>>>> Cc: dev@cloudstack.apache.org
>>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>>
>>>>> Hello Sean,
>>>>>
>>>>> It seems that you've encountered the same issue that I've been facing
>>>>> during the last 5-6 years of using ACS with KVM hosts (see this thread, if
>>>>> you're interested in additional details:
>>>>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>>
>>>>> I'd like to state that creating snapshots of a running virtual machine is
>>>>> a bit risky. I've implemented some workarounds in my environment, but I'm
>>>>> still not sure that they are 100% effective.
>>>>>
>>>>> I have a couple of questions, if you don't mind. What kind of storage do
>>>>> you use, if it's not a secret? Does your storage use XFS as a filesystem?
>>>>> Did you see something like this in your log files?
>>>>>
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>>
>>>>> Did you see any unusual messages in your log files when the disaster happened?
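>>>>>
>>>>> (On a CentOS 7 host with the default log locations, a quick way to check
>>>>> would be something along these lines - just a suggestion:)
>>>>>
>>>>> dmesg -T | grep -i 'possible memory allocation deadlock'
>>>>> grep -i 'possible memory allocation deadlock' /var/log/messages*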
>>>>>
>>>>> I hope things will turn out well. Wishing you good luck and all the best!
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We had some instances where VM disks became corrupted when using KVM
>>>>>> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>>>
>>>>>> The first time was when someone mass-enabled scheduled snapshots on a
>>>>>> large number of VMs and secondary storage filled up. We had to restore
>>>>>> all of those VM disks... but we believed it was just our fault for
>>>>>> letting secondary storage fill up.
>>>>>>
>>>>>> Today we had an instance where a snapshot failed and now the disk image
>>>>>> is corrupted and the VM can't boot. Here is the output of some commands:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> We tried restoring to before the snapshot failure, but we still see
>>>>>> strange errors:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> file format: qcow2
>>>>>> virtual size: 50G (53687091200 bytes)
>>>>>> disk size: 73G
>>>>>> cluster_size: 65536
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE                DATE        VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G 2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G 2019-01-06 11:03:16  3431:52:23.942
>>>>>> Format specific information:
>>>>>>     compat: 1.1
>>>>>>     lazy refcounts: false
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
>>>>>> No errors were found on the image.
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE                DATE        VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G 2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G 2019-01-06 11:03:16  3431:52:23.942
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Everyone is now extremely hesitant to use snapshots in KVM... We tried
>>>>>> deleting the snapshots in the restored disk image, but it errors out...
>>>>>>
>>>>>> Does anyone else have issues with KVM snapshots? We are considering just
>>>>>> disabling this functionality now...
>>>>>>
>>>>>> Thanks
>>>>>> Sean
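P.S. Sean, regarding the snapshots that refuse to be deleted from the restored image: since the restored copy opens fine (as in your second listing above), removing the leftover internal snapshots offline sometimes helps. A rough sketch of what I would try, on a copy of the file and with the VM stopped (the scratch path is made up, of course):

cp ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80 /scratch/restored-copy.qcow2    # hypothetical scratch location
qemu-img snapshot -l /scratch/restored-copy.qcow2                         # list the internal snapshots
qemu-img snapshot -d a8fdf99f-8219-4032-a9c8-87a6e09e7f95 /scratch/restored-copy.qcow2
qemu-img snapshot -d b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd /scratch/restored-copy.qcow2
qemu-img check /scratch/restored-copy.qcow2                               # verify before putting it back into place

If even that fails, "qemu-img convert -f qcow2 -O qcow2 /scratch/restored-copy.qcow2 /scratch/flattened.qcow2" should produce a fresh image with the current disk contents and no internal snapshot table at all.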