Dear colleagues,

Has anyone upgraded to 4.11.3 yet? This version includes a patch that should help avoid this problem: https://github.com/apache/cloudstack/pull/3194. It would be great to know whether it has helped you.
Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup
> (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot).
> It allows a snapshot to be created at one moment and backed up at another. It
> would be amazing if it also offered the snapshot deletion procedure (I mean
> deletion from the primary storage) as a separate operation. Then I could check
> that I/O activity is low before _deleting_ a snapshot from the primary
> storage, not only before _creating_ it; that could be a nice workaround.
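> Roughly what I have in mind is something like the following (just a rough
> sketch, assuming the sysstat package is installed; VOLUME_DEV and THRESHOLD
> are made-up names for the block device backing the primary storage and for
> the utilisation limit):
>
> VOLUME_DEV=sdb   # hypothetical: device backing the primary storage on this host
> THRESHOLD=10     # hypothetical: the %util value we are willing to tolerate
>
> # wait until the device has been mostly idle for a 5-second sample
> while true; do
>     UTIL=$(iostat -dx "$VOLUME_DEV" 5 2 | awk -v dev="$VOLUME_DEV" '$1 == dev {util=$NF} END {print int(util)}')
>     [ "$UTIL" -lt "$THRESHOLD" ] && break
>     echo "$VOLUME_DEV is at ${UTIL}% utilisation, waiting..."
>     sleep 30
> done
>
> # ...and only at this point trigger the deletion of the snapshot from the primary storage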
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan <cloudstack-...@protonmail.com> wrote:
>
>> By the way, Red Hat also recommended suspending a VM before deleting a
>> snapshot: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it
>> here:
>>
>>> 1. Pause the VM
>>> 2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>    of the running VM, not with an external qemu-img process. virsh may or
>>>    may not provide an interface for this.
>>> 3. You can resume the VM now
>>> 4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>> 5. Pause the VM again
>>> 6. 'delvm' in the qemu monitor
>>> 7. Resume the VM
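>>
>> If anyone would like to try this by hand, the same sequence could be scripted
>> roughly as follows (an untested sketch; VM, IMG and SNAP are just placeholders
>> for the domain name, the path to the qcow2 volume on the primary storage and
>> the snapshot tag):
>>
>> VM=i-2-123-VM                    # hypothetical domain name
>> IMG=/mnt/primary/volume.qcow2    # hypothetical path to the volume
>> SNAP=$(date +%F-%H%M%S)          # snapshot tag
>>
>> virsh suspend "$VM"                                                    # 1. pause the VM
>> virsh qemu-monitor-command "$VM" --hmp "savevm $SNAP"                  # 2. internal snapshot via the qemu monitor
>> virsh resume "$VM"                                                     # 3. resume the VM
>> qemu-img convert -f qcow2 -O qcow2 -s "$SNAP" "$IMG" "$IMG-snapshot"   # 4. extract the snapshot's contents
>> virsh suspend "$VM"                                                    # 5. pause the VM again
>> virsh qemu-monitor-command "$VM" --hmp "delvm $SNAP"                   # 6. delete the internal snapshot
>> virsh resume "$VM"                                                     # 7. resume the VM
>>
>> Step 4 is, as far as I understand, the part that would correspond to copying
>> the snapshot to the secondary storage in ACS terms.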
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Monday, 4 February 2019 07:36, cloudstack-fan <cloudstack-...@protonmail.com> wrote:
>>
>>> I'd also like to add another detail, if no one minds.
>>>
>>> Sometimes one can run into this issue without shutting down a VM. The
>>> disaster might occur right after a snapshot has been copied to the secondary
>>> storage and deleted from the VM's image on the primary storage. I've seen it
>>> a couple of times, when it happened to VMs that were being monitored. The
>>> monitoring suite showed that these VMs were working fine right up until the
>>> final phase (apart from a short pause during the snapshot creation stage).
>>>
>>> I've also noticed that a VM is always suspended while a snapshot is being
>>> created, and `virsh list` shows it in the "paused" state; but while a
>>> snapshot is being deleted from the image, the same command always shows the
>>> "running" state, although the VM doesn't respond to anything during the
>>> snapshot deletion phase.
>>>
>>> It seems to be a bug in KVM/QEMU itself, I think. Proxmox users face the same
>>> issue (see
>>> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>>> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/
>>> and other similar threads), but it would also be great to have some
>>> workaround in ACS. Maybe, just as you proposed, it would be wise to suspend
>>> the VM before snapshot deletion and resume it afterwards. It would give ACS
>>> a serious advantage over other orchestration systems. :-)
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <kudryavtsev...@bw-sw.com> wrote:
>>>
>>>> Yes, the image gets corrupted only after the VM shutdown.
>>>>
>>>> Fri, 1 Feb 2019, 15:01 Sean Lair sl...@ippathways.com:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using NFS storage. It is actually native NFS mounts on a NetApp
>>>>> storage system. We haven't seen those log entries, but we also don't
>>>>> always know when a VM gets corrupted... When we finally get a call that a
>>>>> VM is having issues, we've found that it was corrupted a while ago.
>>>>>
>>>>> -----Original Message-----
>>>>> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
>>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>>> To: us...@cloudstack.apache.org
>>>>> Cc: dev@cloudstack.apache.org
>>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>>
>>>>> Hello Sean,
>>>>>
>>>>> It seems that you've encountered the same issue that I've been facing
>>>>> during the last 5-6 years of using ACS with KVM hosts (see this thread, if
>>>>> you're interested in additional details:
>>>>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>>
>>>>> I'd like to state that creating snapshots of a running virtual machine is
>>>>> a bit risky. I've implemented some workarounds in my environment, but I'm
>>>>> still not sure that they are 100% effective.
>>>>>
>>>>> I have a couple of questions, if you don't mind. What kind of storage do
>>>>> you use, if it's not a secret? Does your storage use XFS as a filesystem?
>>>>> Did you see something like this in your log files?
>>>>>
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>>>
>>>>> Did you see any unusual messages in your log files when the disaster happened?
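>>>>>
>>>>> (On a CentOS 7 host with the default log locations, a quick way to check
>>>>> would be something along these lines - just a suggestion:)
>>>>>
>>>>> dmesg -T | grep -i 'possible memory allocation deadlock'
>>>>> grep -i 'possible memory allocation deadlock' /var/log/messages*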
>>>>>
>>>>> I hope things will turn out well. Wishing you good luck and all the best!
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We had some instances where VM disks became corrupted when using KVM
>>>>>> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>>>
>>>>>> The first time was when someone mass-enabled scheduled snapshots on a
>>>>>> large number of VMs and secondary storage filled up. We had to restore
>>>>>> all of those VM disks... but we believed it was just our fault for
>>>>>> letting secondary storage fill up.
>>>>>>
>>>>>> Today we had an instance where a snapshot failed and now the disk image
>>>>>> is corrupted and the VM can't boot. Here is the output of some commands:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> We tried restoring to before the snapshot failure, but we still see
>>>>>> strange errors:
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> file format: qcow2
>>>>>> virtual size: 50G (53687091200 bytes)
>>>>>> disk size: 73G
>>>>>> cluster_size: 65536
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE                DATE        VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G 2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G 2019-01-06 11:03:16  3431:52:23.942
>>>>>> Format specific information:
>>>>>>     compat: 1.1
>>>>>>     lazy refcounts: false
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
>>>>>> No errors were found on the image.
>>>>>>
>>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>> Snapshot list:
>>>>>> ID  TAG                                   VM SIZE                DATE        VM CLOCK
>>>>>> 1   a8fdf99f-8219-4032-a9c8-87a6e09e7f95     3.7G 2018-12-23 11:01:43  3099:35:55.242
>>>>>> 2   b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd     3.8G 2019-01-06 11:03:16  3431:52:23.942
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>>>
>>>>>> Everyone is now extremely hesitant to use snapshots in KVM... We tried
>>>>>> deleting the snapshots in the restored disk image, but it errors out...
>>>>>>
>>>>>> Does anyone else have issues with KVM snapshots? We are considering just
>>>>>> disabling this functionality now...
>>>>>>
>>>>>> Thanks
>>>>>> Sean
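P.S. Sean, regarding the snapshots that refuse to be deleted from the restored image: since the restored copy opens fine (as in your second listing above), removing the leftover internal snapshots offline sometimes helps. A rough sketch of what I would try, on a copy of the file and with the VM stopped (the scratch path is made up, of course):

cp ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80 /scratch/restored-copy.qcow2    # hypothetical scratch location
qemu-img snapshot -l /scratch/restored-copy.qcow2                         # list the internal snapshots
qemu-img snapshot -d a8fdf99f-8219-4032-a9c8-87a6e09e7f95 /scratch/restored-copy.qcow2
qemu-img snapshot -d b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd /scratch/restored-copy.qcow2
qemu-img check /scratch/restored-copy.qcow2                               # verify before putting it back into place

If even that fails, "qemu-img convert -f qcow2 -O qcow2 /scratch/restored-copy.qcow2 /scratch/flattened.qcow2" should produce a fresh image with the current disk contents and no internal snapshot table at all.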