Re: Snapshots on KVM corrupting disk images

2019-01-27 Thread cloudstack-fan
Hello Sean,

It seems that you've encountered the same issue that I've been facing during 
the last 5-6 years of using ACS with KVM hosts (see this thread, if you're 
interested in additional details: 
https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).

I'd like to state that creating snapshots of a running virtual machine is a bit 
risky. I've implemented some workarounds in my environment, but I'm still not 
sure that they are 100% effective.

I have a couple of questions, if you don't mind. What kind of storage do you 
use, if it's not a secret? Does your storage use XFS as a filesystem? Did you 
see something like this in your log-files?
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in 
kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in 
kmem_realloc (mode:0x250)
[***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in 
kmem_realloc (mode:0x250)
Did you see any unusual messages in your log-file when the disaster happened?
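
(Just in case it helps, this is roughly how I look for those messages on a 
host; nothing fancy, and the log location depends on your distribution.)

dmesg -T | grep 'possible memory allocation deadlock'
grep 'possible memory allocation deadlock' /var/log/messages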

I hope things will be well. Wish you good luck and all the best!


‐‐‐ Original Message ‐‐‐
On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:

> Hi all,
>
> We had some instances where VM disks are becoming corrupted when using KVM 
> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>
> The first time was when someone mass-enabled scheduled snapshots on a large 
> number of VMs and secondary storage filled up. We had to restore all those 
> VM disks... but we believed it was just our fault for letting secondary storage 
> fill up.
>
> Today we had an instance where a snapshot failed and now the disk image is 
> corrupted and the VM can't boot. Here is the output of some commands:
>
> --
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
> read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not 
> read snapshots: File too large
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> ---
>
> We tried restoring to before the snapshot failure, but still have strange 
> errors:
>
> 
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> -rw-r--r--. 1 root root 73G Jan 22 11:04 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> file format: qcow2
> virtual size: 50G (53687091200 bytes)
> disk size: 73G
> cluster_size: 65536
> Snapshot list:
> ID TAG VM SIZE DATE VM CLOCK
> 1 a8fdf99f-8219-4032-a9c8-87a6e09e7f95 3.7G 2018-12-23 11:01:43 3099:35:55.242
> 2 b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd 3.8G 2019-01-06 11:03:16 3431:52:23.942
> Format specific information:
> compat: 1.1
> lazy refcounts: false
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check 
> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 
> 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 
> 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 
> 0x55d16ddd9f7d
> No errors were found on the image.
>
> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img sn

RE: Snapshots on KVM corrupting disk images

2019-02-03 Thread cloudstack-fan
Yes, that's the scariest thing: you never find out on the same day that the 
image is corrupted. Usually a week or a fortnight passes before one learns 
about the problem (and all the old snapshots have been removed by that time).

Some time ago I implemented a simple script that ran `qemu-img check` on each 
image on a daily basis, but then I had to give the idea up, because `qemu-img 
check` usually reports a lot of errors on a running instance's volume; it only 
tells the truth when the instance is stopped. :-( (A sketch of such a check, 
limited to images that aren't in use, follows below.)
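
Just to illustrate the idea, here is a minimal sketch of such a check. It skips 
images that are currently held open by a process (detected with `fuser`), since 
`qemu-img check` is only meaningful for volumes that aren't in use; the image 
directory path is an assumption, adjust it to your primary storage mount point.

#!/bin/bash
# Daily qcow2 integrity check (sketch). Skips images that are in use, because
# `qemu-img check` output is unreliable for the volume of a running instance.
IMAGE_DIR="/var/lib/libvirt/images"

for image in "${IMAGE_DIR}"/*; do
    [ -f "${image}" ] || continue
    # fuser returns 0 when some process (e.g. qemu-kvm) has the file open
    if fuser -s "${image}" 2>/dev/null; then
        echo "SKIP (in use): ${image}"
        continue
    fi
    if ! /usr/bin/qemu-img check "${image}" > /dev/null 2>&1; then
        echo "CORRUPTION SUSPECTED: ${image}" | logger -t qcow2-check
    fi
done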

Here is a bit of advice.
1. First of all, never make a snapshot while the VM shows high I/O activity. I 
implemented an SNMP agent that exposes the I/O activity of all VMs under a 
certain MIB, but I also had to implement another application to manage 
snapshots; it creates a new snapshot only when it's reasonably sure that the VM 
isn't writing a lot of data to the storage. I'd gladly share it, but 
implementing all these things can be a bit tricky, so I need some time to 
document it. Of course, you can always implement your own solution for that; 
maybe it would even be a nice idea to implement this in ACS itself. :) A rough 
sketch of such a gate follows this list.
2. Consider dropping caches every hour (`/bin/echo 1 > 
/proc/sys/vm/drop_caches`); a sample cron entry is included in the sketch 
below. I found some correlation between image corruption and cache overflow.
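
To make point 1 more concrete, here is a rough sketch of such a gate built on 
plain `virsh` instead of my SNMP agent. The domain name, the disk device and 
the 10 MB/s threshold are placeholders, and the actual snapshot call is left as 
a comment:

#!/bin/bash
# Sketch: only proceed with a volume snapshot if the VM's write rate is low.
# Placeholders: i-2-345-VM (domain), vda (disk device), 10 MB/s threshold.
DOMAIN="i-2-345-VM"
DEVICE="vda"
THRESHOLD_BPS=$((10 * 1024 * 1024))

wr_bytes() {
    # domblkstat prints lines like "vda wr_bytes 183249"
    virsh domblkstat "${DOMAIN}" "${DEVICE}" | awk '/wr_bytes/ {print $3}'
}

BEFORE=$(wr_bytes)
sleep 60
AFTER=$(wr_bytes)
RATE=$(( (AFTER - BEFORE) / 60 ))

if [ "${RATE}" -lt "${THRESHOLD_BPS}" ]; then
    echo "Write rate ${RATE} B/s is low, taking the snapshot"
    # call the CloudStack API (createSnapshot) for the corresponding volume here
else
    echo "Write rate ${RATE} B/s is too high, postponing the snapshot"
fi

And point 2 can be a one-line cron entry, e.g. in /etc/cron.d/drop-caches:

0 * * * * root /bin/echo 1 > /proc/sys/vm/drop_caches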

I'm still not 100% sure these measures guarantee calm sleep at night, but my 
statistics (~600 VMs on different hosts, clusters, pods and zones) suggest that 
implementing them was a step in the right direction (knocking on wood, spitting 
over the left shoulder, etc.).

Good luck!


‐‐‐ Original Message ‐‐‐
On Friday, 1 February 2019 22:01, Sean Lair  wrote:

> Hello,
>
> We are using NFS storage. It is actually native NFS mounts on a NetApp 
> storage system. We haven't seen those log entries, but we also don't always 
> know when a VM gets corrupted... When we finally get a call that a VM is 
> having issues, we've found that it was corrupted a while ago.
>
> -Original Message-
> From: cloudstack-fan [mailto:[email protected]]
> Sent: Sunday, January 27, 2019 1:45 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing during 
> the last 5-6 years of using ACS with KVM hosts (see this thread, if you're 
> interested in additional details: 
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is a 
> bit risky. I've implemented some workarounds in my environment, but I'm still 
> not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do you 
> use, if it's not a secret? Does your storage use XFS as a filesystem? Did you 
> see something like this in your log-files?
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 
> in kmem_realloc (mode:0x250) [***.***] XFS: qemu-kvm(***) possible memory 
> allocation deadlock size 65552 in kmem_realloc (mode:0x250) [***.***] XFS: 
> qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc 
> (mode:0x250) Did you see any unusual messages in your log-file when the 
> disaster happened?
>
> I hope, things will be well. Wish you good luck and all the best!
>
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair [email protected] wrote:
>
> > Hi all,
> > We had some instances where VM disks are becoming corrupted when using KVM 
> > snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> > The first time was when someone mass-enabled scheduled snapshots on a lot 
> > of large number VMs and secondary storage filled up. We had to restore all 
> > those VM disks... But believed it was just our fault with letting secondary 
> > storage fill up.
> > Today we had an instance where a snapshot failed and now the disk image is 
> > corrupted and the VM can't boot. here is the output of some commands:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
> > ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
> > Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a

Re: Snapshots on KVM corrupting disk images

2019-02-03 Thread cloudstack-fan
Just like that cat in a box. The observer needs to open the box to learn if the 
cat is alive. :-)

‐‐‐ Original Message ‐‐‐
On Friday, 1 February 2019 22:25, Ivan Kudryavtsev  
wrote:

> Yes, only after the VM shutdown, the image is corrupted.
>
> Fri, 1 Feb 2019, 15:01 Sean Lair [email protected]:
>
>> Hello,
>>
>> We are using NFS storage.  It is actually native NFS mounts on a NetApp 
>> storage system.  We haven't seen those log entries, but we also don't always 
>> know when a VM gets corrupted...  When we finally get a call that a VM is 
>> having issues, we've found that it was corrupted a while ago.
>>
>> -Original Message-
>> From: cloudstack-fan [mailto:[email protected]]
>> Sent: Sunday, January 27, 2019 1:45 PM
>> To: [email protected]
>> Cc: [email protected]
>> Subject: Re: Snapshots on KVM corrupting disk images
>>
>> Hello Sean,
>>
>> It seems that you've encountered the same issue that I've been facing during 
>> the last 5-6 years of using ACS with KVM hosts (see this thread, if you're 
>> interested in additional details: 
>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>
>> I'd like to state that creating snapshots of a running virtual machine is a 
>> bit risky. I've implemented some workarounds in my environment, but I'm 
>> still not sure that they are 100% effective.
>>
>> I have a couple of questions, if you don't mind. What kind of storage do you 
>> use, if it's not a secret? Does your storage use XFS as a filesystem? Did you 
>> see something like this in your log-files?
>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 
>> in kmem_realloc (mode:0x250) [***.***] XFS: qemu-kvm(***) possible memory 
>> allocation deadlock size 65552 in kmem_realloc (mode:0x250) [***.***] XFS: 
>> qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc 
>> (mode:0x250) Did you see any unusual messages in your log-file when the 
>> disaster happened?
>>
>> I hope, things will be well. Wish you good luck and all the best!
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:
>>
>>> Hi all,
>>>
>>> We had some instances where VM disks are becoming corrupted when using KVM 
>>> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>
>>> The first time was when someone mass-enabled scheduled snapshots on a lot 
>>> of large number VMs and secondary storage filled up. We had to restore all 
>>> those VM disks... But believed it was just our fault with letting secondary 
>>> storage fill up.
>>>
>>> Today we had an instance where a snapshot failed and now the disk image is 
>>> corrupted and the VM can't boot. here is the output of some commands:
>>>
>>> --
>>>
>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check
>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>> Could not read snapshots: File too large
>>>
>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info
>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>> Could not read snapshots: File too large
>>>
>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh
>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>> -rw-r--r--. 1 root root 73G Jan 22 11:04
>>> ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>
>>> --

Re: Snapshots on KVM corrupting disk images

2019-02-03 Thread cloudstack-fan
I'd also like to add another detail, if no one minds.

Sometimes one can run into this issue without shutting down a VM. The disaster 
might occur right after a snapshot is copied to the secondary storage and 
deleted from the VM's image on the primary storage. I saw it a couple of times 
when it happened to VMs that were being monitored: the monitoring suite showed 
that these VMs were working fine right up until the final phase (apart from a 
short pause during the snapshot creation stage).

I also noticed that a VM is always suspended while a snapshot is being created, 
and `virsh list` shows it in the "paused" state; but while a snapshot is being 
deleted from the image, the same command always shows the "running" state, even 
though the VM doesn't respond to anything during the snapshot deletion phase.
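
(For what it's worth, that observation is easy to reproduce with a trivial 
watcher like the one below; the domain name is a placeholder.)

#!/bin/bash
# Sketch: log the libvirt state of a domain once per second while a volume
# snapshot is created and then deleted, to see the paused/running transitions.
DOMAIN="i-2-345-VM"
while true; do
    echo "$(date '+%H:%M:%S') $(virsh domstate "${DOMAIN}" 2>/dev/null)"
    sleep 1
done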

It seems to be a bug in KVM/QEMU itself. Proxmox users face the same issue (see 
https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
 https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ 
and other similar threads), but it would also be great to have a workaround in 
ACS. Maybe, just as you proposed, it would be wise to suspend the VM before 
snapshot deletion and resume it afterwards. That would give ACS a serious 
advantage over other orchestration systems. :-)

‐‐‐ Original Message ‐‐‐
On Friday, 1 February 2019 22:25, Ivan Kudryavtsev  
wrote:

> Yes, only after the VM shutdown, the image is corrupted.
>
> Fri, 1 Feb 2019, 15:01 Sean Lair [email protected]:
>
>> Hello,
>>
>> We are using NFS storage.  It is actually native NFS mounts on a NetApp 
>> storage system.  We haven't seen those log entries, but we also don't always 
>> know when a VM gets corrupted...  When we finally get a call that a VM is 
>> having issues, we've found that it was corrupted a while ago.
>>
>> -Original Message-
>> From: cloudstack-fan [mailto:[email protected]]
>> Sent: Sunday, January 27, 2019 1:45 PM
>> To: [email protected]
>> Cc: [email protected]
>> Subject: Re: Snapshots on KVM corrupting disk images
>>
>> Hello Sean,
>>
>> It seems that you've encountered the same issue that I've been facing during 
>> the last 5-6 years of using ACS with KVM hosts (see this thread, if you're 
>> interested in additional details: 
>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>
>> I'd like to state that creating snapshots of a running virtual machine is a 
>> bit risky. I've implemented some workarounds in my environment, but I'm 
>> still not sure that they are 100% effective.
>>
>> I have a couple of questions, if you don't mind. What kind of storage do you 
>> use, if it's not a secret? Does your storage use XFS as a filesystem? Did you 
>> see something like this in your log-files?
>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 
>> in kmem_realloc (mode:0x250) [***.***] XFS: qemu-kvm(***) possible memory 
>> allocation deadlock size 65552 in kmem_realloc (mode:0x250) [***.***] XFS: 
>> qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc 
>> (mode:0x250) Did you see any unusual messages in your log-file when the 
>> disaster happened?
>>
>> I hope, things will be well. Wish you good luck and all the best!
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, 22 January 2019 18:30, Sean Lair  wrote:
>>
>>> Hi all,
>>>
>>> We had some instances where VM disks are becoming corrupted when using KVM 
>>> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>
>>> The first time was when someone mass-enabled scheduled snapshots on a lot 
>>> of large number VMs and secondary storage filled up. We had to restore all 
>>> those VM disks... But believed it was just our fault with letting secondary 
>>> storage fill up.
>>>
>>> Today we had an instance where a snapshot failed and now the disk image is 
>>> corrupted and the VM can't boot. here is the output of some commands:
>>>
>>> --

Re: Snapshots on KVM corrupting disk images

2019-02-03 Thread cloudstack-fan
By the way, Red Hat also recommended suspending a VM before deleting a snapshot: 
https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:

> 1. Pause the VM
> 2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>    of the running VM, not with an external qemu-img process. virsh may or
>    may not provide an interface for this.
> 3. You can resume the VM now
> 4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
> 5. Pause the VM again
> 6. 'delvm' in the qemu monitor
> 7. Resume the VM
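
For the record, here is a rough sketch of how those seven steps might be 
scripted with virsh on the host. The domain name, image path and snapshot tag 
are placeholders, and `virsh qemu-monitor-command --hmp` is used to reach the 
savevm/delvm monitor commands; treat it as an illustration of the quoted 
procedure, not as what ACS currently does:

#!/bin/bash
# Sketch of the procedure quoted above. Placeholders: domain, image path, tag.
DOMAIN="i-2-345-VM"
IMAGE="/var/lib/libvirt/images/<volume-uuid>"
TAG="snap-$(date +%Y%m%d-%H%M%S)"

virsh suspend "${DOMAIN}"                                      # 1. pause the VM
virsh qemu-monitor-command "${DOMAIN}" --hmp "savevm ${TAG}"   # 2. internal snapshot
virsh resume "${DOMAIN}"                                       # 3. resume the VM

# 4. extract the snapshot into a standalone file while the VM keeps running
qemu-img convert -f qcow2 -O qcow2 -s "${TAG}" "${IMAGE}" "${IMAGE}-snapshot"

virsh suspend "${DOMAIN}"                                      # 5. pause again
virsh qemu-monitor-command "${DOMAIN}" --hmp "delvm ${TAG}"    # 6. delete it
virsh resume "${DOMAIN}"                                       # 7. resume the VM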

‐‐‐ Original Message ‐‐‐
On Monday, 4 February 2019 07:36, cloudstack-fan 
 wrote:

> I'd also like to add another detail, if no one minds.
>
> Sometimes one can run into this issue without shutting down a VM. The 
> disaster might occur right after a snapshot is copied to a secondary storage 
> and deleted from the VM's image on the primary storage. I saw it a couple of 
> times, when it happened to the VMs being monitored. The monitoring suite 
> showed that these VMs were working fine right until the final phase (apart 
> from a short pause of the snapshot creating stage).
>
> I also noticed that a VM is always suspended when a snapshot is being created 
> and `virsh list` shows it's in the "paused" state, but when a snapshot is 
> being deleted from the image the same command always shows the "running" 
> state, although the VM doesn't respond to anything during the snapshot 
> deletion phase.
>
> It seems to be a bug of KVM/QEMU itself, I think. Proxmox users also face the 
> same issue (see 
> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>  
> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ 
> and other similar threads), but it also would be great to make some 
> workaround for ACS. Maybe, just as you proposed, it would be wise to suspend 
> the VM before snapshot deletion and resume it after that. It would give ACS a 
> serious advantage over other orchestration systems. :-)
>
> ‐‐‐ Original Message ‐‐‐
> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev  
> wrote:
>
>> Yes, only after the VM shutdown, the image is corrupted.
>>
>> Fri, 1 Feb 2019, 15:01 Sean Lair [email protected]:
>>
>>> Hello,
>>>
>>> We are using NFS storage.  It is actually native NFS mounts on a NetApp 
>>> storage system.  We haven't seen those log entries, but we also don't 
>>> always know when a VM gets corrupted...  When we finally get a call that a 
>>> VM is having issues, we've found that it was corrupted a while ago.
>>>
>>> -Original Message-
>>> From: cloudstack-fan [mailto:[email protected]]
>>> Sent: Sunday, January 27, 2019 1:45 PM
>>> To: [email protected]
>>> Cc: [email protected]
>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>
>>> Hello Sean,
>>>
>>> It seems that you've encountered the same issue that I've been facing 
>>> during the last 5-6 years of using ACS with KVM hosts (see this thread, if 
>>> you're interested in additional details: 
>>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>
>>> I'd like to state that creating snapshots of a running virtual machine is a 
>>> bit risky. I've implemented some workarounds in my environment, but I'm 
>>> still not sure that they are 100% effective.
>>>
>>> I have a couple of questions, if you don't mind. What kind of storage do 
>>> you use, if it's not a secret? Does your storage use XFS as a filesystem? 
>>> Did you see something like this in your log-files?
>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 
>>> in kmem_realloc (mode:0x250) [***.***] XFS: qemu-kvm(***) possible memory 
>>> allocation deadlock size 65552 in kmem_realloc (mode:0x250) [***.***] XFS: 
>>> qemu-kvm(***) possible memory allocation deadlock size 65552 in 
>>> kmem_realloc (mode:0x250) Did you see any unusual messages in your log-file 
>>> when the disaster happened?
>>>
>>> I hope, things will be well. Wish you good luck and all the best!
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On Tuesday, 22 January 2019 18:30

Re: Snapshots on KVM corrupting disk images

2019-02-05 Thread cloudstack-fan
And one more thought, by the way.

There's a cool new feature - asynchronous backup 
(https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot).
 It allows one to create a snapshot at one moment and back it up at another. It 
would be amazing if it also gave the opportunity to perform the snapshot 
deletion procedure (I mean deletion from the primary storage) as a separate 
operation. Then I could check whether I/O activity is low before _deleting_ a 
snapshot from the primary storage, not only before _creating_ it; that could be 
a nice workaround.
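
(For context, the creation part of that feature can already be driven from 
cloudmonkey with something like the line below. The `asyncbackup` parameter 
name is my recollection of the feature spec, so please verify it against your 
version; the volume UUID is a placeholder. What's missing is an equivalent knob 
for the deletion from the primary storage.)

# take the snapshot now, let the copy to secondary storage happen later
create snapshot volumeid=<volume-uuid> asyncbackup=true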

Dear colleagues, what do you think, is it doable?

Thank you!

Best regards,
a big CloudStack fan :)

‐‐‐ Original Message ‐‐‐
On Monday, 4 February 2019 07:46, cloudstack-fan 
 wrote:

> By the way, RedHat recommended to suspend a VM before deleting a snapshot 
> too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>
>> 1. Pause the VM
>>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>  of the running VM, not with an external qemu-img process. virsh may or 
>> may
>>  not provide an interface for this.
>>   3. You can resume the VM now
>>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>   5. Pause the VM again
>>   6. 'delvm' in the qemu monitor
>>   7. Resume the VM
>
> ‐‐‐ Original Message ‐‐‐
> On Monday, 4 February 2019 07:36, cloudstack-fan 
>  wrote:
>
>> I'd also like to add another detail, if no one minds.
>>
>> Sometimes one can run into this issue without shutting down a VM. The 
>> disaster might occur right after a snapshot is copied to a secondary storage 
>> and deleted from the VM's image on the primary storage. I saw it a couple of 
>> times, when it happened to the VMs being monitored. The monitoring suite 
>> showed that these VMs were working fine right until the final phase (apart 
>> from a short pause of the snapshot creating stage).
>>
>> I also noticed that a VM is always suspended when a snapshot is being 
>> created and `virsh list` shows it's in the "paused" state, but when a 
>> snapshot is being deleted from the image the same command always shows the 
>> "running" state, although the VM doesn't respond to anything during the 
>> snapshot deletion phase.
>>
>> It seems to be a bug of KVM/QEMU itself, I think. Proxmox users also face 
>> the same issue (see 
>> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>>  
>> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/
>>  and other similar threads), but it also would be great to make some 
>> workaround for ACS. Maybe, just as you proposed, it would be wise to suspend 
>> the VM before snapshot deletion and resume it after that. It would give ACS 
>> a serious advantage over other orchestration systems. :-)
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev 
>>  wrote:
>>
>>> Yes, only after the VM shutdown, the image is corrupted.
>>>
>>> Fri, 1 Feb 2019, 15:01 Sean Lair [email protected]:
>>>
>>>> Hello,
>>>>
>>>> We are using NFS storage.  It is actually native NFS mounts on a NetApp 
>>>> storage system.  We haven't seen those log entries, but we also don't 
>>>> always know when a VM gets corrupted...  When we finally get a call that a 
>>>> VM is having issues, we've found that it was corrupted a while ago.
>>>>
>>>> -Original Message-
>>>> From: cloudstack-fan [mailto:[email protected]]
>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>> To: [email protected]
>>>> Cc: [email protected]
>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>
>>>> Hello Sean,
>>>>
>>>> It seems that you've encountered the same issue that I've been facing 
>>>> during the last 5-6 years of using ACS with KVM hosts (see this thread, if 
>>>> you're interested in additional details: 
>>>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>
>>>> I'd like to state that creating snapshots of a running virtual machine is 
>>>> a bit risky. I've implemented some workarounds in my environment, but I'm 
>>>> still not sure that they are 100% effective.
>>>&g

Re: Snapshots on KVM corrupting disk images

2019-07-14 Thread cloudstack-fan
Dear colleagues,

Has anyone upgraded to 4.11.3? This version includes a patch that should help 
avoid this problem: 
https://github.com/apache/cloudstack/pull/3194. It would be great to know 
whether it has helped you.

Thanks in advance for sharing your experience.

Best regards,
a big CloudStack fan :)

‐‐‐ Original Message ‐‐‐
On Tuesday, 5 February 2019 12:25, cloudstack-fan 
 wrote:

> And one more thought, by the way.
>
> There's a cool new feature - asynchronous backup 
> (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot).
>  It allows to create a snapshot at one moment and back it up in another. It 
> would be amazing if it gave opportunity to perform the snapshot deletion 
> procedure (I mean deletion from a primary storage) as a separate operation. 
> So I could check if I/O-activity is low before to _delete_ a snapshot from a 
> primary storage, not only before to _create_ it, it could be a nice 
> workaround.
>
> Dear colleagues, what do you think, is it doable?
>
> Thank you!
>
> Best regards,
> a big CloudStack fan :)
>
> ‐‐‐ Original Message ‐‐‐
> On Monday, 4 February 2019 07:46, cloudstack-fan 
>  wrote:
>
>> By the way, RedHat recommended to suspend a VM before deleting a snapshot 
>> too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>>
>>> 1. Pause the VM
>>>   2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>>>  of the running VM, not with an external qemu-img process. virsh may or 
>>> may
>>>  not provide an interface for this.
>>>   3. You can resume the VM now
>>>   4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>>>   5. Pause the VM again
>>>   6. 'delvm' in the qemu monitor
>>>   7. Resume the VM
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Monday, 4 February 2019 07:36, cloudstack-fan 
>>  wrote:
>>
>>> I'd also like to add another detail, if no one minds.
>>>
>>> Sometimes one can run into this issue without shutting down a VM. The 
>>> disaster might occur right after a snapshot is copied to a secondary 
>>> storage and deleted from the VM's image on the primary storage. I saw it a 
>>> couple of times, when it happened to the VMs being monitored. The 
>>> monitoring suite showed that these VMs were working fine right until the 
>>> final phase (apart from a short pause of the snapshot creating stage).
>>>
>>> I also noticed that a VM is always suspended when a snapshot is being 
>>> created and `virsh list` shows it's in the "paused" state, but when a 
>>> snapshot is being deleted from the image the same command always shows the 
>>> "running" state, although the VM doesn't respond to anything during the 
>>> snapshot deletion phase.
>>>
>>> It seems to be a bug of KVM/QEMU itself, I think. Proxmox users also face 
>>> the same issue (see 
>>> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
>>>  
>>> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/
>>>  and other similar threads), but it also would be great to make some 
>>> workaround for ACS. Maybe, just as you proposed, it would be wise to 
>>> suspend the VM before snapshot deletion and resume it after that. It would 
>>> give ACS a serious advantage over other orchestration systems. :-)
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev 
>>>  wrote:
>>>
>>>> Yes, only after the VM shutdown, the image is corrupted.
>>>>
>>>> Fri, 1 Feb 2019, 15:01 Sean Lair [email protected]:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are using NFS storage.  It is actually native NFS mounts on a NetApp 
>>>>> storage system.  We haven't seen those log entries, but we also don't 
>>>>> always know when a VM gets corrupted...  When we finally get a call that 
>>>>> a VM is having issues, we've found that it was corrupted a while ago.
>>>>>
>>>>> -Original Message-
>>>>> From: cloudstack-fan [mailto:[email protected]]
>>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>>> To: [email protected]
>>>>> Cc: [email protected]
>>

qemu2 images are being corrupted

2018-07-02 Thread cloudstack-fan
Dear colleagues,

I'm posting as an anonymous user because there's a thing that concerns me a 
little, and I'd like to share my experience with you - maybe some people can 
relate to it. ACS is amazing: it has been solving my tasks for 6 years, and I'm 
running a few ACS-backed clouds that contain hundreds and hundreds of VMs. I'm 
enjoying ACS very much, but there's a thing that scares me sometimes.

It happens pretty seldom, but the more VMs you have, the higher the chances 
that you run into this glitch. It usually happens on the sly: you don't get any 
error messages in the log-files of your cloudstack-management server or 
cloudstack-agent, so you don't even know that something has happened until you 
see that a virtual machine is having major problems. If you're lucky, you see 
it on the same day it happens; if you aren't, you won't suspect anything 
unusual for a week, until at some moment you realize that the filesystem has 
become a mess and you can't do anything to restore it. You try to restore it 
from a snapshot, but unless you have a snapshot that was created before the 
incident, your snapshots won't help. :-(

I have experienced it about 5-7 times during the last 5-6 years, and there are 
a few conditions that are always present:
 * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS 7) 
with qcow2 images (both 0.10 and 1.1 versions);
 * it happens on primary storages running different filesystems (I experienced 
it with local XFS and network-based GFS2 and NFS);
 * it happens when a volume snapshot is being made, according to the log-files 
inside the VM (the guest operating system's kernel starts complaining about 
filesystem errors);
 * at the same time, as I wrote before, there are NO error messages in the 
log-files outside of the VM whose disk image is corrupted;
 * but when you run `qemu-img check ...` against the image, you may see a lot 
of leaked clusters (that's why I'd strongly advise checking each and every 
image on each and every primary storage at least once per hour with a script 
run by your monitoring system, something like `for imagefile in $(find 
/var/lib/libvirt/images -maxdepth 1 -type f); do { /usr/bin/qemu-img check 
"${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi; } done`);
 * when it happens you can also find a record in the snapshot_store_ref table 
that refers to the snapshot on a primary storage (see an example here: 
https://pastebin.com/BuxCXVSq) - this record should have been removed when the 
snapshot's state changed from "BackingUp" to "BackedUp", but in this case it 
isn't removed. At the same time, this snapshot isn't listed in the output of 
`qemu-img snapshot -l ...`, which is why I suppose the image gets corrupted 
when ACS deletes the snapshot that has been backed up (it tries to delete the 
snapshot, something goes wrong, the image gets corrupted, but ACS thinks 
everything is fine and changes the status to "BackedUp" without a bit of 
qualm). A query sketch for spotting such leftover records follows this list;
 * if you try to restore this VM's image from the same snapshot that caused the 
destruction, or from any snapshot made after it, you'll find the same corrupted 
filesystem inside, but the snapshot image stored in your secondary storage 
doesn't show anything wrong when you run `qemu-img check ...` (so you can 
restore your image only if you have a snapshot that was created AND stored 
before the incident).
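
As a companion to the hourly `qemu-img check`, one can also look for those 
leftover snapshot_store_ref records directly in the database. The table and 
column names below are recalled from the 4.9-era schema, so treat them as 
assumptions and verify them against your installation before relying on the 
query:

#!/bin/bash
# Sketch: report snapshots that ACS marks "BackedUp" while a reference on a
# primary storage is still present (the symptom described above).
mysql -u cloud -p cloud <<'SQL'
SELECT s.id, s.uuid, s.status, r.store_role, r.install_path
FROM   snapshot_store_ref r
JOIN   snapshots s ON s.id = r.snapshot_id
WHERE  r.store_role = 'Primary'
  AND  s.status     = 'BackedUp'
  AND  s.removed IS NULL;
SQL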

As I wrote, I have seen this several times in different environments and 
different versions of ACS. I'm pretty sure I'm not the only one who has had the 
luck to experience this glitch, so let's share our stories. Maybe together 
we'll find out why it happens and how to prevent it in the future.

Thanks in advance,
An Anonymous ACS Fan

Re: qcow2 images are being corrupted

2018-07-02 Thread cloudstack-fan
Great, thank you very much for speaking up.

Would you like to share some details? What was the environment? How did you 
understand that it's somehow related to snapshots?

‐‐‐ Original Message ‐‐‐
On July 2, 2018 1:47 PM, Ivan Kudryavtsev  wrote:

> Hello, I also ran into that once in the past. I bet it's closely connected to 
> qemu snapshots.
>
> 2018-07-02 16:21 GMT+07:00 cloudstack-fan 
> :
>
>> Dear colleagues,
>>
>> I'm posting as an anonymous user, because there's a thing that concerns me a 
>> little and I'd like to share my experience with you, so maybe some people 
>> could relate to the same. ACS is amazing, it solves my tasks for 6 years, 
>> I'm running a few ACS-backed clouds that contain hundreds and hundreds of 
>> VMs. I'm enjoying ACS really much, but there's a thing that scares me 
>> sometimes.
>>
>> It happens pretty seldom, but the more VMs you have is the more chances you 
>> run into this glitch. It usually happens on the sly and you don't get any 
>> error messages in log-files of your cloudstack-management server or a 
>> cloudstack-agent, so you don't even know that something had happened until 
>> you see that a virtual machine is having major problems. If you're lucky, 
>> you see it on the same day when it happens, but if you aren't - you won't 
>> suspect anything unusual for a week, but at some moment you realize that the 
>> filesystem had become a mess and you can't do anything to restore it. You're 
>> trying to restore it from a snapshot, but if you don't have a snapshot that 
>> would be created before the incident, your snapshots won't help. :-(
>>
>> I experienced it for about 5-7 times during the last 5-6 years and there are 
>> a few conditions that always present:
>>  * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS 
>> 7) with qcow2-images (either 0.10 and 1.1 versions);
>>  * it happens on primary storages running different filesystems (I 
>> experienced it with local XFS and network-based GFS2 and NFS);
>>  * it happens when a volume snapshot is being made, according to the 
>> log-files inside of a VM (guest's operating system's kernel starts 
>> complaining on a filesystem errors);
>>  * at the same time, as I wrote before, there are NO error messages in the 
>> log-files outside of a VM which disk image is corrupted;
>>  * but when you run `qemu-img check ...` to check the image, you may see a 
>> lot of leaked clusters (that's why I'd strongly advice to check each and 
>> every image one each and every primary storage at least once per hour by a 
>> script being run by your monitoring system, something kind of `for imagefile 
>> in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do { 
>> /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi; 
>> } done`);
>>  * when it happens you can also find a record in the snapshot_store_ref 
>> table that refers to the snapshot on a primary storage (see an example here 
>> https://pastebin.com/BuxCXVSq) - this record should have been removed when 
>> the snapshot's state is being changed from "BackingUp" to "BackedUp", but it 
>> isn't being removed in this case. At the same time, this snapshot isn't 
>> being listed in the output of `qemu-img snapshot -l ...`, so that's why I 
>> suppose that the image is being corrupted when ACS deletes the snapshot that 
>> has been backed up (it tries to delete the snapshot, but something goes 
>> wrong, image is being corrupted, but ACS thinks that everything's fine and 
>> changes the status to "BackedUp" without a bit of qualm);
>>  * if you're trying to restore this VM's image from the same snapshot that 
>> has caused destruction or any other snapshot that has been made after that, 
>> you'll find the same corrupted filesystem inside, but the snapshot's image 
>> that is stored in your secondary storage doesn't show anything wrong when 
>> you run `qemu-img check ...` (so you can restore your image only if you have 
>> a snapshot that had been created AND stored before the incident).
>>
>> As I wrote, I saw several times in different environments and different 
>> versions of ACS. I'm pretty sure that it's not only me who had such a luck 
>> to experience the same glitch, so let's share our stories. Maybe together 
>> we'll find out why does it happen and how to prevent that in future.
>>
>> Thanks in advance,
>> An Anonymous ACS Fan
>
> --
> With best regards, Ivan Kudryavtsev
> Bitworks Software, Ltd.
> Cell: +7-923-414-1515
> WWW: http://bitworks.software/

Re: qemu2 images are being corrupted

2018-08-18 Thread cloudstack-fan
Dear colleagues,

You might find it interesting:
https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/

It seems that qemu-kvm really could corrupt a QCOW2 image. :-(

What do you think, is it possible to avoid that? Maybe there's an option to 
use the RAW format instead of QCOW2?

Thanks!

‐‐‐ Original Message ‐‐‐
On 2 July 2018 12:21 PM, cloudstack-fan  wrote:

> Dear colleagues,
>
> I'm posting as an anonymous user, because there's a thing that concerns me a 
> little and I'd like to share my experience with you, so maybe some people 
> could relate to the same. ACS is amazing, it solves my tasks for 6 years, I'm 
> running a few ACS-backed clouds that contain hundreds and hundreds of VMs. 
> I'm enjoying ACS really much, but there's a thing that scares me sometimes.
>
> It happens pretty seldom, but the more VMs you have is the more chances you 
> run into this glitch. It usually happens on the sly and you don't get any 
> error messages in log-files of your cloudstack-management server or a 
> cloudstack-agent, so you don't even know that something had happened until 
> you see that a virtual machine is having major problems. If you're lucky, you 
> see it on the same day when it happens, but if you aren't - you won't suspect 
> anything unusual for a week, but at some moment you realize that the 
> filesystem had become a mess and you can't do anything to restore it. You're 
> trying to restore it from a snapshot, but if you don't have a snapshot that 
> would be created before the incident, your snapshots won't help. :-(
>
> I experienced it for about 5-7 times during the last 5-6 years and there are 
> a few conditions that always present:
>  * it happens on KVM-based hosts (I experienced it with CentOS 6 and CentOS 
> 7) with qcow2-images (either 0.10 and 1.1 versions);
>  * it happens on primary storages running different filesystems (I 
> experienced it with local XFS and network-based GFS2 and NFS);
>  * it happens when a volume snapshot is being made, according to the 
> log-files inside of a VM (guest's operating system's kernel starts 
> complaining on a filesystem errors);
>  * at the same time, as I wrote before, there are NO error messages in the 
> log-files outside of a VM which disk image is corrupted;
>  * but when you run `qemu-img check ...` to check the image, you may see a 
> lot of leaked clusters (that's why I'd strongly advice to check each and 
> every image one each and every primary storage at least once per hour by a 
> script being run by your monitoring system, something kind of `for imagefile 
> in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do { 
> /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... } fi; 
> } done`);
>  * when it happens you can also find a record in the snapshot_store_ref table 
> that refers to the snapshot on a primary storage (see an example here 
> https://pastebin.com/BuxCXVSq) - this record should have been removed when 
> the snapshot's state is being changed from "BackingUp" to "BackedUp", but it 
> isn't being removed in this case. At the same time, this snapshot isn't being 
> listed in the output of `qemu-img snapshot -l ...`, so that's why I suppose 
> that the image is being corrupted when ACS deletes the snapshot that has been 
> backed up (it tries to delete the snapshot, but something goes wrong, image 
> is being corrupted, but ACS thinks that everything's fine and changes the 
> status to "BackedUp" without a bit of qualm);
>  * if you're trying to restore this VM's image from the same snapshot that 
> has caused destruction or any other snapshot that has been made after that, 
> you'll find the same corrupted filesystem inside, but the snapshot's image 
> that is stored in your secondary storage doesn't show anything wrong when you 
> run `qemu-img check ...` (so you can restore your image only if you have a 
> snapshot that had been created AND stored before the incident).
>
> As I wrote, I saw several times in different environments and different 
> versions of ACS. I'm pretty sure that it's not only me who had such a luck to 
> experience the same glitch, so let's share our stories. Maybe together we'll 
> find out why does it happen and how to prevent that in future.
>
> Thanks in advance,
> An Anonymous ACS Fan