[Qemu-discuss] Why are qcow2 internal snapshots (savevm and otherwise) still so slow

Jakob Bohm Mon, 03 Nov 2014 07:50:16 -0800

This is just a burp of frustration, after years of running qemu+kvm.


I am currently running qemu-system version 2.1.0 (Debian package
2.1+dfsg-2~bpo70+2), and after several years of ongoing bug reports
(and various patches) from others, snapshots are still painfully
slow.

Even simple snapshot operations take large amounts of time.
Creating and removing snapshots from a running VM makes the VM
unresponsive (pings etc. fail), often for as much as 30 minutes
to an hour.

Note that this is happening on a production system, so
experimentation is limited and unsafe cache settings are just not
an option.

In one qemu bug tracker (not sure if it is the current one), I
found reports and patches for at least 3 internal issues related
to this:

1. Unnecessary disk flushes for each update of each entry in some
qcow2 file format tables (I think it was the L1 or L2 tables).
It seems that functions intended to perform and flush single
allocations are being misused by calling them in a loop for the
bulk allocations involved in snapshots.

2. Failure to remove a temporary copy of the savevm memory image
from the HEAD of the snapshot tree (not sure why it was written
there and not in the snapshot itself).

3. Metadata pre-allocation not surviving snapshot creation/removal.
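
To illustrate the cost pattern in item 1, here is a toy sketch. It is
not qemu code: FakeImage, update_one_at_a_time and update_in_bulk are
hypothetical names, and the "image" only counts operations. It assumes
the bulk case can safely defer its flush to the end, which a real
implementation would have to verify against its ordering requirements.

```python
class FakeImage:
    """Toy stand-in for an image file; it only counts writes and flushes."""

    def __init__(self):
        self.table = {}
        self.writes = 0
        self.flushes = 0

    def write_entry(self, index, value):
        self.table[index] = value
        self.writes += 1

    def flush(self):
        self.flushes += 1


def update_one_at_a_time(image, entries):
    # Pattern described in item 1: a single-entry helper called in a loop,
    # paying for one flush per table entry.
    for index, value in entries.items():
        image.write_entry(index, value)
        image.flush()


def update_in_bulk(image, entries):
    # Alternative: write every entry first, then flush once at the end.
    for index, value in entries.items():
        image.write_entry(index, value)
    image.flush()


# 1000 hypothetical table entries; the values are arbitrary.
entries = {i: i * 65536 for i in range(1000)}

slow = FakeImage()
update_one_at_a_time(slow, entries)

fast = FakeImage()
update_in_bulk(fast, entries)

print(slow.flushes, fast.flushes)  # 1000 flushes versus 1
```

Both variants leave the table in the same state; only the number of
flushes differs, and on a real disk each flush is the expensive part.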

In addition to these bug reports, I have noticed in other documents
that snapshot-related features, such as streaming blocks to combine
snapshots, are inexplicably designed to cover only atypical cases:
the direction in which blocks are copied does not match what the
snapshot commands exposed in the user interfaces actually need.

There is also a lack of clear rules as to how the qcow2 format
handles being backed up while "live" and later restored, then
jumped to a snapshot made just before the backup (this is the
standard scenario for snapshot-based backups and restores).  For
instance this may cause the "reference count" fields in a restored
file to be out of sync with the referring tables, if one or more
blocks were written while the backup program was reading the qcow2
file sequentially.
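
A toy model of that hazard, as a sketch only: the four-block "image"
and its layout (a reference count at block 0, the referring table
entry at block 3) are made up for illustration and have nothing to do
with the real qcow2 on-disk layout. The names sequential_copy and
vm_write are hypothetical.

```python
def sequential_copy(image, on_block_read):
    """Copy blocks in order, invoking a callback after each block is read."""
    copy = []
    for offset in range(len(image)):
        copy.append(image[offset])
        on_block_read(offset, image)  # the VM keeps writing during the copy
    return copy


# Hypothetical layout: block 0 holds a reference count, block 3 holds
# the table entry that refers to the counted cluster.
REFCOUNT, TABLE = 0, 3
image = [1, "data", "data", "entry-v1"]


def vm_write(offset, img):
    # Once the backup has read block 1, the guest triggers an update
    # that touches both the refcount and the referring table.
    if offset == 1:
        img[REFCOUNT] = 2
        img[TABLE] = "entry-v2"


backup = sequential_copy(image, vm_write)

# The backup holds the old refcount next to the new table entry.
print(backup[REFCOUNT], backup[TABLE])  # 1 entry-v2
```

The live image ends up consistent (refcount 2, entry-v2), but the copy
pairs a stale refcount with the updated table, which is exactly the
out-of-sync state a restored file might contain.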

It seems strange that such a basic operation, using the native qemu
file format, isn't considered a priority in terms of reliability
and performance.

P.S.

In case you didn't know, the standard way to back up virtual
machines (qemu or otherwise) is this sequence:

- Create snapshot named "Backup #xxxx" using savevm.
- Sequentially copy disk image file using a tool such as GNU tar.
 (Any byte ranges that change due to the running VM may get backed
 up with their values at any time during this copy operation).
- Remove snapshot named "Backup #xxxx"

After the disaster:

- Restore disk image file as it was seen by the backup tool.
- Restart virtual machine from the disk image, memory image etc.
 represented by the snapshot named "Backup #xxxx" (loadvm).
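
Written out as a sketch, the two sequences look roughly like this. The
code only builds the command strings and runs nothing. savevm, delvm
and loadvm are the monitor commands; the tag "backup-0001" and the
file names are hypothetical placeholders, and it assumes the VM is
started on the restored image before loadvm is issued.

```python
def backup_steps(tag, image_path, archive_path):
    """Return the backup sequence as (where, command) pairs."""
    return [
        ("monitor", f"savevm {tag}"),                        # snapshot the running VM
        ("shell", f"tar -cf {archive_path} {image_path}"),   # sequential copy
        ("monitor", f"delvm {tag}"),                         # drop the snapshot again
    ]


def restore_steps(tag, archive_path):
    """Return the restore sequence as (where, command) pairs."""
    return [
        ("shell", f"tar -xf {archive_path}"),  # image as seen by the backup tool
        ("monitor", f"loadvm {tag}"),          # jump back to the snapshot state
    ]


# Hypothetical names, for illustration only.
backup = backup_steps("backup-0001", "disk.qcow2", "disk-backup.tar")
restore = restore_steps("backup-0001", "disk-backup.tar")

for where, command in backup + restore:
    print(f"[{where}] {command}")
```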


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  http://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
