On 22/10/2019 15:56, Max Reitz wrote:
> On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 14:31, Max Reitz wrote:
>>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>>
>>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>>> To inform the block layer about writing all the data compressed, we
>>>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>>>> written data will be aligned to the cluster size at the generic layer.
>>>>>>>
>>>>>>> Suggested-by: Roman Kagan <rka...@virtuozzo.com>
>>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkev...@virtuozzo.com>
>>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>>> ---
>>>>>>>   block.c                   | 20 +++++++++++++++++++-
>>>>>>>   block/io.c                | 13 +++++++++----
>>>>>>>   block/qcow2.c             |  4 ++++
>>>>>>>   blockdev.c                |  9 ++++++++-
>>>>>>>   include/block/block.h     |  1 +
>>>>>>>   include/block/block_int.h |  2 ++
>>>>>>>   qapi/block-core.json      |  5 ++++-
>>>>>>>   qemu-options.hx           |  6 ++++--
>>>>>>>   8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> The problem with compression is that there are such tight constraints on
>>>>>> it that it can really only work for very defined use cases.  Those
>>>>>> constraints are:
>>>>>>
>>>>>> - Only write whole clusters,
>>>>>> - Clusters can be written to only once.
>>>>>>
>>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>>> But I don’t see how the second one can be addressed.  Well, maybe by
>>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>>> completely new cluster where you write the data (in case it grows), and
>>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>>> compression.
>>>>>>
>>>>>> (For demonstration:
>>>>>>
>>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>>> lazy_refcounts=off refcount_bits=16
>>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>>      -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>>                  'compress': true,
>>>>>>                  'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>>      -monitor stdio
>>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>>> wrote 65536/65536 bytes at offset 0
>>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>>> write failed: Input/output error
>>>>>>
>>>>>> )
>>>>>>
>>>>>> Compression really only works when you fully write all of an image
>>>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>>> this new option is really useful.
>>>>>>
>>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>>> good example, see my response there.)
>>>>>>
>>>>>> Max
>>>>>>
>>>>>
>>>>> Thank you very much Max for your detailed response.
>>>>>
>>>>> 1) You are right that compression is used with the backup mostly. The
>>>>> option for the compression with backup would be replaced by usage at
>>>>> the block layer, with no duplication.
>>>>> Also, it can be useful for NBD, for instance:
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 101M    ./image.qcow2
>>>>>
>>>>> and with the compression
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 5,3M    ./image.qcow2
>>>
>>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>>
>>> Attaching an NBD server to a compressed disk has exactly the same
>>> problem as attaching a compressed disk to a VM.  It won’t work unless
>>> the client/guest is aware of the limitations.
>>>
>>>>> The idea behind introducing the new 'compress' option is to use only
>>>>> that one instead of many other ones of such a kind.
>>>>>
>>>>> 2) You are right also that "Compression can't overwrite anything..."
>>>>> It can be seen in the commit message
>>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>>>
>>>>> I am not sure whether data should be written compressed to the active
>>>>> layer. I made the tests with the idea of giving examples of usage of
>>>>> the 'compress' option because passing an option is a tricky thing in
>>>>> QEMU.
>>>>> But the issue takes place anyway if we want to rewrite allocated
>>>>> clusters.
>>>>> I would like to investigate the matter and make a patch that resolves
>>>>> that issue.
>>>>> Do you agree with that?
>>>
>>> What seems wrong to me is that this series just adds a generic compress
>>> option without ensuring that it works generically.
>>>
>>> Either (1) it only works in well-defined cases; then either (1A) we have
>>> to ensure that we only allow it then (as we do now, because only
>>> qemu-img convert and the backup job have such an option; and those two
>>> are safe), or (1B) we have to clearly give a big warning for the new
>>> option that it doesn’t work correctly.  I don’t know whether such a
>>> warning is even possible with just a generic node-level option.
>>>
>>> Or (2) we make it work in generic cases.  Well, that might be possible
>>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>>> subformat?
>>>
>>> More on all of that below.
>>>
>>>> Yes, we want this option not to allow compressed writes for guests, but
>>>> to allow
>>>> - stream with compression (used to remove a compressed incremental
>>>>   backup; we need to merge it into the next incremental)
>>>
>>> Based on the assumption that one shouldn’t attach a compressed disk to a
>>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>>> intermediate streaming would work.
>>>
>>>> - backup with compression (we have an option already, so it works)
>>>> - backup to an nbd server with compression: enable compression on the
>>>>   nbd server
>>>
>>> The problem is clearly that if a generic client were to connect to the
>>> NBD server, it wouldn’t work.  In this case, compression will work only
>>> if the client understands the limitation.
>>>
>>> (The safe way would be to make the NBD server provide an option that
>>> clients can see so they know this limitation and agree they’ll adhere to
>>> it.  It’s also a stupid way.)
>>>
>>>> So instead of adding two options (for stream and for nbd), it seems
>>>> better to add only one at the generic layer.
>>>
>>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>>> should be a generic option.
>>>
>>>> Then, it becomes possible to run a guest on an image with compress=on.
>>>> It's a side effect, but still it should work correctly.
>>>
>>> How so?
>>> compress=on only works if every party involved only writes to any
>>> cluster of the image exactly once.  That is just not the case for
>>> guests unless they know this limitation, and even I don’t see a use
>>> case.
>>>
>>>> I think the simplest thing is to just run a normal write if a
>>>> compressed write failed because of reallocation. We should check that
>>>> ENOTSUP is returned on that failure path and handle it for the
>>>> compress=on option by falling back to a normal write in generic
>>>> block/io.c
>>>
>>> It seems wrong to not compress data with compress=on.
>>
>> We already fall back to a normal write if we can't compress in qcow2.
>
> In a very specific case, namely where the compressed size is larger than
> the uncompressed size.  It would be a whole different story to only
> compress the first write to a given cluster but not any following one.
>
>>>> (Note that in the case of stream we rewrite only unallocated clusters)
>>>
>>> Yes, but if the stream job writes the cluster before the guest, the
>>> guest gets an I/O error.
>>
>> I don't think that using compressed writes from a guest is a really
>> usable thing. In our case with stream there is no guest (we just use the
>> qemu binary to operate on the block layer)
>>
>>>
>>>
>>> By the way, the other thing I wondered was whether this should be a
>>> filter driver instead (if it makes sense at all).  Such a filter driver
>>> would at least be sufficiently cumbersome to use that probably only
>>> users who understand the implications would make use of it (1B above).
>>>
>>>
>>> I’m not against adding a generic compress=on option if we ensure that it
>>> actually works generically (2 above).  It doesn’t right now, so I don’t
>>> think this is right.
>>>
>>> I’m already not sure whether it’s really possible to support generic
>>> compressed writes in qcow2.  I suppose it’d be at least awkward.
>>> In theory it should work; it’s just that we can’t keep track of
>>> subcluster allocations, which in the worst case means that some
>>> compressed clusters take up a whole cluster of space.
>>>
>>> For VMDK... I suppose we could consider a new flag for block drivers
>>> that flags whether a driver supports arbitrary compressed writes or has
>>> the old limitations.  compress=on could then refuse to work on any block
>>> driver but the ones that support arbitrary compressed writes.
>>>
>>>
>>> Our current model (1A) is simply to ensure that all compressed writes
>>> can adhere to the limitations.  As I’ve said above, extending this to
>>> NBD would mean adding some NBD negotiation so both client and server
>>> agree on this limitation.
>>
>> In this case, I'd prefer a simple cmdline option for qemu-nbd.
>>
>>> That seems kind of reasonable from the qemu side, but probably very
>>> unreasonable from an NBD side.  Why would NBD bother to reproduce
>>> qemu’s limitations?
>>>
>>>
>>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>>> a generic option that unsuspecting users are not unlikely to use but
>>> that simply doesn’t work generically doesn’t seem right to me.
>>>
>>
>> Firstly, 1A already doesn't work correctly: we have a compress option
>> for backup. So it will not work if the backup target is not empty.
>
> I’m not sure whether that qualifies because the user is simply
> responsible for ensuring that the target is empty.
>
> Otherwise, you could also argue that backup doesn’t necessarily create a
> real backup because with sync=top it won’t copy unallocated clusters, so
> if the target already has data in such a place, it won’t be overwritten.
>
> Furthermore I’d argue this would hardly be an accident.  Both
> drive-backup with mode=existing and qemu-img convert -n are most likely
> not used blindly but only when the situation requires it.
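To make the write-once constraint discussed above concrete, here is a toy Python model, not QEMU code: the class, its fields, and the error convention are all invented for illustration. It mirrors the qemu-io demonstration earlier in the thread, where the second write to the same cluster fails.

```python
# Toy model of write-once compressed clusters (NOT QEMU code; all names
# are invented). A compressed blob cannot be rewritten in place because
# the new blob may not fit the slot the old one occupies.
import errno
import zlib

CLUSTER_SIZE = 65536

class CompressedImage:
    def __init__(self):
        self.l2_table = {}  # guest cluster index -> compressed blob

    def write_compressed(self, cluster_index, data):
        assert len(data) == CLUSTER_SIZE  # constraint 1: whole clusters only
        if cluster_index in self.l2_table:
            # constraint 2: clusters can be written to only once
            return -errno.EIO
        self.l2_table[cluster_index] = zlib.compress(data)
        return 0

img = CompressedImage()
assert img.write_compressed(0, b'\x2a' * CLUSTER_SIZE) == 0           # like "write -P 42"
assert img.write_compressed(0, b'\x17' * CLUSTER_SIZE) == -errno.EIO  # like "write -P 23"
```

A fully sequential writer (qemu-img convert, a backup job) never trips the second branch, which is exactly why those two users are safe today.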
>
> My point is that using compress=on can be an accident because people see
> that option and think “That sounds useful!” without reading the warning.
>
>> So 1A is impossible, as it's already broken; adding separate options for
>> stream and qemu-nbd is not better than just adding one generic option.
>
> I disagree that 1A is broken, and I disagree that a generic option would
> not be worse.  I still believe a simple generic option is prone to
> accidental misuse.
>
>> I don't like (2) as it means a lot of effort to support an actually
>> unneeded case.
>
> Well, you say it isn’t needed.  I suppose if it were supported, people
> would indeed use it.
>
>> I don't really like the filter solution (as it seems too much to add a
>> filter for a simple boolean option), but I think we can live with it.
>
> First, it is not a simple boolean option.  This patch brings an
> implementation with it that converts writes to compressed writes and
> sets a request alignment.  I actually think it’s better to do things
> like that in a dedicated block driver than in the generic block layer
> code.
>
> Second, we already have basically this for copy-on-read (which is also a
> boolean option for -drive).
>
> (FWIW, I’d prefer it if detect-zeroes were a block driver instead of
> generic block layer code.)
>
>> 1B is OK for me, that is, just document what the option does in fact.
>>
>> Maybe name the option "compress-new-writes" instead of just "compress"?
>> And document that writes to clusters which were not written before will
>> be compressed?
>>
>> Or make the option compress=<x>, where x may be 'no' or 'new-writes',
>> reserving 'all-writes' for the future?
>
> I don’t know whether the limitations can be captured in three words, but
> it would at least make users pay closer attention to what the option is
> really doing.
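The proposed 'new-writes' semantics combined with the ENOTSUP-fallback idea can be sketched as a toy Python model (again NOT QEMU code; the class and its bookkeeping are invented): compress only the first write to a cluster, and silently fall back to a plain uncompressed write on a rewrite.

```python
# Toy model of hypothetical compress=new-writes semantics (invented
# names, not QEMU code): first write to a cluster is compressed, any
# rewrite falls back to a plain full-size write instead of failing.
import zlib

CLUSTER_SIZE = 65536

class Image:
    def __init__(self):
        self.clusters = {}  # index -> (is_compressed, payload)

    def write(self, idx, data, compress=False):
        if compress and idx not in self.clusters:
            # new cluster: compressed write succeeds
            self.clusters[idx] = (True, zlib.compress(data))
        else:
            # rewrite (or compression not requested): fall back to a
            # normal write, as proposed for the ENOTSUP failure path
            self.clusters[idx] = (False, data)

img = Image()
img.write(0, b'\x2a' * CLUSTER_SIZE, compress=True)
assert img.clusters[0][0] is True   # first write was compressed
img.write(0, b'\x17' * CLUSTER_SIZE, compress=True)
assert img.clusters[0][0] is False  # rewrite fell back to uncompressed
```

This is exactly the behavior Max objects to under a plain compress=on name: after a workload with rewrites, an unknown fraction of the image ends up uncompressed, so a name like compress=new-writes at least advertises it.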
>
> OTOH, the only downsides of a dedicated block driver to me appear to be
> that it’s more cumbersome to use (which I actually consider a benefit)
> and that it requires the usual filter driver boilerplate – but that can
> be copied from other drivers[1].
>
> (The benefits are that we don’t need to modify generic code and that
> people need to read into the filter and its caveats before they can use
> it.)
>
> Max
>
>
> [1] Actually, we could provide default implementations somewhere and
> make block filters use them.  That should reduce the boilerplate size.
>
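The reallocation concern Max raises at the top of the thread (a rewrite must allocate a completely new cluster and leaves a hole behind) can also be made concrete with a toy Python model; everything here is invented for illustration and is not how qcow2 actually lays out its file.

```python
# Toy model of copy-on-write reallocation for compressed clusters
# (invented names, not QEMU/qcow2 code): a rewrite appends a new
# compressed blob and abandons the old one, accumulating wasted space.
import zlib

CLUSTER_SIZE = 65536

class CowCompressedImage:
    def __init__(self):
        self.blobs = []   # append-only "file": list of compressed blobs
        self.l2 = {}      # cluster index -> position in self.blobs
        self.holes = 0    # bytes of abandoned (unreachable) compressed data

    def write_compressed(self, idx, data):
        if idx in self.l2:
            # COW: the old blob stays where it is and becomes a hole
            self.holes += len(self.blobs[self.l2[idx]])
        self.blobs.append(zlib.compress(data))
        self.l2[idx] = len(self.blobs) - 1

img = CowCompressedImage()
img.write_compressed(0, b'\x2a' * CLUSTER_SIZE)
assert img.holes == 0
img.write_compressed(0, b'\x17' * CLUSTER_SIZE)
assert img.holes > 0  # the rewrite succeeded, but space is now wasted
```

Under a rewrite-heavy workload the holes grow without bound unless something compacts them, which is the sense in which COW support "kind of defeats the purpose of compression".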
If support of COW for compressed writes is found feasible, would it make
sense to implement it? This series would then follow.
A comment for the option would warn the user that guest writes will be
slower if compress is selected.

Andrey
-- 
With the best regards,
Andrey Shinkevich