On 22/10/2019 15:56, Max Reitz wrote:
> On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 14:31, Max Reitz wrote:
>>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>>
>>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>>> To inform the block layer about writing all the data compressed, we
>>>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>>>> written data will be aligned to the cluster size at the generic layer.
>>>>>>>
>>>>>>> Suggested-by: Roman Kagan <rka...@virtuozzo.com>
>>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkev...@virtuozzo.com>
>>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsement...@virtuozzo.com>
>>>>>>> ---
>>>>>>>   block.c                   | 20 +++++++++++++++++++-
>>>>>>>   block/io.c                | 13 +++++++++----
>>>>>>>   block/qcow2.c             |  4 ++++
>>>>>>>   blockdev.c                |  9 ++++++++-
>>>>>>>   include/block/block.h     |  1 +
>>>>>>>   include/block/block_int.h |  2 ++
>>>>>>>   qapi/block-core.json      |  5 ++++-
>>>>>>>   qemu-options.hx           |  6 ++++--
>>>>>>>   8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> The problem with compression is that there are such tight constraints on
>>>>>> it that it can really only work for very defined use cases.  Those
>>>>>> constraints are:
>>>>>>
>>>>>> - Only write whole clusters,
>>>>>> - Clusters can be written to only once.
>>>>>>
>>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>>> But I don’t see how the second one can be addressed.  Well, maybe by
>>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>>> completely new cluster where you write the data (in case it grows), and
>>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>>> compression.
>>>>>>
>>>>>> (For demonstration:
>>>>>>
>>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>>> lazy_refcounts=off refcount_bits=16
>>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>>      -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>>                  'compress': true,
>>>>>>                  'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>>      -monitor stdio
>>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>>> wrote 65536/65536 bytes at offset 0
>>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>>> write failed: Input/output error
>>>>>>
>>>>>> )
>>>>>>
>>>>>> Compression really only works when you fully write all of an image
>>>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>>> this new option is really useful.
>>>>>>
>>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>>> good example, see my response there.)
>>>>>>
>>>>>> Max
>>>>>>
>>>>>
>>>>> Thank you very much Max for your detailed response.
>>>>>
>>>>> 1) You are right that compression is used with the backup mostly. The
>>>>> option for the compression with backup would be replaced by usage at
>>>>> the block layer, with no duplication.
>>>>> Also, it can be useful for NBD, for instance:
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 101M    ./image.qcow2
>>>>>
>>>>> and with the compression
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 5,3M    ./image.qcow2
>>>
>>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>>
>>> Attaching an NBD server to a compressed disk has exactly the same
>>> problem as attaching a compressed disk to a VM.  It won’t work unless
>>> the client/guest is aware of the limitations.
>>>
>>>>> The idea behind introducing the new 'compress' option is to use only
>>>>> that one instead of many other ones of such a kind.
>>>>>
>>>>> 2) You are right also that "Compression can't overwrite anything..."
>>>>> It can be seen in the commit message
>>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>>>
>>>>> I am not sure whether data should be written compressed to the active
>>>>> layer. I made the tests with the idea of giving examples of usage of
>>>>> the 'compress' option because passing an option is a tricky thing in
>>>>> QEMU.
>>>>> But the issue takes place anyway if we want to rewrite allocated
>>>>> clusters.
>>>>> I would like to investigate the matter and make a patch that resolves
>>>>> that issue.
>>>>> Do you agree with that?
>>>
>>> What seems wrong to me is that this series just adds a generic compress
>>> option without ensuring that it works generically.
>>>
>>> Either (1) it only works in well-defined cases; then either (1A) we have
>>> to ensure that we only allow it then (as we do now, because only
>>> qemu-img convert and the backup job have such an option; and those two
>>> are safe), or (1B) we have to clearly give a big warning for the new
>>> option that it doesn’t work correctly.  I don’t know whether such a
>>> warning is even possible with just a generic node-level option.
>>>
>>> Or (2) we make it work in generic cases.  Well, that might be possible
>>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>>> subformat?
>>>
>>> More on all of that below.
>>>
>>>> Yes, we want this option not to allow compressed writes for guests, but
>>>> to allow
>>>> - stream with compression (used to remove a compressed incremental
>>>>   backup; we need to merge it into the next incremental)
>>>
>>> Based on the assumption that one shouldn’t attach a compressed disk to a
>>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>>> intermediate streaming would work.
>>>
>>>> - backup with compression (we have an option already, so it works)
>>>> - backup to an nbd server with compression: enable compression on the
>>>>   nbd server
>>>
>>> The problem is clearly that if a generic client were to connect to the
>>> NBD server, it wouldn’t work.  In this case, compression will work only
>>> if the client understands the limitation.
>>>
>>> (The safe way would be to make the NBD server provide an option that
>>> clients can see so they know this limitation and agree they’ll adhere to
>>> it.  It’s also a stupid way.)
>>>
>>>> So instead of adding two options (for stream and for nbd), it seems
>>>> better to add only one at the generic layer.
>>>
>>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>>> should be a generic option.
>>>
>>>> Then, it becomes possible to run a guest on an image with compress=on.
>>>> It's a side effect, but still it should work correctly.
>>>
>>> How so?
>>> compress=on only works if every party involved only writes to any
>>> cluster of the image exactly once.  That is just not the case for
>>> guests unless they know this limitation, and even I don’t see a use
>>> case.
>>>
>>>> I think the simplest thing is to just run a normal write if a
>>>> compressed write failed because of reallocation. We should check that
>>>> ENOTSUP is returned on that failure path and handle it for the
>>>> compress=on option by falling back to a normal write in generic
>>>> block/io.c
>>>
>>> It seems wrong to not compress data with compress=on.
>>
>> We already fall back to a normal write if we can't compress in qcow2.
>
> In a very specific case, namely where the compressed size is larger than
> the uncompressed size.  It would be a whole different story to only
> compress the first write to a given cluster but not any following one.
>
>>>> (Note that in the case of stream we rewrite only unallocated clusters)
>>>
>>> Yes, but if the stream job writes the cluster before the guest, the
>>> guest gets an I/O error.
>>
>> I don't think that using compressed writes from a guest is a really
>> usable thing. In our case with stream there is no guest (we just use the
>> qemu binary to operate on the block layer)
>>
>>>
>>>
>>> By the way, the other thing I wondered was whether this should be a
>>> filter driver instead (if it makes sense at all).  Such a filter driver
>>> would at least be sufficiently cumbersome to use that probably only
>>> users who understand the implications would make use of it (1B above).
>>>
>>>
>>> I’m not against adding a generic compress=on option if we ensure that it
>>> actually works generically (2 above).  It doesn’t right now, so I don’t
>>> think this is right.
>>>
>>> I’m already not sure whether it’s really possible to support generic
>>> compressed writes in qcow2.  I suppose it’d be at least awkward.
>>> In theory it should work; it’s just that we can’t keep track of
>>> subcluster allocations, which in the worst case means that some
>>> compressed clusters take up a whole cluster of space.
>>>
>>> For VMDK... I suppose we could consider a new flag for block drivers
>>> that flags whether a driver supports arbitrary compressed writes or has
>>> the old limitations.  compress=on could then refuse to work on any block
>>> driver but the ones that support arbitrary compressed writes.
>>>
>>>
>>> Our current model (1A) is simply to ensure that all compressed writes
>>> can adhere to the limitations.  As I’ve said above, extending this to
>>> NBD would mean adding some NBD negotiation so both client and server
>>> agree on this limitation.
>>
>> In this case, I'd prefer a simple cmdline option for qemu-nbd.
>>
>>> That seems kind of reasonable from the qemu side, but probably very
>>> unreasonable from an NBD side.  Why would NBD bother to reproduce
>>> qemu’s limitations?
>>>
>>>
>>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>>> a generic option that unsuspecting users are not unlikely to use but
>>> that simply doesn’t work generically doesn’t seem right to me.
>>>
>>
>> Firstly, 1A already doesn't work correctly: we have a compress option
>> for backup. So it will not work if the backup target is not empty.
>
> I’m not sure whether that qualifies because the user is simply
> responsible for ensuring that the target is empty.
>
> Otherwise, you could also argue that backup doesn’t necessarily create a
> real backup because with sync=top it won’t copy unallocated clusters, so
> if the target already has data in such a place, it won’t be overwritten.
>
> Furthermore I’d argue this would hardly be an accident.  Both
> drive-backup with mode=existing and qemu-img convert -n are most likely
> not used blindly but only when the situation requires it.
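To make the write-once constraint discussed above concrete, here is a toy Python model, not QEMU code: the class, its fields, and the error convention are all invented for illustration. It mirrors the qemu-io demonstration earlier in the thread, where the second write to the same cluster fails.

```python
# Toy model of write-once compressed clusters (NOT QEMU code; all names
# are invented). A compressed blob cannot be rewritten in place because
# the new blob may not fit the slot the old one occupies.
import errno
import zlib

CLUSTER_SIZE = 65536

class CompressedImage:
    def __init__(self):
        self.l2_table = {}  # guest cluster index -> compressed blob

    def write_compressed(self, cluster_index, data):
        assert len(data) == CLUSTER_SIZE  # constraint 1: whole clusters only
        if cluster_index in self.l2_table:
            # constraint 2: clusters can be written to only once
            return -errno.EIO
        self.l2_table[cluster_index] = zlib.compress(data)
        return 0

img = CompressedImage()
assert img.write_compressed(0, b'\x2a' * CLUSTER_SIZE) == 0           # like "write -P 42"
assert img.write_compressed(0, b'\x17' * CLUSTER_SIZE) == -errno.EIO  # like "write -P 23"
```

A fully sequential writer (qemu-img convert, a backup job) never trips the second branch, which is exactly why those two users are safe today.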
>
> My point is that using compress=on can be an accident because people see
> that option and think “That sounds useful!” without reading the warning.
>
>> So 1A is impossible, as it's already broken; adding separate options for
>> stream and qemu-nbd is not better than just adding one generic option.
>
> I disagree that 1A is broken, and I disagree that a generic option would
> not be worse.  I still believe a simple generic option is prone to
> accidental misuse.
>
>> I don't like (2) as it means a lot of effort to support an actually
>> unneeded case.
>
> Well, you say it isn’t needed.  I suppose if it were supported, people
> would indeed use it.
>
>> I don't really like the filter solution (as it seems too much to add a
>> filter for a simple boolean option), but I think we can live with it.
>
> First, it is not a simple boolean option.  This patch brings an
> implementation with it that converts writes to compressed writes and
> sets a request alignment.  I actually think it’s better to do things
> like that in a dedicated block driver than in the generic block layer
> code.
>
> Second, we already have basically this for copy-on-read (which is also a
> boolean option for -drive).
>
> (FWIW, I’d prefer it if detect-zeroes were a block driver instead of
> generic block layer code.)
>
>> 1B is OK for me, that is, just document what the option does in fact.
>>
>> Maybe name the option "compress-new-writes" instead of just "compress"?
>> And document that writes to clusters which were not written before will
>> be compressed?
>>
>> Or make the option compress=<x>, where x may be 'no' or 'new-writes',
>> reserving 'all-writes' for the future?
>
> I don’t know whether the limitations can be captured in three words, but
> it would at least make users pay closer attention to what the option is
> really doing.
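The proposed 'new-writes' semantics combined with the ENOTSUP-fallback idea can be sketched as a toy Python model (again NOT QEMU code; the class and its bookkeeping are invented): compress only the first write to a cluster, and silently fall back to a plain uncompressed write on a rewrite.

```python
# Toy model of hypothetical compress=new-writes semantics (invented
# names, not QEMU code): first write to a cluster is compressed, any
# rewrite falls back to a plain full-size write instead of failing.
import zlib

CLUSTER_SIZE = 65536

class Image:
    def __init__(self):
        self.clusters = {}  # index -> (is_compressed, payload)

    def write(self, idx, data, compress=False):
        if compress and idx not in self.clusters:
            # new cluster: compressed write succeeds
            self.clusters[idx] = (True, zlib.compress(data))
        else:
            # rewrite (or compression not requested): fall back to a
            # normal write, as proposed for the ENOTSUP failure path
            self.clusters[idx] = (False, data)

img = Image()
img.write(0, b'\x2a' * CLUSTER_SIZE, compress=True)
assert img.clusters[0][0] is True   # first write was compressed
img.write(0, b'\x17' * CLUSTER_SIZE, compress=True)
assert img.clusters[0][0] is False  # rewrite fell back to uncompressed
```

This is exactly the behavior Max objects to under a plain compress=on name: after a workload with rewrites, an unknown fraction of the image ends up uncompressed, so a name like compress=new-writes at least advertises it.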
>
> OTOH, the only downsides of a dedicated block driver to me appear to be
> that it’s more cumbersome to use (which I actually consider a benefit)
> and that it requires the usual filter driver boilerplate – but that can
> be copied from other drivers[1].
>
> (The benefits are that we don’t need to modify generic code and that
> people need to read into the filter and its caveats before they can use
> it.)
>
> Max
>
>
> [1] Actually, we could provide default implementations somewhere and
> make block filters use them.  That should reduce the boilerplate size.
>
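The reallocation concern Max raises at the top of the thread (a rewrite must allocate a completely new cluster and leaves a hole behind) can also be made concrete with a toy Python model; everything here is invented for illustration and is not how qcow2 actually lays out its file.

```python
# Toy model of copy-on-write reallocation for compressed clusters
# (invented names, not QEMU/qcow2 code): a rewrite appends a new
# compressed blob and abandons the old one, accumulating wasted space.
import zlib

CLUSTER_SIZE = 65536

class CowCompressedImage:
    def __init__(self):
        self.blobs = []   # append-only "file": list of compressed blobs
        self.l2 = {}      # cluster index -> position in self.blobs
        self.holes = 0    # bytes of abandoned (unreachable) compressed data

    def write_compressed(self, idx, data):
        if idx in self.l2:
            # COW: the old blob stays where it is and becomes a hole
            self.holes += len(self.blobs[self.l2[idx]])
        self.blobs.append(zlib.compress(data))
        self.l2[idx] = len(self.blobs) - 1

img = CowCompressedImage()
img.write_compressed(0, b'\x2a' * CLUSTER_SIZE)
assert img.holes == 0
img.write_compressed(0, b'\x17' * CLUSTER_SIZE)
assert img.holes > 0  # the rewrite succeeded, but space is now wasted
```

Under a rewrite-heavy workload the holes grow without bound unless something compacts them, which is the sense in which COW support "kind of defeats the purpose of compression".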
If support of COW for compressed writes is found feasible, would it make
sense to implement it? This series would then follow.
A comment for the option would warn the user that guest writes will be
slower if compress is selected.

Andrey
-- 
With the best regards,
Andrey Shinkevich