Am 04.03.2021 um 15:08 hat Stefano Garzarella geschrieben: > On Thu, Mar 04, 2021 at 01:05:02PM +0100, Kevin Wolf wrote: > > Am 03.03.2021 um 18:40 hat Stefano Garzarella geschrieben: > > > Hi Jason, > > > as reported in this BZ [1], when qemu-img creates a QCOW2 image on RBD > > > writing data is very slow compared to a raw file. > > > > > > Comparing raw vs QCOW2 image creation with RBD I found that we use a > > > different object size, for the raw file I see '4 MiB objects', for > > > QCOW2 I > > > see '64 KiB objects' as reported on comment 14 [2]. > > > This should be the main issue of slowness, indeed forcing in the code 4 > > > MiB > > > object size also for QCOW2 increased the speed a lot. > > > > > > Looking better I discovered that for raw files, we call rbd_create() with > > > obj_order = 0 (if 'cluster_size' options is not defined), so the default > > > object size is used. > > > Instead for QCOW2, we use obj_order = 16, since the default 'cluster_size' > > > defined for QCOW2, is 64 KiB. > > > > Hm, the QemuOpts-based image creation is messy, but why does the rbd > > driver even see the cluster_size option? > > > > The first thing qcow2_co_create_opts() does is splitting the passed > > QemuOpts into options it will process on the qcow2 layer and options > > that are passed to the protocol layer. So if you pass a cluster_size > > option, qcow2 should take it for itself and not pass it to rbd. > > > > If it is passed to rbd, I think that's a bug in the qcow2 driver. > > IIUC qcow2 properyl remove it, but when rbd uses qemu_opt_get_size_del(opts, > BLOCK_OPT_CLUSTER_SIZE, 0) the default value of qcow2 format is returned. > > Going in depth in qemu_opt_get_size_helper(), I found that qemu_opt_find() > properly returns a NULL pointer, but then we call find_default_by_name() > that returns the default value of qcow2 format (64k).
Ugh, I see why. We're passing the protocol driver a QemuOpts that was created for a QemuOptsList with the qcow2 default, not for its own QemuOptsList. This is wrong. Note that the QemuOptsList is not qcow2_create_opts itself, but a list that is created with qemu_opts_append() to combine qcow2 and rbd options into a new QemuOptsList. For overlapping options, the format wins. I don't think you can change the QemuOptsList of an existing QemuOpts, nor is there a clone operation that could just copy all options into a new QemuOpts created for the rbd QemuOptsList, so maybe the easiest hack^Wsolution would be converting to QDict and back... > > > Using '-o cluster_size=2M' with qemu-img changed only the qcow2 cluster > > > size, since in qcow2_co_create_opts() we remove the 'cluster_size' from > > > QemuOpts calling qemu_opts_to_qdict_filtered(). > > > For some reason that I have yet to understand, after this deletion, > > > however > > > remains in QemuOpts the default value of 'cluster_size' for qcow2 (64 > > > KiB), > > > that it's used in qemu_rbd_co_create_opts() > > > > So it seems you came to a similar conclusion. We need to find out where > > the 64k come from and just fix that so that rbd uses its default. > > Yes, I tried debugging above, but I'm not sure how to fix it. > > Maybe a new parameter in qemu_opt_get_size_helper() to prevent it from > looking for the default value. > Or we should prevent the default value from being added to the > opts->list->desc, but that part is still not very clear to me. opts->list is already wrong, I think this is what we need to fix. > > > At this point my doubts are: > > > Does it make sense to use the same cluster_size as qcow2 as object_size in > > > RBD? > > > If we want to keep the 2 options separated, how can it be done? Should we > > > rename the option in block/rbd.c? > > > > My lazy answer is that you could just use QMP blockdev-create, where you > > create layer by layer separately. > > > > What could possibly be done for the QemuOpts is using the dotted syntax > > like for opening, so you could specify file.cluster_size=... for the > > protocol layer (or data_file.cluster_size=... for the external data > > file etc.) > > This would be cool :-) I'm almost sure that compatibility will make this more complicated than it sounds, but we could have a try. Kevin