On 2018-08-28 23:59, Eric Blake wrote:
> [following up to a different set of emails]
>
> On 08/28/2018 03:41 PM, Eric Blake wrote:
>> Revisiting this:
>>
>> On 08/01/2018 09:41 AM, Eric Blake wrote:
>>> Rich Jones pointed me to questionable behavior in qemu's NBD server
>>> implementation today: qemu advertises a minimum block size of 512 to
>>> any client that promises to honor block sizes, but when serving up a
>>> raw file that is not aligned to a sector boundary, attempting to read
>>> that final portion of the file results in a structured read with two
>>> chunks, the first for the data up to the end of the actual file, and
>>> the second reporting a hole for the rest of the sector. If a client
>>> is promising to obey block sizes on its requests, it seems odd that
>>> the server is allowed to send a result that is not also aligned to
>>> block sizes.
>>>
>>> Right now, the NBD spec says that when structured replies are in use,
>>> then for a structured read:
>>>
>>> The server MAY split the reply into any number of content chunks;
>>> each chunk MUST describe at least one byte, although to minimize
>>> overhead, the server SHOULD use chunks with lengths and offsets as
>>> an integer multiple of 512 bytes, where possible (the first and
>>> last chunk of an unaligned read being the most obvious places for
>>> an exception).
>>>
>>> I'm wondering if we should tighten that to require that the server
>>> partition the reply chunks to be aligned to the advertised minimum
>>> block size (at which point, qemu should either advertise 1 instead of
>>> 512 as the minimum size when serving up an unaligned file, or else
>>> qemu should just send the final partial sector as a single data chunk
>>> rather than trying to report the last few bytes as a hole).
>>>
>>> For comparison, on block status, we require:
>>>
>>> The server SHOULD use descriptor
>>> lengths that are an integer multiple of 512 bytes where possible
>>> (the first and last descriptor of an unaligned query being the
>>> most obvious places for an exception), and MUST use descriptor
>>> lengths that are an integer multiple of any advertised minimum
>>> block size.
>>>
>>> And qemu as a client currently hangs up on any server that violates
>>> that requirement on block status (that is, when qemu as the server
>>> tries to send a block status that was not aligned to the advertised
>>> block size, qemu as the client flags it as an invalid server - which
>>> means qemu as server is currently broken). So I'm thinking we should
>>> copy that requirement onto servers for reads as well.
>>
>> Vladimir pointed out that the problem is not necessarily just limited
>> to the implicit hole at the end of a file that was rounded up to
>> sector size. Another case where sub-region changes occur in qemu is
>> where you have a backing file with 512-byte hole granularity (qemu-img
>> create -f qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay
>> with larger granularity (qemu-img create -f qcow2 -b backing.qcow2 -F
>> qcow2 -o cluster_size=4k active.qcow2). On a cluster where the top
>> layer defers to the underlying layer, it is possible to alternate
>> between holes and data at sector boundaries but at subsets of the
>> cluster boundary of the top layer.
>> As long as qemu advertises a
>> minimum block size of 512 rather than the cluster size, then this
>> isn't a problem, but if qemu were to change to report the qcow2
>> cluster size as its minimum I/O (rather than merely its preferred I/O,
>> because it can do read-modify-write on data smaller than a cluster),
>> this would be another case where unaligned replies might confuse a
>> client.
>
> So, I tried to actually create this scenario, to see what actually
> happens, and we have a worse bug that needs to be resolved first. That
> is, bdrv_block_status_above() chokes when there is a blkdebug node in
> the works:
>
> $ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
> $ qemu-io -c 'w P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2

s/ P1/ -P1/

> $ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
> -o cluster_size=1M top.qcow2
> $ qemu-img map --output=json -f qcow2 top.qcow2
> [{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true,
> "offset": 27648},
> { "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
> { "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true,
> "offset": 28160},
> { "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data":
> false}]
> $ qemu-img map --output=json --image-opts \
> driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
> image.file.filename=top.qcow2,align=4k
> [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data":
> false}]
>
> Yikes! Adding blkdebug says there is no data in the file at all! Actions
> like 'qemu-img convert' for copying between images would thus behave
> differently on a blkdebug image than they would on the real image, which
> somewhat defeats the purpose of blkdebug being a filter node.
>
> $ ./qemu-io --image-opts \
> driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
> image.file.filename=top.qcow2,align=4k
> qemu-io> r -P1 0 512
> read 512/512 bytes at offset 0
> 512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
> qemu-io> r -P0 512 512
> read 512/512 bytes at offset 512
> 512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)
> qemu-io>
>
> Meanwhile, the data from the backing file is clearly visible when read.
> So the bug must lie somewhere in the get_status operation. Looking
> closer, I see this in bdrv_co_block_status_above():
>
> 2242 for (p = bs; p != base; p = backing_bs(p)) {
>
> When qcow2 is directly opened, this iterates to back.qcow2 and sees that
> there is data in the first cluster, changing the overall status reported
> to the caller. But when the qcow2 is hidden by blkdebug, backing_bs()
> states that blkdebug has no backing image, and terminates the loop early
> with JUST the status of the (empty) top file, rather than properly
> merging in the status from the backing file.
>
> I don't know if the bug lies in backing_bs(), or in the blkdebug driver,
> or in the combination of the two. Maybe it is as simple as fixing
> backing_bs() such that on a filter node bs, it defers to
> backing_bs(bs->file->bs), to make filter nodes behave as if they have
> the same backing file semantics as what they are wrapping.

You mean my "block: Deal with filters" series? (Which replaces the
backing_bs(p) with bdrv_filtered_bs(p) -- "filtered" means both
"normal" R/W filter nodes, and "backing" COW filtering.)

qemu-img map doesn't work even with that series, though, because
bdrv_block_status() only gets the status of the topmost node. And if
that's a blkdebug node, well, you only get your blkdebug info.

That is pretty much working as intended. The real issue is in qemu-img
map, which should just disregard filters. Add a
"bs = bdrv_skip_rw_filters(bs)" right at the top of the loop in
qemu-img.c:get_block_status(), and it works.

Max
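
For readers following along, here is a stand-alone sketch of why
skipping filter nodes fixes the map output. This is not qemu code:
Node, skip_filters() and chain_has_data() are invented here, loosely
modelling bdrv_skip_rw_filters() and the backing-chain walk done for
qemu-img map.

#include <stdbool.h>
#include <stdio.h>

typedef struct Node {
    const char *name;
    bool is_filter;        /* e.g. blkdebug: no data of its own      */
    bool has_data;         /* does this layer allocate the range?    */
    struct Node *file;     /* filter -> the node it filters          */
    struct Node *backing;  /* COW backing file                       */
} Node;

/* Skip any run of filter nodes, as qemu-img map should do. */
static Node *skip_filters(Node *n)
{
    while (n && n->is_filter) {
        n = n->file;
    }
    return n;
}

/* Walk the backing chain and report whether any layer has data. */
static bool chain_has_data(Node *n)
{
    for (n = skip_filters(n); n; n = skip_filters(n->backing)) {
        if (n->has_data) {
            return true;
        }
    }
    return false;
}

int main(void)
{
    Node back = { "back.qcow2", false, true,  NULL, NULL  };
    Node top  = { "top.qcow2",  false, false, NULL, &back };
    Node dbg  = { "blkdebug",   true,  false, &top, NULL  };

    /* Without skip_filters() the walk would stop at the blkdebug node
     * (which has no backing link) and wrongly report "no data". */
    printf("data visible through blkdebug: %s\n",
           chain_has_data(&dbg) ? "yes" : "no");
    return 0;
}

Compiled as-is it prints "yes"; drop the skip_filters() calls and the
walk stops at the blkdebug node, which has no backing link, so it
prints "no" -- the same wrong answer as the qemu-img map output quoted
above.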
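
And on the original structured-read question: the tightened rule Eric
proposes would let a client validate read chunks the same way qemu
already validates block status descriptors. A rough sketch of such a
check follows (illustrative only, not qemu's NBD client code; the
function and parameter names are invented):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Would a content chunk at [chunk_offset, chunk_offset + chunk_length)
 * satisfy the proposed alignment rule for a read request covering
 * [req_offset, req_offset + req_length), given the server's advertised
 * minimum block size?  Chunk boundaries must fall on multiples of
 * min_block_size, except where they coincide with the ends of the
 * request itself (a client that promised to honor block sizes only
 * sends aligned requests, so that exception never fires for it).
 */
static bool read_chunk_is_aligned(uint64_t req_offset, uint64_t req_length,
                                  uint64_t chunk_offset,
                                  uint64_t chunk_length,
                                  uint32_t min_block_size)
{
    uint64_t req_end = req_offset + req_length;
    uint64_t chunk_end = chunk_offset + chunk_length;

    if (chunk_length == 0 ||
        chunk_offset < req_offset || chunk_end > req_end) {
        return false;   /* empty chunk, or chunk leaks outside the request */
    }
    if (chunk_offset != req_offset && chunk_offset % min_block_size) {
        return false;   /* interior start boundary is misaligned */
    }
    if (chunk_end != req_end && chunk_end % min_block_size) {
        return false;   /* interior end boundary is misaligned */
    }
    return true;
}

int main(void)
{
    /* Eric's example: a 1000-byte raw file exported with a 512-byte
     * minimum block size.  The client reads the last sector and the
     * server answers with a 488-byte data chunk plus a 24-byte hole. */
    printf("488-byte data chunk ok:      %d\n",
           read_chunk_is_aligned(512, 512, 512, 488, 512));
    printf("24-byte hole chunk ok:       %d\n",
           read_chunk_is_aligned(512, 512, 1000, 24, 512));
    printf("whole-sector data chunk ok:  %d\n",
           read_chunk_is_aligned(512, 512, 512, 512, 512));
    return 0;
}

With a 512-byte minimum block size, the 488-byte data chunk and the
24-byte hole chunk from the unaligned-file example both fail the check,
while a single 512-byte data chunk covering the whole sector passes.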