On 2018-08-29 15:24, Max Reitz wrote:
> On 2018-08-28 23:59, Eric Blake wrote:
>> [following up to a different set of emails]
>>
>> On 08/28/2018 03:41 PM, Eric Blake wrote:
>>> Revisiting this:
>>>
>>> On 08/01/2018 09:41 AM, Eric Blake wrote:
>>>> Rich Jones pointed me to questionable behavior in qemu's NBD server
>>>> implementation today: qemu advertises a minimum block size of 512 to
>>>> any client that promises to honor block sizes, but when serving up a
>>>> raw file that is not aligned to a sector boundary, attempting to read
>>>> that final portion of the file results in a structured read with two
>>>> chunks, the first for the data up to the end of the actual file, and
>>>> the second reporting a hole for the rest of the sector. If a client
>>>> is promising to obey block sizes on its requests, it seems odd that
>>>> the server is allowed to send a result that is not also aligned to
>>>> block sizes.
>>>>
>>>> Right now, the NBD spec says that when structured replies are in use,
>>>> then for a structured read:
>>>>
>>>>   The server MAY split the reply into any number of content chunks;
>>>>   each chunk MUST describe at least one byte, although to minimize
>>>>   overhead, the server SHOULD use chunks with lengths and offsets as
>>>>   an integer multiple of 512 bytes, where possible (the first and
>>>>   last chunk of an unaligned read being the most obvious places for
>>>>   an exception).
>>>>
>>>> I'm wondering if we should tighten that to require that the server
>>>> partition the reply chunks to be aligned to the advertised minimum
>>>> block size (at which point, qemu should either advertise 1 instead of
>>>> 512 as the minimum size when serving up an unaligned file, or else
>>>> qemu should just send the final partial sector as a single data chunk
>>>> rather than trying to report the last few bytes as a hole).
>>>>
>>>> For comparison, on block status, we require:
>>>>
>>>>   The server SHOULD use descriptor
>>>>   lengths that are an integer multiple of 512 bytes where possible
>>>>   (the first and last descriptor of an unaligned query being the
>>>>   most obvious places for an exception), and MUST use descriptor
>>>>   lengths that are an integer multiple of any advertised minimum
>>>>   block size.
>>>>
>>>> And qemu as a client currently hangs up on any server that violates
>>>> that requirement on block status (that is, when qemu as the server
>>>> tries to send a block status that was not aligned to the advertised
>>>> block size, qemu as the client flags it as an invalid server - which
>>>> means qemu as server is currently broken). So I'm thinking we should
>>>> copy that requirement onto servers for reads as well.
>>>
>>> Vladimir pointed out that the problem is not necessarily just limited
>>> to the implicit hole at the end of a file that was rounded up to
>>> sector size. Another case where sub-region changes occur in qemu is
>>> where you have a backing file with 512-byte hole granularity (qemu-img
>>> create -f qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay
>>> with larger granularity (qemu-img create -f qcow2 -b backing.qcow2 -F
>>> qcow2 -o cluster_size=4k active.qcow2). On a cluster where the top
>>> layer defers to the underlying layer, it is possible to alternate
>>> between holes and data at sector boundaries but at subsets of the
>>> cluster boundary of the top layer.
>>> As long as qemu advertises a
>>> minimum block size of 512 rather than the cluster size, then this
>>> isn't a problem, but if qemu were to change to report the qcow2
>>> cluster size as its minimum I/O (rather than merely its preferred I/O,
>>> because it can do read-modify-write on data smaller than a cluster),
>>> this would be another case where unaligned replies might confuse a
>>> client.
>>
>> So, I tried to actually create this scenario, to see what actually
>> happens, and we have a worse bug that needs to be resolved first. That
>> is, bdrv_block_status_above() chokes when there is a blkdebug node in
>> the works:
>>
>> $ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
>> $ qemu-io -c 'w P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2
>
> s/ P1/ -P1/
>
>> $ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
>>     -o cluster_size=1M top.qcow2
>> $ qemu-img map --output=json -f qcow2 top.qcow2
>> [{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true,
>> "offset": 27648},
>> { "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
>> { "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true,
>> "offset": 28160},
>> { "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data":
>> false}]
>> $ qemu-img map --output=json --image-opts \
>>     driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
>> image.file.filename=top.qcow2,align=4k
>> [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data":
>> false}]
>>
>> Yikes! Adding blkdebug says there is no data in the file at all! Actions
>> like 'qemu-img convert' for copying between images would thus behave
>> differently on a blkdebug image than they would on the real image, which
>> somewhat defeats the purpose of blkdebug being a filter node.
>>
>> $ ./qemu-io --image-opts \
>>     driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
>> image.file.filename=top.qcow2,align=4k
>> qemu-io> r -P1 0 512
>> read 512/512 bytes at offset 0
>> 512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
>> qemu-io> r -P0 512 512
>> read 512/512 bytes at offset 512
>> 512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)
>> qemu-io>
>>
>> Meanwhile, the data from the backing file is clearly visible when read.
>> So the bug must lie somewhere in the get_status operation. Looking
>> closer, I see this in bdrv_co_block_status_above():
>>
>> 2242     for (p = bs; p != base; p = backing_bs(p)) {
>>
>> When qcow2 is directly opened, this iterates to back.qcow2 and sees that
>> there is data in the first cluster, changing the overall status reported
>> to the caller. But when the qcow2 is hidden by blkdebug, backing_bs()
>> states that blkdebug has no backing image, and terminates the loop early
>> with JUST the status of the (empty) top file, rather than properly
>> merging in the status from the backing file.
>>
>> I don't know if the bug lies in backing_bs(), or in the blkdebug driver,
>> or in the combination of the two. Maybe it is as simple as fixing
>> backing_bs() such that on a filter node bs, it defers to
>> backing_bs(bs->file->bs), to make filter nodes behave as if they have
>> the same backing file semantics as what they are wrapping.
>
> You mean my "block: Deal with filters" series?
>
> (Which replaces the backing_bs(p) with bdrv_filtered_bs(p) -- "filtered"
> means both "normal" R/W filter nodes, and "backing" COW filtering.)
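To make the traversal idea concrete, here is a rough, untested sketch of
what such a helper could look like; it is an illustration of the idea
only, not the code from that series (the real bdrv_filtered_bs() may
well differ in name, placement, and details):

    static BlockDriverState *bdrv_filtered_bs(BlockDriverState *bs)
    {
        /* A filter node (blkdebug, throttle, ...) has no backing child
         * of its own; step through to the node it wraps, so that the
         * backing chain underneath it stays visible to callers walking
         * the chain. */
        if (bs->drv && bs->drv->is_filter && bs->file) {
            return bs->file->bs;
        }
        /* Normal COW case: the backing file, or NULL at the end. */
        return backing_bs(bs);
    }

With something like that in place, the loop in
bdrv_co_block_status_above() becomes
"for (p = bs; p != base; p = bdrv_filtered_bs(p))" and no longer
terminates early at the blkdebug node.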
>
> qemu-img map doesn't work even with that series, though, because
> bdrv_block_status() only gets the status of the topmost node. And if
> that's a blkdebug node, well, you only get your blkdebug info. That is
> pretty much working as intended.
>
> The real issue is in qemu-img map, which should just disregard filters.
> Add a "bs = bdrv_skip_rw_filters(bs)" right at the top of the loop in
> qemu-img.c:get_block_status(), and it works.
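Concretely, that one-line fix would look roughly like this; again a
sketch on top of the unmerged series, since bdrv_skip_rw_filters() does
not exist in master yet, and the surrounding loop body is condensed:

    for (;;) {
        /* See through R/W filters such as blkdebug before querying, so
         * that map reports the status of the node the filter wraps
         * rather than the (apparently empty) filter itself. */
        bs = bdrv_skip_rw_filters(bs);

        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
        if (ret < 0) {
            return ret;
        }
        ...
    }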
By the way, iotest 204 tests this behavior already.  It issues a
qemu-img map right at the end, onto a blkdebug'ed qcow2 file with a
backing image.  The allocation in the backing image is currently
missing from the output, which is wrong.
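(For anyone who wants to see that for themselves, the test runs with
the usual iotests harness from a build tree; the invocation below
assumes the standard source layout:)

    $ cd tests/qemu-iotests
    $ ./check -qcow2 204

Max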