On 2018-08-29 15:24, Max Reitz wrote:
> On 2018-08-28 23:59, Eric Blake wrote:
>> [following up to a different set of emails]
>>
>> On 08/28/2018 03:41 PM, Eric Blake wrote:
>>> Revisiting this:
>>>
>>> On 08/01/2018 09:41 AM, Eric Blake wrote:
>>>> Rich Jones pointed me to questionable behavior in qemu's NBD server
>>>> implementation today: qemu advertises a minimum block size of 512 to
>>>> any client that promises to honor block sizes, but when serving up a
>>>> raw file that is not aligned to a sector boundary, attempting to read
>>>> that final portion of the file results in a structured read with two
>>>> chunks, the first for the data up to the end of the actual file, and
>>>> the second reporting a hole for the rest of the sector. If a client
>>>> is promising to obey block sizes on its requests, it seems odd that
>>>> the server is allowed to send a result that is not also aligned to
>>>> block sizes.
>>>>
>>>> Right now, the NBD spec says that when structured replies are in use,
>>>> then for a structured read:
>>>>
>>>>   The server MAY split the reply into any number of content chunks;
>>>>   each chunk MUST describe at least one byte, although to minimize
>>>>   overhead, the server SHOULD use chunks with lengths and offsets as
>>>>   an integer multiple of 512 bytes, where possible (the first and
>>>>   last chunk of an unaligned read being the most obvious places for
>>>>   an exception).
>>>>
>>>> I'm wondering if we should tighten that to require that the server
>>>> partition the reply chunks to be aligned to the advertised minimum
>>>> block size (at which point, qemu should either advertise 1 instead of
>>>> 512 as the minimum size when serving up an unaligned file, or else
>>>> qemu should just send the final partial sector as a single data chunk
>>>> rather than trying to report the last few bytes as a hole).
>>>>
>>>> For comparison, on block status, we require:
>>>>
>>>>   The server SHOULD use descriptor
>>>>   lengths that are an integer multiple of 512 bytes where possible
>>>>   (the first and last descriptor of an unaligned query being the
>>>>   most obvious places for an exception), and MUST use descriptor
>>>>   lengths that are an integer multiple of any advertised minimum
>>>>   block size.
>>>>
>>>> And qemu as a client currently hangs up on any server that violates
>>>> that requirement on block status (that is, when qemu as the server
>>>> tries to send a block status that was not aligned to the advertised
>>>> block size, qemu as the client flags it as an invalid server - which
>>>> means qemu as server is currently broken). So I'm thinking we should
>>>> copy that requirement onto servers for reads as well.
>>>
>>> Vladimir pointed out that the problem is not necessarily just limited
>>> to the implicit hole at the end of a file that was rounded up to
>>> sector size. Another case where sub-region changes occur in qemu is
>>> where you have a backing file with 512-byte hole granularity (qemu-img
>>> create -f qcow2 -o cluster_size=512 backing.qcow2 100M) and an overlay
>>> with larger granularity (qemu-img create -f qcow2 -b backing.qcow2 -F
>>> qcow2 -o cluster_size=4k active.qcow2). On a cluster where the top
>>> layer defers to the underlying layer, it is possible to alternate
>>> between holes and data at sector boundaries but at subsets of the
>>> cluster boundary of the top layer.
>>> As long as qemu advertises a
>>> minimum block size of 512 rather than the cluster size, then this
>>> isn't a problem, but if qemu were to change to report the qcow2
>>> cluster size as its minimum I/O (rather than merely its preferred I/O,
>>> because it can do read-modify-write on data smaller than a cluster),
>>> this would be another case where unaligned replies might confuse a
>>> client.
>>
>> So, I tried to actually create this scenario, to see what actually
>> happens, and we have a worse bug that needs to be resolved first. That
>> is, bdrv_block_status_above() chokes when there is a blkdebug node in
>> the works:
>>
>> $ qemu-img create -f qcow2 -o cluster_size=512 back.qcow2 100M
>> $ qemu-io -c 'w P1 0 512' -c 'w -P1 1k 512' -f qcow2 back.qcow2
>
> s/ P1/ -P1/
>
>> $ qemu-img create -f qcow2 -F qcow2 -b back.qcow2 \
>>     -o cluster_size=1M top.qcow2
>> $ qemu-img map --output=json -f qcow2 top.qcow2
>> [{ "start": 0, "length": 512, "depth": 1, "zero": false, "data": true,
>> "offset": 27648},
>> { "start": 512, "length": 512, "depth": 1, "zero": true, "data": false},
>> { "start": 1024, "length": 512, "depth": 1, "zero": false, "data": true,
>> "offset": 28160},
>> { "start": 1536, "length": 104856064, "depth": 1, "zero": true, "data":
>> false}]
>> $ qemu-img map --output=json --image-opts \
>>     driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
>> image.file.filename=top.qcow2,align=4k
>> [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data":
>> false}]
>>
>> Yikes! Adding blkdebug says there is no data in the file at all! Actions
>> like 'qemu-img convert' for copying between images would thus behave
>> differently on a blkdebug image than they would on the real image, which
>> somewhat defeats the purpose of blkdebug being a filter node.
>>
>> $ ./qemu-io --image-opts \
>>     driver=blkdebug,image.driver=qcow2,image.file.driver=file,\
>> image.file.filename=top.qcow2,align=4k
>> qemu-io> r -P1 0 512
>> read 512/512 bytes at offset 0
>> 512 bytes, 1 ops; 0.0002 sec (1.782 MiB/sec and 3649.6350 ops/sec)
>> qemu-io> r -P0 512 512
>> read 512/512 bytes at offset 512
>> 512 bytes, 1 ops; 0.0002 sec (2.114 MiB/sec and 4329.0043 ops/sec)
>> qemu-io>
>>
>> Meanwhile, the data from the backing file is clearly visible when read.
>> So the bug must lie somewhere in the get_status operation. Looking
>> closer, I see this in bdrv_co_block_status_above():
>>
>> 2242     for (p = bs; p != base; p = backing_bs(p)) {
>>
>> When qcow2 is directly opened, this iterates to back.qcow2 and sees that
>> there is data in the first cluster, changing the overall status reported
>> to the caller. But when the qcow2 is hidden by blkdebug, backing_bs()
>> states that blkdebug has no backing image, and terminates the loop early
>> with JUST the status of the (empty) top file, rather than properly
>> merging in the status from the backing file.
>>
>> I don't know if the bug lies in backing_bs(), or in the blkdebug driver,
>> or in the combination of the two. Maybe it is as simple as fixing
>> backing_bs() such that on a filter node bs, it defers to
>> backing_bs(bs->file->bs), to make filter nodes behave as if they have
>> the same backing file semantics as what they are wrapping.
>
> You mean my "block: Deal with filters" series?
>
> (Which replaces the backing_bs(p) with bdrv_filtered_bs(p) -- "filtered"
> means both "normal" R/W filter nodes, and "backing" COW filtering.)
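To make the traversal idea concrete, here is a rough, untested sketch of
what such a helper could look like; it is an illustration of the idea
only, not the code from that series (the real bdrv_filtered_bs() may
well differ in name, placement, and details):

    static BlockDriverState *bdrv_filtered_bs(BlockDriverState *bs)
    {
        /* A filter node (blkdebug, throttle, ...) has no backing child
         * of its own; step through to the node it wraps, so that the
         * backing chain underneath it stays visible to callers walking
         * the chain. */
        if (bs->drv && bs->drv->is_filter && bs->file) {
            return bs->file->bs;
        }
        /* Normal COW case: the backing file, or NULL at the end. */
        return backing_bs(bs);
    }

With something like that in place, the loop in
bdrv_co_block_status_above() becomes
"for (p = bs; p != base; p = bdrv_filtered_bs(p))" and no longer
terminates early at the blkdebug node.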
>
> qemu-img map doesn't work even with that series, though, because
> bdrv_block_status() only gets the status of the topmost node. And if
> that's a blkdebug node, well, you only get your blkdebug info. That is
> pretty much working as intended.
>
> The real issue is in qemu-img map, which should just disregard filters.
> Add a "bs = bdrv_skip_rw_filters(bs)" right at the top of the loop in
> qemu-img.c:get_block_status(), and it works.
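Concretely, that one-line fix would look roughly like this; again a
sketch on top of the unmerged series, since bdrv_skip_rw_filters() does
not exist in master yet, and the surrounding loop body is condensed:

    for (;;) {
        /* See through R/W filters such as blkdebug before querying, so
         * that map reports the status of the node the filter wraps
         * rather than the (apparently empty) filter itself. */
        bs = bdrv_skip_rw_filters(bs);

        ret = bdrv_block_status(bs, offset, bytes, &bytes, &map, &file);
        if (ret < 0) {
            return ret;
        }
        ...
    }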
By the way, iotest 204 tests this behavior already.  It issues a
qemu-img map right at the end, onto a blkdebug'ed qcow2 file with a
backing image.  The allocation in the backing image is currently
missing from the output, which is wrong.
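(For anyone who wants to see that for themselves, the test runs with
the usual iotests harness from a build tree; the invocation below
assumes the standard source layout:)

    $ cd tests/qemu-iotests
    $ ./check -qcow2 204

Max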