On 06.06.23 10:00, Michael Tokarev wrote:
05.06.2023 18:45, Hanna Czenczek wrote:
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.
The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit. As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.
To the guest, this appears as a random I/O error. We should not return
an I/O error to the guest when it issued a perfectly valid request.
Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter). However,
that does not seem exactly great.
Why it is not "exactly great"? To me, it seems to be the best solution.
I'd say we should be able to tolerate "slight" increase over IOV_MAX for
"internal purposes", so to say. We can limit guest-supplied vector to
IOV_MAX, but internally guard against integer overflow only.
That’s a good point that may have been worth investigating.
I considered it not great because I assumed that that
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 was written with intent, i.e.
that the IOV_MAX limit was put in because we just generally assume in
the block layer that it isn’t exceeded. It may have worked out fine
before for one specific protocol driver (file-posix) most of the
time[1], but I think it’s reasonable to assume that code in the block
layer has generally been written under the assumption that vectors will
not exceed the IOV_MAX limit (or otherwise we wouldn’t use that constant
in the block layer). So in addition to file-posix, we’d also need to
inspect other code (like the blkio driver that will submit vectored
requests to an external library) what the implications of exceeding that
limit are in all places.
That is not to say that it might not have been the simpler solution to
agree to exceed the limit internally, but it is to say that the full
implications would need to be investigated first. In contrast, the
solution added here is more complicated in code, but is localized.
[1] It won’t be fine if all buffers are 4k in size and 4k-aligned, which
I admit is unlikely for a request whose offset isn’t aligned, but which
could happen with a partition that isn’t aligned to 4k.
I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
shorter.
This seems to be over-complicated, both of them, no?
I would have preferred to have this discussion while the patch was still
on-list for review (this specific version was for two months, counting
from the first version was three). Do you think it is so complicated
and thus bug-prone that we must revert this series now and try the other
route?
I can agree that perhaps the other route could have been simpler, but
now we already have patches that are reviewed and in master, which solve
the problem. So I won’t spend more time on tackling this issue from
another angle. If you are happy to do so, patches are always welcome.
Hanna