On Tue, Aug 13, 2019 at 2:21 PM Kevin Wolf <kw...@redhat.com> wrote:

> Am 13.08.2019 um 12:45 hat Nir Soffer geschrieben:
> > On Mon, Aug 12, 2019 at 5:23 PM Kevin Wolf <kw...@redhat.com> wrote:
> > > Am 11.08.2019 um 22:50 hat Nir Soffer geschrieben:
> > > > In some cases buf_align or request_alignment cannot be detected:
> > > >
> > > > - With Gluster, buf_align cannot be detected since the actual I/O
> > > >   is done on the Gluster server, and qemu buffer alignment does
> > > >   not matter.
> > >
> > > If it doesn't matter, the best value would be buf_align = 1.
> >
> > Right, if we know that this is gluster.
> >
> > > > - With a local XFS filesystem, buf_align cannot be detected if
> > > >   reading from an unallocated area.
> > >
> > > Here, we actually do need alignment, but it's unknown whether it
> > > would be 512 or 4096 or something else entirely. Failing to align
> > > requests correctly results in I/O errors.
> > >
> > > > - With Gluster backed by XFS, request_alignment cannot be
> > > >   detected if reading from an unallocated area.
> > >
> > > This is like buf_align for XFS: We don't know the right value, but
> > > getting it wrong causes I/O errors.
> > >
> > > > - With NFS, the server does not use direct I/O, so neither
> > > >   buf_align nor request_alignment can be detected.
> > >
> > > This suggests that byte-aligned requests are fine for NFS, i.e.
> > > buf_align = request_alignment = 1 would be optimal in this case.
> >
> > Right, but again we don't know this is NFS.
>
> Yes, I agree. I was just trying to list the optimal settings for each
> case so I could compare them against the actual results the patch
> provides. I'm well aware that we don't know a way to get the optimal
> results for all four cases.
>
> > > > These cases seem to work when the storage sector size is 512
> > > > bytes, because the current code starts checking align=512. If the
> > > > check succeeds because alignment cannot be detected, we use 512.
> > > > But this does not work for storage with a 4k sector size.
> > > >
> > > > Practically the alignment requirements are the same for buffer
> > > > alignment, buffer length, and offset in file. So in case we
> > > > cannot detect buf_align, we can use the request alignment. If we
> > > > cannot detect the request alignment, we can fall back to a safe
> > > > value.
> > >
> > > This makes sense in general.
> > >
> > > What the commit message doesn't explain, but probably should, is
> > > how we determine whether we could successfully detect the request
> > > alignment. The approach taken here is that a detected alignment of
> > > 1 is understood as failure to detect the real alignment.
> >
> > Failing with EINVAL when using 1, and succeeding with another value,
> > is considered a successful detection.
> >
> > We have 3 issues preventing detection:
> > - the filesystem not using direct I/O on the remote server (NFS,
> >   Gluster when network.remote-dio=on)
> > - the probed area being unallocated on XFS or Gluster backed by XFS
> > - a filesystem without a buffer alignment requirement (e.g. Gluster)
>
> I would say case 1 is effectively a subset of case 3 (i.e. it's just
> one specific reason why we don't have a buffer alignment requirement).
>
> > For handling unallocated areas, we can:
> > - always allocate the first block when creating an image (qemu-img
> >   create/convert)
> > - use write() instead of read()
> >
> > In oVirt we went with the second option - when we initialize a file
> > storage domain, we create a special file and do direct writes to
> > this file with lengths of 1, 512, and 4096 bytes. If we detect 512
> > or 4096, we use this value for creating the domain (e.g. for
> > sanlock). If we detect 1, we use the user-provided value (default
> > 512).
>
> Yes, but there's the important difference that oVirt controls the
> image files, whereas QEMU doesn't. Even if qemu-img create made sure
> that we allocate the first block, the user could still pass us an
> image that was created in a different way.
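To make the scheme concrete, here is a rough Python sketch of the
probe-and-fall-back idea. This is hypothetical illustration code, not
QEMU's actual file-posix implementation; the candidate list and the
4096 fallback are assumptions:

```python
SAFE_ALIGNMENT = 4096  # assumed safe fallback; covers 4k-sector storage


def probe_alignment(request_ok, candidates=(1, 512, 1024, 2048, 4096)):
    """Return the smallest candidate alignment for which a direct I/O
    request succeeds. request_ok(align) abstracts a pread(2)/pwrite(2)
    on an O_DIRECT fd issued with the given buffer/length/offset
    alignment; it returns True on success and False on EINVAL."""
    for align in candidates:
        if request_ok(align):
            return align
    return 1  # nothing succeeded; treated as "not detected" below


def effective_alignments(request_align, buf_align):
    """Combine probe results. As discussed above, a detected alignment
    of 1 is understood as failure to detect the real alignment."""
    if request_align == 1:
        request_align = SAFE_ALIGNMENT  # fall back to a safe value
    if buf_align == 1:
        # Reuse the request alignment, since in practice the
        # requirements are the same for buffer, length and offset.
        buf_align = request_align
    return request_align, buf_align
```

Note how a backend that enforces nothing (e.g. NFS) makes the probe
"succeed" with 1, which is exactly the failure signal: the scheme then
pessimizes to a safe value instead of optimal byte alignment.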
> Using write() is actually an interesting thought. Obviously, we can't
> just overwrite the user image. But maybe what we could do is read the
> first block and then try to rewrite it with different alignments.
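For what it's worth, the read-then-rewrite idea might look roughly
like this. This is only a sketch with assumed probe sizes, not QEMU
code; it assumes the image is at least 4096 bytes long and that nobody
else writes to it concurrently (and a real implementation would also
need an aligned buffer for the initial read on an O_DIRECT fd):

```python
import mmap
import os


def probe_by_rewrite(fd):
    """Read the image's first block, then try to write the same bytes
    back with increasing sizes. On an O_DIRECT fd, under-aligned
    writes fail with EINVAL. Returns the detected alignment; 1 means
    it could not be detected (byte-sized writes succeed)."""
    buf = mmap.mmap(-1, 4096)        # anonymous mapping: page-aligned
    buf[:] = os.pread(fd, 4096, 0)   # keep the existing data intact
    for size in (1, 512, 4096):
        try:
            os.pwrite(fd, memoryview(buf)[:size], 0)
        except OSError:              # typically EINVAL
            continue
        return size
    return 1
```

On a buffered (non-O_DIRECT) fd every write succeeds, so the function
returns 1, i.e. "nothing detected", while leaving the data unchanged.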
Yes, this is what we do in ovirt-imageio for file based storage:
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L311

But we have a lot of assumptions there that may not hold for qemu:
- we don't support read-only images
- we assume nobody else is writing to the image imageio uses (enforced
  by oVirt)

So this will not work for qemu-img read-only operations.

> However, this will break down with read-only images, so if we can't
> write, we'd still have to fall back to a safe default.
>
> Also, given the straces we saw, I'm afraid we might trigger gluster
> bugs where writes that failed with EINVAL mess up the internal state
> so that even later aligned requests would fail.

If this happens only when using Gluster with
performance.strict-o-direct = off, we will enforce
performance.strict-o-direct = on in oVirt. Otherwise this is a Gluster
bug and it should be fixed in Gluster.

> > You can see the code here:
> > https://github.com/oVirt/vdsm/blob/4733018f9a719729242738b486906d3b9ed058cd/lib/vdsm/storage/fileSD.py#L838
> >
> > One way we can use in qemu is to create a temporary file:
> >
> >     /path/to/image.tmp9vo8US
> >
> > Delete the file, keeping the fd open, and detect the alignment on
> > this file using write().
>
> This isn't going to fly. We might not have write permission to the
> directory even for read-write images. Worse, we might get passed only
> a file descriptor instead of a path. So whatever we do, we must do it
> with the image file descriptor.
>
> > With this we fix all the cases listed above, but creating new files
> > requires write permission in the directory where the image is, and
> > will not work for some strange setups (e.g. bind-mounted images).
> >
> > One issue with this is that there is no guarantee that the temporary
> > file will be deleted, so the user will have to deal with leftover
> > files.
>
> On Linux, we could use O_TMPFILE for this.
> However, as I mentioned above, we may not even know the directory.

...

Nir
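PS: for completeness, an O_TMPFILE-based probe could be sketched like
this. Hypothetical Python, not QEMU code; it falls back to an assumed
safe value of 4096 whenever O_TMPFILE or O_DIRECT is unavailable (so
it degrades gracefully on non-Linux hosts and on filesystems like
tmpfs that reject O_DIRECT):

```python
import mmap
import os

SAFE_FALLBACK = 4096  # assumed safe default when probing is impossible


def probe_with_tmpfile(dirpath):
    """Open an anonymous temp file in dirpath (it never gets a name,
    so there are no leftover files to clean up) and probe with direct
    writes of 1, 512 and 4096 bytes, like the oVirt code above."""
    flags = (os.O_RDWR
             | getattr(os, "O_TMPFILE", 0)
             | getattr(os, "O_DIRECT", 0))
    try:
        fd = os.open(dirpath, flags)
    except OSError:
        return SAFE_FALLBACK  # no O_TMPFILE/O_DIRECT support here
    try:
        buf = mmap.mmap(-1, 4096)  # anonymous mapping: page-aligned
        for size in (1, 512, 4096):
            try:
                os.pwrite(fd, memoryview(buf)[:size], 0)
            except OSError:        # EINVAL: size not aligned enough
                continue
            # A successful 1-byte direct write means alignment is not
            # enforced here, so nothing can be detected.
            return SAFE_FALLBACK if size == 1 else size
        return SAFE_FALLBACK
    finally:
        os.close(fd)
```

The practical problem Kevin points out remains, of course: this needs
a directory, and we may only have a file descriptor.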