Re: [Qemu-devel] [PULL 5/7] file-posix: Support BDRV_REQ_NO_FALLBACK for zero writes

Nir Soffer Sat, 17 Aug 2019 10:46:08 -0700

On Thu, Aug 15, 2019 at 1:29 PM Kevin Wolf <kw...@redhat.com> wrote:

> On 15.08.2019 at 04:44, Eric Blake wrote:
> > On 3/26/19 10:51 AM, Kevin Wolf wrote:
> > > We know that the kernel implements a slow fallback code path for
> > > BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> > > The other operations we call in the context of .bdrv_co_pwrite_zeroes
> > > should usually be quick, so no modification should be needed for them.
> > > If we ever notice that there are additional problematic cases, we can
> > > still make these conditional as well.
> >
> > Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
> > writes?  It may be fast on some file systems, but when used on a block
> > device, that may equally trigger slow fallbacks.  The man page is not
> > clear on that fact; I suspect that there may be cases in there that need
> > to be made conditional (it would be awesome if the kernel folks would
> > give us another FALLOC_ flag when we want to guarantee no fallback).
>
> The NO_FALLBACK changes were based on the Linux code rather than
> documentation because no interface is explicitly documented to forbid
> fallbacks.
>
> I think for file systems, we can generally assume that we don't get
> fallbacks because for file systems, just deallocating blocks is the
> easiest way to implement the function anyway. (Hm, or is it when we
> don't punch holes...?)
>
> And for block devices, we don't try FALLOC_FL_ZERO_RANGE because it also
> involves the same slow fallback as BLKZEROOUT. In other words,
> bdrv_co_pwrite_zeroes() with NO_FALLBACK, but without MAY_UNMAP, always
> fails on Linux block devices, and we fall back to emulation in user
> space.
>
> We would need a kernel interface that calls blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_NOFALLBACK, but no such interface
> exists.
>
> When I talked to some file system people, they insisted that "efficient"
> or "fast" wasn't well-defined enough for them or something, so if we
> want to get a kernel change, maybe a new block device ioctl would be the
> most realistic thing.
>
> We do use FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE for MAY_UNMAP,
> which works for both file systems (I assume - each file system has a
> separate implementation) and block devices without slow fallbacks.
>
> qemu-img create sets MAY_UNMAP, so the case we are most interested in is
> covered with a fast implementation.
>
> > By the way, is there an easy setup to prove (maybe some qemu-img convert
> > command on a specially-prepared source image) whether the no fallback
> > flag makes a difference?  I'm about to cross-post a series of patches to
> > nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
> > the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
> > benchmark numbers in my cover letter if I can reproduce a setup where it
> > matters.
>
> Hm, the original case came from Nir, maybe he can suggest something.
>


The original case came from RHEL 7.{5,6}. The flow was:

qemu-img convert -> nbdkit rhv plugin -> imageio -> storage

nbdkit got an NBD_CMD_WRITE_ZEROES request and converted it to an imageio
ZERO request.

For block devices, imageio was trying:
1. fallocate(ZERO_RANGE) - fails
2. ioctl(BLKZEROOUT) - succeeds

See
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L247
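
The two-step fallback above can be sketched in Python roughly as follows.
This is a minimal illustration, not the imageio code: the function names,
the `methods` parameter and the set of errno values treated as
"unsupported" are assumptions made for this sketch. The constants come
from <linux/falloc.h> and <linux/fs.h>, and the ctypes prototype assumes
64-bit Linux.

```python
import ctypes
import errno
import fcntl
import os
import struct

FALLOC_FL_ZERO_RANGE = 0x10  # <linux/falloc.h>
BLKZEROOUT = 0x127F          # _IO(0x12, 127) in <linux/fs.h>

# Assumption: these errors mean "method not available here, try the next".
UNSUPPORTED = (errno.ENODEV, errno.EOPNOTSUPP, errno.ENOTTY)


def fallocate_zero_range(fd, offset, length):
    """Call fallocate(2) with FALLOC_FL_ZERO_RANGE through libc."""
    libc = ctypes.CDLL(None, use_errno=True)
    # off_t is assumed to be 64 bits wide (64-bit Linux).
    libc.fallocate.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    if libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def blkzeroout(fd, offset, length):
    """Issue the BLKZEROOUT ioctl; the argument is uint64_t range[2]."""
    fcntl.ioctl(fd, BLKZEROOUT, struct.pack("QQ", offset, length))


def zero(fd, offset, length, methods=(fallocate_zero_range, blkzeroout)):
    """Try each method in order; return the name of the one that worked."""
    for method in methods:
        try:
            method(fd, offset, length)
        except OSError as e:
            if e.errno not in UNSUPPORTED:
                raise  # a real I/O error, not a missing feature
        else:
            return method.__name__
    raise OSError(errno.EOPNOTSUPP, "no zeroing method available")
```

Note that nothing in this flow can tell whether BLKZEROOUT will be fast or
slow; it only learns whether the call is supported.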

BLKZEROOUT can be fast (100 GiB/s) or slow (100 MiB/s), depending on the
server and on the allocation status of that area.

On our current storage (3PAR), if the device is fully allocated, for
example:

   dd if=/dev/zero bs=8M of=/dev/vg/lv

Then blkdiscard -z is slow (800 MiB/s).

But if you discard the device:

    blkdiscard /dev/vg/lv

blkdiscard -z becomes fast (100 GiB/s).
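
To turn the comparison above into numbers for a cover letter, a small
timing wrapper could look like the sketch below. The helper names are
hypothetical and the length has to be supplied by the caller; it only
uses the `-z` and `-l` options of blkdiscard. It destroys the data on
the device, so it should only be pointed at a scratch LV.

```python
import subprocess
import time

GiB = 1024**3


def rate_gib_per_s(nbytes, seconds):
    """Convert a byte count and a duration into GiB/s."""
    return nbytes / seconds / GiB


def time_command(argv):
    """Run a command and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start


def measure_zeroout(device, length):
    """Time `blkdiscard -z` on the first `length` bytes of `device`.

    WARNING: this zeroes the range, destroying whatever was there.
    """
    elapsed = time_command(["blkdiscard", "-z", "-l", str(length), device])
    return rate_gib_per_s(length, elapsed)
```

Running measure_zeroout() once after the dd and once after a plain
blkdiscard would give the two rates to compare.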

Previously we had XtremIO storage, which was able to zero 50 GiB/s
regardless of the allocation.

> You'll definitely need a block device that doesn't support
> FALLOC_FL_PUNCH_HOLE,


Old kernels (CentOS 7) did not support this.

# uname -r
3.10.0-957.21.3.el7.x86_64

# strace -e trace=fallocate fallocate -l 100m /dev/loop0
fallocate(3, 0, 0, 104857600)           = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -p -l 100m /dev/loop0
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -z -l 100m /dev/loop0
fallocate(3, FALLOC_FL_ZERO_RANGE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++
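
The same three checks can be done from a program instead of strace. The
sketch below is an illustration under stated assumptions: probe() and
libc_fallocate() are hypothetical helper names, the mode constants are
from <linux/falloc.h>, and the ctypes prototype assumes 64-bit Linux.
On a kernel that does support these modes for block devices, the punch
hole and zero range calls discard or zero the range, so use a scratch
device.

```python
import ctypes
import errno
import os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

# The three modes exercised in the strace session above.
MODES = {
    "allocate": 0,
    "punch_hole": FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
    "zero_range": FALLOC_FL_ZERO_RANGE,
}


def libc_fallocate(fd, mode, offset, length):
    """Call fallocate(2) through libc, raising OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    # off_t is assumed to be 64 bits wide (64-bit Linux).
    libc.fallocate.argtypes = [
        ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def probe(fd, length, fallocate=libc_fallocate):
    """Return {mode name: "ok" or errno name} for each mode in MODES."""
    results = {}
    for name, mode in MODES.items():
        try:
            fallocate(fd, mode, 0, length)
            results[name] = "ok"
        except OSError as e:
            results[name] = errno.errorcode.get(e.errno, str(e.errno))
    return results
```

Called with a descriptor for /dev/loop0 opened read-write and a length of
100 MiB, this makes the same three fallocate calls as the session above.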

> otherwise you can't trigger the fallback. My
> first thought was a loop device, but this actually does support the
> operation and passes it through to the underlying file system. So maybe
> if you know a file system that doesn't support it. Or if you have an old
> hard disk handy.

...

Nir
