On Thu, Aug 15, 2019 at 1:29 PM Kevin Wolf <kw...@redhat.com> wrote:

> On 15.08.2019 at 04:44, Eric Blake wrote:
> > On 3/26/19 10:51 AM, Kevin Wolf wrote:
> > > We know that the kernel implements a slow fallback code path for
> > > BLKZEROOUT, so if BDRV_REQ_NO_FALLBACK is given, we shouldn't call it.
> > > The other operations we call in the context of .bdrv_co_pwrite_zeroes
> > > should usually be quick, so no modification should be needed for them.
> > > If we ever notice that there are additional problematic cases, we can
> > > still make these conditional as well.
> >
> > Are there cases where fallocate(FALLOC_FL_ZERO_RANGE) falls back to slow
> > writes? It may be fast on some file systems, but when used on a block
> > device, that may equally trigger slow fallbacks. The man page is not
> > clear on that fact; I suspect that there may be cases in there that need
> > to be made conditional (it would be awesome if the kernel folks would
> > give us another FALLOC_ flag when we want to guarantee no fallback).
>
> The NO_FALLBACK changes were based on the Linux code rather than
> documentation because no interface is explicitly documented to forbid
> fallbacks.
>
> I think for file systems, we can generally assume that we don't get
> fallbacks because for file systems, just deallocating blocks is the
> easiest way to implement the function anyway. (Hm, or is it when we
> don't punch holes...?)
>
> And for block devices, we don't try FALLOC_FL_ZERO_RANGE because it also
> involves the same slow fallback as BLKZEROOUT. In other words,
> bdrv_co_pwrite_zeroes() with NO_FALLBACK, but without MAY_UNMAP, always
> fails on Linux block devices, and we fall back to emulation in user
> space.
>
> We would need a kernel interface that calls blkdev_issue_zeroout() with
> BLKDEV_ZERO_NOUNMAP | BLKDEV_ZERO_NOFALLBACK, but no such interface
> exists.
>
> When I talked to some file system people, they insisted that "efficient"
> or "fast" wasn't well-defined enough for them or something, so if we
> want to get a kernel change, maybe a new block device ioctl would be the
> most realistic thing.
>
> We do use FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE for MAY_UNMAP,
> which works for both file systems (I assume - each file system has a
> separate implementation) and block devices without slow fallbacks.
>
> qemu-img create sets MAY_UNMAP, so the case we are most interested in is
> covered with a fast implementation.
>
> > By the way, is there an easy setup to prove (maybe some qemu-img convert
> > command on a specially-prepared source image) whether the no fallback
> > flag makes a difference? I'm about to cross-post a series of patches to
> > nbd/qemu/nbdkit/libnbd that adds a new NBD_CMD_FLAG_FAST_ZERO which fits
> > the bill of BDRV_REQ_NO_FALLBACK, but would like to include some
> > benchmark numbers in my cover letter if I can reproduce a setup where it
> > matters.
>
> Hm, the original case came from Nir, maybe he can suggest something.
>
The original case came from RHEL 7.{5,6}. The flow was:

    qemu-img convert -> nbdkit rhv plugin -> imageio -> storage

nbdkit got an NBD_CMD_WRITE_ZEROES request and converted it to an imageio
ZERO request.

For block devices, imageio was trying:

1. fallocate(ZERO_RANGE) - fails
2. ioctl(BLKZEROOUT) - succeeds

See
https://github.com/oVirt/ovirt-imageio/blob/ca70170886b0c1fbeca8640b12bcf54f01a3fea0/common/ovirt_imageio_common/backends/file.py#L247

BLKZEROOUT can be fast (100 GiB/s) or slow (100 MiB/s) depending on the
server, and on the allocation status of that area.

On our current storage (3PAR), if the device is fully allocated, for
example:

    dd if=/dev/zero bs=8M of=/dev/vg/lv

Then blkdiscard -z is slow (800 MiB/s).

But if you discard the device:

    blkdiscard /dev/vg/lv

blkdiscard -z becomes fast (100 GiB/s).

Previously we had XtremIO storage, which was able to zero 50 GiB/s
regardless of the allocation.

> You'll definitely need a block device that doesn't support
> FALLOC_FL_PUNCH_HOLE,

Old kernels (CentOS 7) did not support this.

# uname -r
3.10.0-957.21.3.el7.x86_64

# strace -e trace=fallocate fallocate -l 100m /dev/loop0
fallocate(3, 0, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -p -l 100m /dev/loop0
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

# strace -e trace=fallocate fallocate -z -l 100m /dev/loop0
fallocate(3, FALLOC_FL_ZERO_RANGE, 0, 104857600) = -1 ENODEV (No such device)
fallocate: fallocate failed: No such device
+++ exited with 1 +++

> otherwise you can't trigger the fallback. My
> first thought was a loop device, but this actually does support the
> operation and passes it through to the underlying file system. So maybe
> if you know a file system that doesn't support it. Or if you have an old
> hard disk handy.
...

Nir
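
P.S. To make the two-step flow above concrete, here is a rough Python sketch of "try fallocate(ZERO_RANGE), then fall back to ioctl(BLKZEROOUT)". It is not the imageio code linked above. The function names, the UNSUPPORTED errno set, and the assumption of 64-bit Linux (for the ctypes fallocate signature) are all mine, for illustration only.

```python
import ctypes
import errno
import fcntl
import os
import struct

# Constants from <linux/falloc.h> and <linux/fs.h>.
FALLOC_FL_ZERO_RANGE = 0x10
BLKZEROOUT = 0x127F  # _IO(0x12, 127)

# Assumption: these errnos mean "method not available on this fd",
# so the next method is tried. Any other error is propagated.
UNSUPPORTED = (errno.ENODEV, errno.EOPNOTSUPP, errno.ENOTTY)


def fallocate_zero_range(fd, offset, length):
    """Zero a range with fallocate(FALLOC_FL_ZERO_RANGE) (64-bit Linux)."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_longlong, ctypes.c_longlong]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def blkzeroout(fd, offset, length):
    """Zero a range with ioctl(BLKZEROOUT); the argument is uint64_t[2]."""
    fcntl.ioctl(fd, BLKZEROOUT, struct.pack("=QQ", offset, length))


DEFAULT_METHODS = (
    ("fallocate", fallocate_zero_range),
    ("blkzeroout", blkzeroout),
)


def zero_range(fd, offset, length, methods=DEFAULT_METHODS):
    """Try each zeroing method in order; return the name of the one that worked."""
    for name, method in methods:
        try:
            method(fd, offset, length)
        except OSError as e:
            if e.errno not in UNSUPPORTED:
                raise
            continue  # not supported here, try the next method
        return name
    raise OSError(errno.EOPNOTSUPP, "no supported zeroing method")
```

A caller that must not hit the slow path (the BDRV_REQ_NO_FALLBACK case discussed above) would pass a methods list without the BLKZEROOUT entry and handle the resulting EOPNOTSUPP itself.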