>On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <chan...@chinatelecom.cn> wrote: >>On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mre...@redhat.com> wrote: >>On 22.03.21 10:25, ChangLimin wrote: >>> For Linux 5.10/5.11, qemu write zeros to a multipath device using >>> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY >>> permanently. >> >>So as far as I can track back the discussion, Kevin asked on v1 why we’d >>set has_write_zeroes to false, i.e. whether the EBUSY might not go away >>at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. >>You haven’t explicitly replied to that question (as far as I can see), >>so it kind of still stands. >> >>Implicitly, there are two conflicting answers in this patch: On one >>hand, the commit message says “permanently”, and this is what you told >>Nir as a realistic case where this can occur. > >For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is below. >For other Linux version, the EBUSY may be temporary. >Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes to >false. > >>I'm afraid ChangLimin did not answer my question. I'm looking for real >>world used case when qemu cannot write zeros to multipath device, when >>nobody else is using the device. >> >>I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, >>once with a multipath device, and once with logical volume on a vg created >>on the multipath device, and I could not reproduce this issue. > >The following is steps to reproduct the issue on Fedora 34. > ># uname -a >Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 >x86_64 x86_64 x86_64 GNU/Linux > >Is this the most recent kernel? I have 5.11.7 in fedora 32. > > ># qemu-img -V >qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > >1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 ># iscsiadm -m discovery -t st -p 192.169.1.109 >192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > ># iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 ># iscsiadm -m session >tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 (non-flash) > >2. start multipathd service ># mpathconf --enable ># systemctl start multipathd > >3. add multipath path ># multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN >wwid '36001405b76856e4816b48b99c6a77de3' added > ># multipathd add path /dev/sdb >ok > ># multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb >mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 >size=1.0G features='0' hwhandler='1 alua' wp=rw >`-+- policy='service-time 0' prio=50 status=active > `- 5:0:0:0 sdd 8:48 active ready running > >You are using user_friendly_names which is (sadly) the default. >But I don't think it should matter. > >4. qemu-img return EBUSY both to dm-1 and sdb ># wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 >qemu-img: error while writing at byte 0: Device or resource busy > ># qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb >qemu-img: error while writing at byte 0: Device or resource busy > >5. blkdiscard also return EBUSY both to dm-1 and sdb ># blkdiscard -o 0 -l 4096 /dev/dm-1 >blkdiscard: cannot open /dev/dm-1: Device or resource busy > ># blkdiscard -o 0 -l 4096 /dev/sdb >blkdiscard: cannot open /dev/sdb: No such file or directory > >6. dd write zero is good, because it does not use blkdiscard ># dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct >100+0 records in >100+0 records out >104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > >7. The LUN should support blkdiscard feature, otherwise it will not write zero >with ioctl(fd, BLKZEROOUT, range) > >Thanks! > >I could not reproduce this with kernel 5.10, but now I'm no 5.11: ># uname -r >5.11.7-100.fc32.x86_64 > ># qemu-img --version >qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) >Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers > ># cat /etc/multipath.conf >defaults { >user_friendly_names no >find_multipaths no >} > >blacklist_exceptions { > property "(SCSI_IDENT_|ID_WWN)" >} > >blacklist { >} > ># multipath -ll 36001405e884ab8ff4b44fdba6901099c >36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 >size=6.0G features='0' hwhandler='1 alua' wp=rw >`-+- policy='service-time 0' prio=50 status=active > `- 1:0:0:9 sdk 8:160 active ready running > >$ lsblk /dev/sdk >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT >sdk 8:160 0 6G 0 disk >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > >$ virt-builder fedora-32 -o disk.img >[ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz >[ 3.8] Planning how to build this image >[ 3.8] Uncompressing >[ 11.1] Opening the new disk >[ 16.1] Setting a random seed >[ 16.1] Setting passwords >virt-builder: Setting random password of root to TcikQqRxAaIqS1kF >[ 17.0] Finishing off > Output file: disk.img > Output size: 6.0G > Output format: raw > Total usable space: 5.4G > Free space: 4.0G (74%) > >$ qemu-img info disk.img >image: disk.img >file format: raw >virtual size: 6 GiB (6442450944 bytes) >disk size: 1.29 GiB > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img >/dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Works. > ># lsblk /dev/sdk >NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT >sdk 8:160 0 6G 0 disk >└─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath > ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part > ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part > ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part > └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img >/dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Works, wiping the partitions. > ># mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt > ># mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 >/dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs >(rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) > >This is invalid use, converting to device with mounted file system, but let's >try: > ># qemu-img convert -p -f raw -O raw -t none -T none disk.img >/dev/mapper/36001405e884ab8ff4b44fdba6901099c > (100.00/100%) > >Still works?! > ># wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c >wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing >initialization failed: Device or resource busy > ># blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c > >Works. > >This the configuration of the LUN on the server side: > > { > > "aio": false, > > "alua_tpgs": [ > > { > > "alua_access_state": 0, > > "alua_access_status": 0, > > "alua_access_type": 3, > > "alua_support_active_nonoptimized": 1, > > "alua_support_active_optimized": 1, > > "alua_support_offline": 1, > > "alua_support_standby": 1, > > "alua_support_transitioning": 1, > > "alua_support_unavailable": 1, > > "alua_write_metadata": 0, > > "implicit_trans_secs": 0, > > "name": "default_tg_pt_gp", > > "nonop_delay_msecs": 100, > > "preferred": 0, > > "tg_pt_gp_id": 0, > > "trans_delay_msecs": 0 > > } > > ], > > "attributes": { > > "block_size": 512, > > "emulate_3pc": 1, > > "emulate_caw": 1, > > "emulate_dpo": 1, > > "emulate_fua_read": 1, > > "emulate_fua_write": 1, > > "emulate_model_alias": 1, > > "emulate_pr": 1, > > "emulate_rest_reord": 0, > > "emulate_tas": 1, > > "emulate_tpu": 1, > > "emulate_tpws": 1, > > "emulate_ua_intlck_ctrl": 0, > > "emulate_write_cache": 1, > > "enforce_pr_isids": 1, > > "force_pr_aptpl": 0, > > "is_nonrot": 0, > > "max_unmap_block_desc_count": 1, > > "max_unmap_lba_count": 8192, > > "max_write_same_len": 65335, > > "optimal_sectors": 16384, > > "pi_prot_format": 0, > > "pi_prot_type": 0, > > "pi_prot_verify": 0, > > "queue_depth": 128, > > "unmap_granularity": 1, > > "unmap_granularity_alignment": 0, > > "unmap_zeroes_data": 0 > > }, > > "dev": "/target/3/09", > > "name": "3-09", > > "plugin": "fileio", > > "size": 6442450944, > > "write_back": true, > > "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" > > }, > > >Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7 >already includes the fix?
Linux 5.10.24/5.11.7 already merged the fix on 2021-03-17 by Greg Kroah-Hartman. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.11.y&id=7e0815797656f029fab2edc309406cddf931b9d8 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=d44c9780ed40db88626c9354868eab72159c7a7f # git describe d44c9780ed40db88626c9354868eab72159c7a7f v5.10.23-184-gd44c9780ed40 # git describe 7e0815797656f029fab2edc309406cddf931b9d8 v5.11.6-178-g7e0815797656 >Adding Ben, maybe he had more insight on the multipath side. > >>If I understand the kernel change correctly, this can happen when there is >>a mounted file system on top of the multipath device. I don't think we have >>a use case when qemu accesses a multipath device when the device is used >>by a file system, but maybe I missed something. >> >>So that to me implies >>that we actually should not retry BLKZEROOUT, because the EBUSY will >>remain, and that condition won’t change while the block device is in use >>by qemu. >> >>On the other hand, in the code, you have decided not to reset >>has_write_zeroes to false, so the implementation will retry. >> >>EBUSY is usually a temporary error, so retrying makes sense. The question >>is if we really can write zeroes manually in this case? >> >>So I don’t quite understand. Should we keep trying BLKZEROOUT or is >>there no chance of it working after it has at one point failed with >>EBUSY? (Are there other cases besides what’s described in this commit >>message where EBUSY might be returned and it is only temporary?) >> >>> Fallback to pwritev instead of exit for -EBUSY error. >>> >>> The issue was introduced in Linux 5.10: >>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 >>> >>> Fixed in Linux 5.12: >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d >>> >>> Signed-off-by: ChangLimin <chan...@chinatelecom.cn> >>> --- >>> block/file-posix.c | 8 ++++++-- >>> 1 file changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/block/file-posix.c b/block/file-posix.c >>> index 20e14f8e96..d4054ac9cb 100644 >>> --- a/block/file-posix.c >>> +++ b/block/file-posix.c >>> @@ -1624,8 +1624,12 @@ static ssize_t >>> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) >>> } while (errno == EINTR); >>> >>> ret = translate_err(-errno); >>> - if (ret == -ENOTSUP) { >>> - s->has_write_zeroes = false; >>> + switch (ret) { >>> + case -ENOTSUP: >>> + s->has_write_zeroes = false; /* fall through */ >>> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for multipath >>> devices */ >>> + return -ENOTSUP; >>> + break; >> >>(Not sure why this break is here.) >> >>Max >> >>> } >>> } >>> #endif >>> -- >>> 2.27.0 >>>