On Thu, Mar 25, 2021 at 8:07 AM ChangLimin <chan...@chinatelecom.cn> wrote:
> >On Wed, Mar 24, 2021 at 4:52 PM Max Reitz <mre...@redhat.com> wrote: > >On 22.03.21 10:25, ChangLimin wrote: > >> For Linux 5.10/5.11, qemu write zeros to a multipath device using > >> ioctl(fd, BLKZEROOUT, range) with cache none or directsync return -EBUSY > >> permanently. > > > >So as far as I can track back the discussion, Kevin asked on v1 why we’d > >set has_write_zeroes to false, i.e. whether the EBUSY might not go away > >at some point, and if it did, whether we shouldn’t retry BLKZEROOUT then. > >You haven’t explicitly replied to that question (as far as I can see), > >so it kind of still stands. > > > >Implicitly, there are two conflicting answers in this patch: On one > >hand, the commit message says “permanently”, and this is what you told > >Nir as a realistic case where this can occur. > > For Linux 5.10/5.11, the EBUSY is permanently, the reproduce step is > below. > For other Linux version, the EBUSY may be temporary. > Because Linux 5.10/5.11 is not used widely, so do not set has_write_zeroes > to false. > > >I'm afraid ChangLimin did not answer my question. I'm looking for real > >world used case when qemu cannot write zeros to multipath device, when > >nobody else is using the device. > > > >I tried to reproduce this on Fedora (kernel 5.10) with qemu-img convert, > >once with a multipath device, and once with logical volume on a vg created > >on the multipath device, and I could not reproduce this issue. > > The following is steps to reproduct the issue on Fedora 34. > > # uname -a > Linux fedora-34 5.11.3-300.fc34.x86_64 #1 SMP Thu Mar 4 19:03:18 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Is this the most recent kernel? I have 5.11.7 in fedora 32. > > # qemu-img -V > qemu-img version 5.2.0 (qemu-5.2.0-5.fc34.1) > > 1. Login in an ISCSI LUN created using targetcli on ubuntu 20.04 > # iscsiadm -m discovery -t st -p 192.169.1.109 > 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > > # iscsiadm -m node -l -T iqn.2003-01.org.linux-iscsi:lio-lv100 > # iscsiadm -m session > tcp: [1] 192.169.1.109:3260,1 iqn.2003-01.org.linux-iscsi:lio-lv100 > (non-flash) > > 2. start multipathd service > # mpathconf --enable > # systemctl start multipathd > > 3. add multipath path > # multipath -a `/lib/udev/scsi_id -g /dev/sdb` # sdb means the ISCSI LUN > wwid '36001405b76856e4816b48b99c6a77de3' added > > # multipathd add path /dev/sdb > ok > > # multipath -ll # /dev/dm-1 is the multipath device based on /dev/sdb > mpatha (36001405bebfc3a0522541cda30220db9) dm-1 LIO-ORG,lv102 > size=1.0G features='0' hwhandler='1 alua' wp=rw > `-+- policy='service-time 0' prio=50 status=active > `- 5:0:0:0 sdd 8:48 active ready running > You are using user_friendly_names which is (sadly) the default. But I don't think it should matter. 4. qemu-img return EBUSY both to dm-1 and sdb > # wget http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img > # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/dm-1 > qemu-img: error while writing at byte 0: Device or resource busy > > # qemu-img convert -O raw -t none cirros-0.4.0-x86_64-disk.img /dev/sdb > qemu-img: error while writing at byte 0: Device or resource busy > > 5. blkdiscard also return EBUSY both to dm-1 and sdb > # blkdiscard -o 0 -l 4096 /dev/dm-1 > blkdiscard: cannot open /dev/dm-1: Device or resource busy > > # blkdiscard -o 0 -l 4096 /dev/sdb > blkdiscard: cannot open /dev/sdb: No such file or directory > > 6. dd write zero is good, because it does not use blkdiscard > # dd if=/dev/zero of=/dev/dm-1 bs=1M count=100 oflag=direct > 100+0 records in > 100+0 records out > 104857600 bytes (105 MB, 100 MiB) copied, 2.33623 s, 44.9 MB/s > > 7. The LUN should support blkdiscard feature, otherwise it will not write > zero > with ioctl(fd, BLKZEROOUT, range) > Thanks! I could not reproduce this with kernel 5.10, but now I'm no 5.11: # uname -r 5.11.7-100.fc32.x86_64 # qemu-img --version qemu-img version 5.2.0 (qemu-5.2.0-6.fc32.1) Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers # cat /etc/multipath.conf defaults { user_friendly_names no find_multipaths no } blacklist_exceptions { property "(SCSI_IDENT_|ID_WWN)" } blacklist { } # multipath -ll 36001405e884ab8ff4b44fdba6901099c 36001405e884ab8ff4b44fdba6901099c dm-8 LIO-ORG,3-09 size=6.0G features='0' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=50 status=active `- 1:0:0:9 sdk 8:160 active ready running $ lsblk /dev/sdk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sdk 8:160 0 6G 0 disk └─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath $ virt-builder fedora-32 -o disk.img [ 2.9] Downloading: http://builder.libguestfs.org/fedora-32.xz [ 3.8] Planning how to build this image [ 3.8] Uncompressing [ 11.1] Opening the new disk [ 16.1] Setting a random seed [ 16.1] Setting passwords virt-builder: Setting random password of root to TcikQqRxAaIqS1kF [ 17.0] Finishing off Output file: disk.img Output size: 6.0G Output format: raw Total usable space: 5.4G Free space: 4.0G (74%) $ qemu-img info disk.img image: disk.img file format: raw virtual size: 6 GiB (6442450944 bytes) disk size: 1.29 GiB # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Works. # lsblk /dev/sdk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sdk 8:160 0 6G 0 disk └─36001405e884ab8ff4b44fdba6901099c 253:13 0 6G 0 mpath ├─36001405e884ab8ff4b44fdba6901099c1 253:14 0 1M 0 part ├─36001405e884ab8ff4b44fdba6901099c2 253:15 0 1G 0 part ├─36001405e884ab8ff4b44fdba6901099c3 253:16 0 615M 0 part └─36001405e884ab8ff4b44fdba6901099c4 253:17 0 4.4G 0 part # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Works, wiping the partitions. # mount /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /tmp/mnt # mount | grep /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 /dev/mapper/36001405e884ab8ff4b44fdba6901099c4 on /tmp/mnt type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) This is invalid use, converting to device with mounted file system, but let's try: # qemu-img convert -p -f raw -O raw -t none -T none disk.img /dev/mapper/36001405e884ab8ff4b44fdba6901099c (100.00/100%) Still works?! # wipefs -a /dev/mapper/36001405e884ab8ff4b44fdba6901099c wipefs: error: /dev/mapper/36001405e884ab8ff4b44fdba6901099c: probing initialization failed: Device or resource busy # blkdiscard /dev/mapper/36001405e884ab8ff4b44fdba6901099c Works. This the configuration of the LUN on the server side: { "aio": false, "alua_tpgs": [ { "alua_access_state": 0, "alua_access_status": 0, "alua_access_type": 3, "alua_support_active_nonoptimized": 1, "alua_support_active_optimized": 1, "alua_support_offline": 1, "alua_support_standby": 1, "alua_support_transitioning": 1, "alua_support_unavailable": 1, "alua_write_metadata": 0, "implicit_trans_secs": 0, "name": "default_tg_pt_gp", "nonop_delay_msecs": 100, "preferred": 0, "tg_pt_gp_id": 0, "trans_delay_msecs": 0 } ], "attributes": { "block_size": 512, "emulate_3pc": 1, "emulate_caw": 1, "emulate_dpo": 1, "emulate_fua_read": 1, "emulate_fua_write": 1, "emulate_model_alias": 1, "emulate_pr": 1, "emulate_rest_reord": 0, "emulate_tas": 1, "emulate_tpu": 1, "emulate_tpws": 1, "emulate_ua_intlck_ctrl": 0, "emulate_write_cache": 1, "enforce_pr_isids": 1, "force_pr_aptpl": 0, "is_nonrot": 0, "max_unmap_block_desc_count": 1, "max_unmap_lba_count": 8192, "max_write_same_len": 65335, "optimal_sectors": 16384, "pi_prot_format": 0, "pi_prot_type": 0, "pi_prot_verify": 0, "queue_depth": 128, "unmap_granularity": 1, "unmap_granularity_alignment": 0, "unmap_zeroes_data": 0 }, "dev": "/target/3/09", "name": "3-09", "plugin": "fileio", "size": 6442450944, "write_back": true, "wwn": "e884ab8f-f4b4-4fdb-a690-1099c072c86d" }, Maybe this upstream change is not in all downstream 5.11 kernels, or 5.11.7 already includes the fix? Adding Ben, maybe he had more insight on the multipath side. >If I understand the kernel change correctly, this can happen when there is > >a mounted file system on top of the multipath device. I don't think we > have > >a use case when qemu accesses a multipath device when the device is used > >by a file system, but maybe I missed something. > > > >So that to me implies > >that we actually should not retry BLKZEROOUT, because the EBUSY will > >remain, and that condition won’t change while the block device is in use > >by qemu. > > > >On the other hand, in the code, you have decided not to reset > >has_write_zeroes to false, so the implementation will retry. > > > >EBUSY is usually a temporary error, so retrying makes sense. The question > >is if we really can write zeroes manually in this case? > > > >So I don’t quite understand. Should we keep trying BLKZEROOUT or is > >there no chance of it working after it has at one point failed with > >EBUSY? (Are there other cases besides what’s described in this commit > >message where EBUSY might be returned and it is only temporary?) > > > >> Fallback to pwritev instead of exit for -EBUSY error. > >> > >> The issue was introduced in Linux 5.10: > >> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=384d87ef2c954fc58e6c5fd8253e4a1984f5fe02 > >> > >> Fixed in Linux 5.12: > >> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56887cffe946bb0a90c74429fa94d6110a73119d > >> > >> Signed-off-by: ChangLimin <chan...@chinatelecom.cn> > >> --- > >> block/file-posix.c | 8 ++++++-- > >> 1 file changed, 6 insertions(+), 2 deletions(-) > >> > >> diff --git a/block/file-posix.c b/block/file-posix.c > >> index 20e14f8e96..d4054ac9cb 100644 > >> --- a/block/file-posix.c > >> +++ b/block/file-posix.c > >> @@ -1624,8 +1624,12 @@ static ssize_t > >> handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb) > >> } while (errno == EINTR); > >> > >> ret = translate_err(-errno); > >> - if (ret == -ENOTSUP) { > >> - s->has_write_zeroes = false; > >> + switch (ret) { > >> + case -ENOTSUP: > >> + s->has_write_zeroes = false; /* fall through */ > >> + case -EBUSY: /* Linux 5.10/5.11 may return -EBUSY for > multipath > >> devices */ > >> + return -ENOTSUP; > >> + break; > > > >(Not sure why this break is here.) > > > >Max > > > >> } > >> } > >> #endif > >> -- > >> 2.27.0 > >> > >