On 01.11.19 14:36, Denis Lunev wrote:
> On 11/1/19 4:09 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 01.11.2019 15:34, Max Reitz wrote:
>>> On 01.11.19 12:20, Max Reitz wrote:
>>>> On 01.11.19 12:16, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 01.11.2019 14:12, Max Reitz wrote:
>>>>>> On 01.11.19 11:28, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 01.11.2019 13:20, Max Reitz wrote:
>>>>>>>> On 01.11.19 11:00, Max Reitz wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This series builds on the previous RFC. The workaround is now applied unconditionally, regardless of AIO mode and filesystem, because we don't know those things for remote filesystems. Furthermore, bdrv_co_get_self_request() has been moved to block/io.c.
>>>>>>>>>
>>>>>>>>> Applying the workaround unconditionally is fine from a performance standpoint, because it should actually be dead code, thanks to patch 1 (the elephant in the room). As far as I know, there is no block driver other than qcow2 (in handle_alloc_space()) that submits zero writes as part of normal I/O, so that they can occur concurrently with other write requests. It still makes sense to take the workaround for file-posix because we can't really prevent other block drivers from submitting zero writes as part of normal I/O in the future.
>>>>>>>>>
>>>>>>>>> Anyway, let's get to the elephant.
>>>>>>>>>
>>>>>>>>> From input by XFS developers (https://bugzilla.redhat.com/show_bug.cgi?id=1765547#c7) it seems clear that c8bb23cbdbe causes fundamental performance problems on XFS with aio=native that cannot be fixed. In other cases, c8bb23cbdbe improves performance, or we wouldn't have it.
>>>>>>>>>
>>>>>>>>> In general, avoiding performance regressions is more important than improving performance, unless the regressions are just a minor corner case or insignificant when compared to the improvement. The XFS regression is no minor corner case, and it isn't insignificant. Laurent Vivier has found performance to decrease by as much as 88 % (on ppc64le, fio in a guest with 4k blocks, iodepth=8: down from 13.9 MB/s to 1662 kB/s).
>>>>>>>>
>>>>>>>> Ah, crap.
>>>>>>>>
>>>>>>>> I wanted to send this series as early today as possible to get as much feedback as possible, so I've only started doing benchmarks now.
>>>>>>>>
>>>>>>>> The obvious
>>>>>>>>
>>>>>>>> $ qemu-img bench -t none -n -w -S 65536 test.qcow2
>>>>>>>>
>>>>>>>> on XFS takes about 6 seconds on master, and about 50 to 80 seconds with c8bb23cbdbe reverted. So now on to guest tests...
>>>>>>>
>>>>>>> Aha, that's very interesting :) What about aio=native, which should be slowed down? Could it be tested like this?
>>>>>>
>>>>>> That is aio=native (-n).
>>>>>>
>>>>>> But so far I don't see any significant difference in guest tests (i.e., fio --rw=write --bs=4k --iodepth=8 --runtime=1m --direct=1 --ioengine=libaio --thread --numjobs=16 --size=2G --time_based), neither with 64 kB nor with 2 MB clusters. (But only on XFS so far; I still have to check ext4.)
>>>>>
>>>>> Hmm, this possibly mostly tests writes to already-allocated clusters. Does fio have an option to behave like qemu-img bench with -S 65536, i.e. to write once into each cluster?
>>>>
>>>> Maybe, but is that a realistic depiction of whether this change is worth it?
>>>> That is why I'm doing the guest test, to see whether it actually has much impact on the guest.
>>>
>>> I've changed the above fio invocation to use --rw=randwrite and added --fallocate=none. The performance went down, but it went down both with and without c8bb23cbdbe.
>>>
>>> So on my XFS system (XFS on LUKS on an SSD), I see:
>>> - with c8bb23cbdbe: 26.0 - 27.9 MB/s
>>> - without c8bb23cbdbe: 25.6 - 27 MB/s
>>>
>>> On my ext4 system (native on an SSD), I see:
>>> - with: 39.4 - 41.5 MB/s
>>> - without: 39.4 - 42.0 MB/s
>>>
>>> So basically no difference for XFS, and really no difference for ext4. (I ran these tests with 2 MB clusters.)
>>
>> Hmm, I don't know. To me it seems obvious that zeroing a 2M cluster is slow, and this is shown by simple tests with qemu-img bench: fallocate is faster than zeroing most of the cluster.
>>
>> So if some guest test doesn't show the difference, this means that a "small write into a new cluster" is effectively a rare case in this test. And this doesn't prove that it's always rare and insignificant.
>>
>> I'm not sure that we have a real-world example that proves the necessity of this optimization, or whether there was some original bug about low performance that was fixed by this optimization. Den, Anton, do we have something about it?
>
> Sorry, I have missed the beginning of the thread.
>
> Which driver is used for the virtual disk, i.e. is cached or non-cached I/O used in QEMU? We use non-cached by default, and this could make a significant difference.
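(As a side note, the cache-mode question can also be checked at the qemu-img bench level by varying the -t cache mode. A minimal sketch, reusing the command from above; -t, -n, -w and -S are as in that command, "writeback" is only an illustrative cached mode, and the cached run drops -n because native AIO needs O_DIRECT:)

# O_DIRECT plus native AIO, as in the runs discussed above:
$ qemu-img bench -t none -n -w -S 65536 test.qcow2

# Page-cached I/O for comparison (no -n, since native AIO needs O_DIRECT):
$ qemu-img bench -t writeback -w -S 65536 test.qcow2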
I'm using no cache; the above tests were done with aio=native. I've sent another response with aio=threads numbers.

> Max,
>
> Can you please share the domain.xml of the guest config and the fio file for the guest? I will recheck to be 120% sure.

I'm running qemu directly as follows:

x86_64-softmmu/qemu-system-x86_64 \
    -serial stdio \
    -cdrom ~/tmp/arch.iso \
    -m 4096 \
    -enable-kvm \
    -drive \
    if=none,id=t,format=qcow2,file=test/test.qcow2,cache=none,aio=native \
    -device virtio-scsi \
    -device scsi-hd,drive=t \
    -net user \
    -net nic,model=rtl8139 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu SandyBridge \
    -nodefaults \
    -nographic

The full FIO command line is:

fio --rw=randwrite --bs=4k --iodepth=8 --runtime=1m --direct=1 \
    --filename=/mnt/foo --name=job1 --ioengine=libaio --thread \
    --group_reporting --numjobs=16 --size=2G --time_based \
    --output=/tmp/fio_result --fallocate=none

Max
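(Regarding Vladimir's earlier question about making fio write once into each cluster, like qemu-img bench with -S 65536: fio's sequential "rw=write:<skip>" syntax should approximate that by leaving a hole after each write, so a 4k write plus a 60k skip advances by one 64 kB cluster per I/O. A minimal sketch, assuming that syntax behaves as documented; the job name and file path are illustrative:)

# Assumed: "rw=write:60k" skips 60k after each 4k write, i.e. one 4k write
# at the start of every 64k cluster.
$ fio --name=alloc-once --filename=/mnt/foo --size=2G \
      --rw=write:60k --bs=4k --direct=1 \
      --ioengine=libaio --iodepth=8 --thread \
      --fallocate=none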