25.10.2019 17:19, Max Reitz wrote:
> On 25.10.19 15:56, Vladimir Sementsov-Ogievskiy wrote:
>> 25.10.2019 16:40, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.10.2019 12:58, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> It seems to me that there is a bug in Linux’s XFS kernel driver, as
>>>> I’ve explained here:
>>>>
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>>>>
>>>> In combination with our commit c8bb23cbdbe32f, this may lead to guest
>>>> data corruption when using qcow2 images on XFS with aio=native.
>>>>
>>>> We can’t wait until the XFS kernel driver is fixed; we should work
>>>> around the problem ourselves.
>>>>
>>>> This is an RFC for three reasons:
>>>> (1) I don’t know whether this is the right way to address the issue.
>>>> (2) Ideally, we should detect whether the XFS kernel driver is fixed
>>>>     and if so stop applying the workaround.  I don’t know how we would
>>>>     go about this, so this series doesn’t do it.  (Hence it’s an RFC.)
>>>> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>>>>     driver access and modify a BdrvTrackedRequest object.
>>>>
>>>> As for how we can address the issue, I see three ways:
>>>> (1) The one presented in this series: On XFS with aio=native, we extend
>>>>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>>>>     operations) to reach until infinity (INT64_MAX in practice), mark
>>>>     them serializing and wait for other conflicting requests.
>>>>
>>>>     Advantages:
>>>>     + Limits the impact to very specific cases
>>>>       (And that means it wouldn’t hurt too much to keep this workaround
>>>>       even when the XFS driver has been fixed)
>>>>     + Works around the bug where it happens, namely in file-posix
>>>>
>>>>     Disadvantages:
>>>>     - A bit complex
>>>>     - A bit of a layering violation (should file-posix have access to
>>>>       tracked requests?)
>>>>
>>>> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>>>>     becomes visible due to that function: I don’t think qcow2 writes
>>>>     zeroes in any other I/O path, and raw images are fixed in size, so
>>>>     post-EOF writes won’t happen.
>>>>
>>>>     Advantages:
>>>>     + Maybe simpler, depending on how difficult it is to handle the
>>>>       layering violation
>>>>     + Also fixes the performance problem of handle_alloc_space() being
>>>>       slow on ppc64+XFS
>>>>
>>>>     Disadvantages:
>>>>     - Huge layering violation, because qcow2 would need to know whether
>>>>       the image is stored on XFS or not
>>>>     - We’d definitely want to skip this workaround when the XFS driver
>>>>       has been fixed, so we need some method to find out whether it has
>>>>
>>>> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>>>>     To my knowledge I’m the only one who has provided any benchmarks
>>>>     for this commit, and even then I was a bit skeptical because it
>>>>     performs well in some cases and badly in others.  I concluded that
>>>>     it’s probably worth it because the “some cases” are more likely to
>>>>     occur.
>>>>
>>>>     Now we have this problem of corruption here (granted, due to a bug
>>>>     in the XFS driver), and another report of massively degraded
>>>>     performance on ppc64
>>>>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>>>>     private BZ; I hate that :-/  The report is about 40 % worse
>>>>     performance for an in-guest fio write benchmark.)
>>>>
>>>>     So I have to ask what the justification for keeping c8bb23cbdbe32f
>>>>     is.  How much does performance increase with it actually?
>>>>     (On non-(ppc64+XFS) machines, obviously.)
>>>>
>>>>     Advantages:
>>>>     + Trivial
>>>>     + No layering violations
>>>>     + We wouldn’t need to keep track of whether the kernel bug has been
>>>>       fixed or not
>>>>     + Fixes the ppc64+XFS performance problem
>>>>
>>>>     Disadvantages:
>>>>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>>>>       levels, whatever that means
>>>>
>>>> So this is the main reason this is an RFC: What should we do?  Is (1)
>>>> really the best choice?
>>>>
>>>> In any case, I’ve run the test case I showed in
>>>> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
>>>> more than ten times with this series applied, and the installation
>>>> succeeded every time.  (Without this series, it fails like every other
>>>> time.)
>>>
>>> Hi!
>>>
>>> First, great thanks for your investigation!
>>>
>>> We need the c8bb23cbdbe3 patch, because we use 1M clusters, and zeroing
>>> 1M takes a significant amount of time.
>>>
>>> I've tested a bit:
>>>
>>> test:
>>> for img in /ssd/test.img /test.img; do
>>>     for cl in 64K 1M; do
>>>         for step in 4K 64K 1M; do
>>>             ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null
>>>             printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step
>>>             ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'
>>>         done
>>>     done
>>> done
>>>
>>> on master:
>>>
>>> /ssd/test.img  cl=64K step=4K   : 0.291
>>> /ssd/test.img  cl=64K step=64K  : 0.813
>>> /ssd/test.img  cl=64K step=1M   : 2.799
>>> /ssd/test.img  cl=1M  step=4K   : 0.217
>>> /ssd/test.img  cl=1M  step=64K  : 0.332
>>> /ssd/test.img  cl=1M  step=1M   : 0.685
>>> /test.img      cl=64K step=4K   : 1.751
>>> /test.img      cl=64K step=64K  : 14.811
>>> /test.img      cl=64K step=1M   : 18.321
>>> /test.img      cl=1M  step=4K   : 0.759
>>> /test.img      cl=1M  step=64K  : 13.574
>>> /test.img      cl=1M  step=1M   : 28.970
>>>
>>> rerun on master:
>>>
>>> /ssd/test.img  cl=64K step=4K   : 0.295
>>> /ssd/test.img  cl=64K step=64K  : 0.803
>>> /ssd/test.img  cl=64K step=1M   : 2.921
>>> /ssd/test.img  cl=1M  step=4K   : 0.233
>>> /ssd/test.img  cl=1M  step=64K  : 0.321
>>> /ssd/test.img  cl=1M  step=1M   : 0.762
>>> /test.img      cl=64K step=4K   : 1.873
>>> /test.img      cl=64K step=64K  : 15.621
>>> /test.img      cl=64K step=1M   : 18.428
>>> /test.img      cl=1M  step=4K   : 0.883
>>> /test.img      cl=1M  step=64K  : 13.484
>>> /test.img      cl=1M  step=1M   : 26.244
>>>
>>> on master + revert of c8bb23cbdbe32f5c326:
>>>
>>> /ssd/test.img  cl=64K step=4K   : 0.395
>>> /ssd/test.img  cl=64K step=64K  : 4.231
>>> /ssd/test.img  cl=64K step=1M   : 5.598
>>> /ssd/test.img  cl=1M  step=4K   : 0.352
>>> /ssd/test.img  cl=1M  step=64K  : 2.519
>>> /ssd/test.img  cl=1M  step=1M   : 38.919
>>> /test.img      cl=64K step=4K   : 1.758
>>> /test.img      cl=64K step=64K  : 9.838
>>> /test.img      cl=64K step=1M   : 13.384
>>> /test.img      cl=1M  step=4K   : 1.849
>>> /test.img      cl=1M  step=64K  : 19.405
>>> /test.img      cl=1M  step=1M   : 157.090
>>>
>>> rerun:
>>>
>>> /ssd/test.img  cl=64K step=4K   : 0.407
>>> /ssd/test.img  cl=64K step=64K  : 3.325
>>> /ssd/test.img  cl=64K step=1M   : 5.641
>>> /ssd/test.img  cl=1M  step=4K   : 0.346
>>> /ssd/test.img  cl=1M  step=64K  : 2.583
>>> /ssd/test.img  cl=1M  step=1M   : 39.692
>>> /test.img      cl=64K step=4K   : 1.727
>>> /test.img      cl=64K step=64K  : 10.058
>>> /test.img      cl=64K step=1M   : 13.441
>>> /test.img      cl=1M  step=4K   : 1.926
>>> /test.img      cl=1M  step=64K  : 19.738
>>> /test.img      cl=1M  step=1M   : 158.268
>>>
>>> So, it's obvious that c8bb23cbdbe32f5c326 is significant for a 1M
>>> cluster size, even on a rotational disk, which means that the previous
>>> assumption about calling handle_alloc_space()
>>> only for SSDs is wrong; we need smarter heuristics..
>>>
>>> So, I'd prefer (1) or (2).
>
> OK.  I wonder whether that problem would go away with Berto’s subcluster
> series, though.

Very possible, I thought about it too.

>
>> About the degradation in some cases: I think the problem is that one
>> (a bit larger) write may be faster than a fast write-zeroes plus a small
>> write, as the latter means an additional write to metadata.  That is
>> expected for small clusters on a rotational disk, but the actual limit
>> depends on the specific disk.  So, I think a possible solution is to
>> sometimes try working with handle_alloc_space() and sometimes without,
>> remember the time and length of each request, and derive a dynamic
>> limit...
>
> Maybe make a decision based on both the ratio of data size to COW area
> length (only invoke handle_alloc_space() under a certain threshold), and
> the absolute COW area length (always invoke it above a certain
> threshold, unless the ratio doesn’t allow it)?
>

Yes, something like this..

without handle_alloc_space: time = time(write aligned up to cluster)
with handle_alloc_space:    time = time(fast zero write) + time(original write)

If we have some statistics on normal-write vs. zero-write timing, we can
just calculate both variants and choose the faster one:

if (predict_zero_write_time(aligned-up request) + predict_write_time(request) <
    predict_write_time(aligned-up request)) {
    use handle_alloc_space()
}

--
Best regards,
Vladimir
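
To make the last idea concrete, here is a minimal, self-contained sketch of
that decision in C.  It is not QEMU code: the predict_*() helpers, the
moving-average bookkeeping, and the numbers in main() are illustrative
assumptions, not existing interfaces.

/* Illustrative sketch only -- none of these names exist in QEMU. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct WriteStats {
    double write_ns_per_byte;   /* running cost estimate of ordinary writes */
    double zero_ns_per_byte;    /* running cost estimate of write-zeroes    */
} WriteStats;

/* Feed a completed request's measured per-byte cost into an exponential
 * moving average; this is where the "statistics on normal-write vs.
 * zero-write timing" would come from. */
static void update_ema(double *ema, double ns_per_byte)
{
    const double alpha = 0.2;   /* smoothing factor, chosen arbitrarily */
    *ema = (*ema == 0.0) ? ns_per_byte
                         : (1.0 - alpha) * *ema + alpha * ns_per_byte;
}

static double predict_write_time(const WriteStats *s, uint64_t bytes)
{
    return s->write_ns_per_byte * (double)bytes;
}

static double predict_zero_write_time(const WriteStats *s, uint64_t bytes)
{
    return s->zero_ns_per_byte * (double)bytes;
}

/* Compare "zero the whole COW area, then write just the data" against
 * "write the data padded up to the whole COW area" and pick the path the
 * statistics predict to be faster. */
static bool should_use_handle_alloc_space(const WriteStats *s,
                                          uint64_t data_bytes,
                                          uint64_t cow_bytes)
{
    return predict_zero_write_time(s, cow_bytes) +
           predict_write_time(s, data_bytes) <
           predict_write_time(s, cow_bytes);
}

int main(void)
{
    /* Made-up figures: zeroing assumed much cheaper per byte than writing. */
    WriteStats s = { .write_ns_per_byte = 2.0, .zero_ns_per_byte = 0.05 };

    update_ema(&s.zero_ns_per_byte, 0.04);  /* e.g. after a completed zero write */

    printf("4K write, 1M COW area -> handle_alloc_space: %s\n",
           should_use_handle_alloc_space(&s, 4096, 1 << 20) ? "yes" : "no");
    printf("1M write, 1M COW area -> handle_alloc_space: %s\n",
           should_use_handle_alloc_space(&s, 1 << 20, 1 << 20) ? "yes" : "no");
    return 0;
}

A possible refinement would be to fall back to fixed ratio/length thresholds,
as Max suggests above, while the timing statistics are still warming up.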