25.10.2019 12:58, Max Reitz wrote:
> Hi,
>
> It seems to me that there is a bug in Linux’s XFS kernel driver, as
> I’ve explained here:
>
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01429.html
>
> In combination with our commit c8bb23cbdbe32f, this may lead to guest
> data corruption when using qcow2 images on XFS with aio=native.
>
> We can’t wait until the XFS kernel driver is fixed; we should work
> around the problem ourselves.
>
> This is an RFC for three reasons:
> (1) I don’t know whether this is the right way to address the issue.
> (2) Ideally, we should detect whether the XFS kernel driver is fixed
>     and, if so, stop applying the workaround.
>     I don’t know how we would go about this, so this series doesn’t do
>     it.  (Hence it’s an RFC.)
> (3) Perhaps it’s a bit of a layering violation to let the file-posix
>     driver access and modify a BdrvTrackedRequest object.
>
> As for how we can address the issue, I see three ways:
> (1) The one presented in this series: On XFS with aio=native, we extend
>     tracked requests for post-EOF fallocate() calls (i.e., write-zero
>     operations) to reach until infinity (INT64_MAX in practice), mark
>     them serializing and wait for other conflicting requests.
>
>     Advantages:
>     + Limits the impact to very specific cases
>       (And that means it wouldn’t hurt too much to keep this workaround
>       even when the XFS driver has been fixed)
>     + Works around the bug where it happens, namely in file-posix
>
>     Disadvantages:
>     - A bit complex
>     - A bit of a layering violation (should file-posix have access to
>       tracked requests?)
>
> (2) Always skip qcow2’s handle_alloc_space() on XFS.  The XFS bug only
>     becomes visible due to that function: I don’t think qcow2 writes
>     zeroes in any other I/O path, and raw images are fixed in size so
>     post-EOF writes won’t happen.
>
>     Advantages:
>     + Maybe simpler, depending on how difficult it is to handle the
>       layering violation
>     + Also fixes the performance problem of handle_alloc_space() being
>       slow on ppc64+XFS
>
>     Disadvantages:
>     - Huge layering violation, because qcow2 would need to know whether
>       the image is stored on XFS or not
>     - We’d definitely want to skip this workaround once the XFS driver
>       has been fixed, so we need some method to find out whether it has
>
> (3) Drop handle_alloc_space(), i.e. revert c8bb23cbdbe32f.
>     To my knowledge I’m the only one who has provided any benchmarks for
>     this commit, and even then I was a bit skeptical because it performs
>     well in some cases and badly in others.  I concluded that it’s
>     probably worth it because the “some cases” are more likely to occur.
>
>     Now we have this problem of corruption here (granted, due to a bug
>     in the XFS driver), and another report of massively degraded
>     performance on ppc64
>     (https://bugzilla.redhat.com/show_bug.cgi?id=1745823 – sorry, a
>     private BZ; I hate that :-/  The report is about 40 % worse
>     performance for an in-guest fio write benchmark.)
>
>     So I have to ask what the justification for keeping c8bb23cbdbe32f
>     is.  How much does it actually improve performance?  (On
>     non-(ppc64+XFS) machines, obviously.)
>
>     Advantages:
>     + Trivial
>     + No layering violations
>     + We wouldn’t need to keep track of whether the kernel bug has been
>       fixed or not
>     + Fixes the ppc64+XFS performance problem
>
>     Disadvantages:
>     - Reverts cluster allocation performance to pre-c8bb23cbdbe32f
>       levels, whatever that means
>
> So this is the main reason this is an RFC: What should we do?  Is (1)
> really the best choice?
>
> In any case, I’ve run the test case I showed in
> https://lists.nongnu.org/archive/html/qemu-block/2019-10/msg01282.html
> more than ten times with this series applied, and the installation
> succeeded every time.  (Without this series, it fails roughly every
> other time.)
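
For reference, the widening-and-serializing idea in (1) boils down to something
like the following toy model.  This is an illustration only, not QEMU’s actual
request-tracking code; the struct and helper names are invented for the
example, and the INT64_MAX widening mirrors the description above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-in for a tracked request: an offset/length range plus a
 * "serializing" flag that forces overlapping requests to wait. */
typedef struct ToyTrackedRequest {
    int64_t offset;
    int64_t bytes;
    bool serializing;
} ToyTrackedRequest;

/* Widen a post-EOF write-zeroes (fallocate) request so that it guards
 * everything from its start offset up to INT64_MAX, and mark it
 * serializing, as described in (1). */
static void widen_post_eof_zeroes(ToyTrackedRequest *req)
{
    req->bytes = INT64_MAX - req->offset;
    req->serializing = true;
}

/* Does a new request [offset, offset + bytes) overlap the tracked one? */
static bool conflicts_with(const ToyTrackedRequest *req,
                           int64_t offset, int64_t bytes)
{
    return offset < req->offset + req->bytes &&
           req->offset < offset + bytes;
}

int main(void)
{
    /* A 64K write-zeroes starting at the old EOF (say, 1 GiB). */
    ToyTrackedRequest zeroes = { .offset = 1LL << 30, .bytes = 64 * 1024 };
    widen_post_eof_zeroes(&zeroes);

    /* A concurrent guest write 10 MiB past the old EOF now conflicts, so
     * it would have to wait for the zeroing request to finish. */
    int64_t write_ofs = (1LL << 30) + 10 * 1024 * 1024;
    printf("conflicts: %d\n", conflicts_with(&zeroes, write_ofs, 4096));

    return 0;
}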
Hi!

First, great thanks for your investigation!

We need commit c8bb23cbdbe3: we use 1M clusters, and zeroing a whole 1M
cluster takes a significant amount of time.

I've tested a bit.  The test:

for img in /ssd/test.img /test.img; do
  for cl in 64K 1M; do
    for step in 4K 64K 1M; do
      ./qemu-img create -f qcow2 -o cluster_size=$cl $img 15G > /dev/null
      printf '%-15s%-7s%-10s : ' $img cl=$cl step=$step
      ./qemu-img bench -c $((15 * 1024)) -n -s 4K -S $step -t none -w $img | tail -1 | awk '{print $4}'
    done
  done
done

on master:

/ssd/test.img  cl=64K step=4K    : 0.291
/ssd/test.img  cl=64K step=64K   : 0.813
/ssd/test.img  cl=64K step=1M    : 2.799
/ssd/test.img  cl=1M  step=4K    : 0.217
/ssd/test.img  cl=1M  step=64K   : 0.332
/ssd/test.img  cl=1M  step=1M    : 0.685
/test.img      cl=64K step=4K    : 1.751
/test.img      cl=64K step=64K   : 14.811
/test.img      cl=64K step=1M    : 18.321
/test.img      cl=1M  step=4K    : 0.759
/test.img      cl=1M  step=64K   : 13.574
/test.img      cl=1M  step=1M    : 28.970

rerun on master:

/ssd/test.img  cl=64K step=4K    : 0.295
/ssd/test.img  cl=64K step=64K   : 0.803
/ssd/test.img  cl=64K step=1M    : 2.921
/ssd/test.img  cl=1M  step=4K    : 0.233
/ssd/test.img  cl=1M  step=64K   : 0.321
/ssd/test.img  cl=1M  step=1M    : 0.762
/test.img      cl=64K step=4K    : 1.873
/test.img      cl=64K step=64K   : 15.621
/test.img      cl=64K step=1M    : 18.428
/test.img      cl=1M  step=4K    : 0.883
/test.img      cl=1M  step=64K   : 13.484
/test.img      cl=1M  step=1M    : 26.244

on master + revert of c8bb23cbdbe32f5c326:

/ssd/test.img  cl=64K step=4K    : 0.395
/ssd/test.img  cl=64K step=64K   : 4.231
/ssd/test.img  cl=64K step=1M    : 5.598
/ssd/test.img  cl=1M  step=4K    : 0.352
/ssd/test.img  cl=1M  step=64K   : 2.519
/ssd/test.img  cl=1M  step=1M    : 38.919
/test.img      cl=64K step=4K    : 1.758
/test.img      cl=64K step=64K   : 9.838
/test.img      cl=64K step=1M    : 13.384
/test.img      cl=1M  step=4K    : 1.849
/test.img      cl=1M  step=64K   : 19.405
/test.img      cl=1M  step=1M    : 157.090

rerun:

/ssd/test.img  cl=64K step=4K    : 0.407
/ssd/test.img  cl=64K step=64K   : 3.325
/ssd/test.img  cl=64K step=1M    : 5.641
/ssd/test.img  cl=1M  step=4K    : 0.346
/ssd/test.img  cl=1M  step=64K   : 2.583
/ssd/test.img  cl=1M  step=1M    : 39.692
/test.img      cl=64K step=4K    : 1.727
/test.img      cl=64K step=64K   : 10.058
/test.img      cl=64K step=1M    : 13.441
/test.img      cl=1M  step=4K    : 1.926
/test.img      cl=1M  step=64K   : 19.738
/test.img      cl=1M  step=1M    : 158.268

So c8bb23cbdbe32f5c326 clearly matters with a 1M cluster size, even on the
rotational disk (/test.img, step=1M: ~29 s on master vs. ~157 s with the
revert).  That means the earlier assumption that handle_alloc_space() only
pays off on SSDs is wrong; we need a smarter heuristic (a rough sketch of
one possible check follows below).

So, I'd prefer (1) or (2).

--
Best regards,
Vladimir
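
One possible shape for such a heuristic, sketched as a toy function.  This is
purely hypothetical: the function name and the 256K threshold are invented for
illustration, and this is not qcow2's actual code.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy heuristic: take the write-zeroes shortcut only when the padding
 * around the guest data (roughly cluster_size minus the bytes actually
 * written) is large in absolute terms.  The 256K threshold is arbitrary;
 * it only serves to separate the 1M-cluster case, where the numbers above
 * show a clear win, from the 64K-cluster case on the rotational disk,
 * where it was a loss. */
static bool worth_using_write_zeroes(int64_t cluster_size, int64_t data_bytes)
{
    const int64_t min_padding = 256 * 1024;

    return cluster_size - data_bytes >= min_padding;
}

int main(void)
{
    /* 4K write into a 1M cluster: ~1M of padding, use write-zeroes. */
    printf("1M cluster, 4K write : %d\n",
           worth_using_write_zeroes(1024 * 1024, 4 * 1024));

    /* 4K write into a 64K cluster: only ~60K of padding, plain write. */
    printf("64K cluster, 4K write: %d\n",
           worth_using_write_zeroes(64 * 1024, 4 * 1024));

    return 0;
}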