Control: tags -1 -unreproducible

Hi Russell,
Thank you for providing more info.  Now I see where you're running into
known limitations with btrfs (all versions).  Reply follows inline.
BTW, you're not using SMR and/or USB disks, right?

On Mon, Feb 25, 2019 at 12:33:51PM -0600, Russell Mosemann wrote:
> Steps to reproduce
>
> Simply copying a file into the file system can cause things to lock up. In
> this case, the files will usually be thin-provisioned qcow2 disks for kvm
> vm's. There is no detailed formula to force the lockup to occur, but it
> happens regularly, sometimes multiple times in one day.
>

Have you read https://wiki.debian.org/Btrfs ?  Specifically "COW on COW:
Don't do it!"?  If you did read it, maybe the document needs to be more
firm about this, eg: "take care to use raw images" should be "under no
circumstances use non-raw images".

P.S. Yes, I know that page would benefit from a reorganisation...  Sorry
about its current state.

> Files are often copied from a master by reference (cp --reflink), one per
> day to perform a daily backup for up to 45 days. Removing older files is a
> painfully slow process, even though there are only 45 files in the
> directory. Doing a scrub is almost a sure way to lock up the system,
> especially if a copy or delete operation is in progress. On two systems,
> crashes occur with 4.18 and 4.19 but not 4.17. On the other systems that
> crash, it does not seem to matter if it is 4.17, 4.18 or 4.19.
>

It might be that >4.17 fixed some corner-case corruption issue, for
example by adding an additional check during each step of a backref
walk, and that this makes the timeout more frequent and severe; eg: 4.17
works because it is less strict.

By the way, is it your VM host that locks up, or your VM guests?  Does
it (or do they) recover if you leave it alone for many hours?  I didn't
see any oopses or panics in your kernel logs.

Reflinked files are like snapshots: any I/O on a file must walk every
branch of the backref tree that is relevant to that file.  For more
info, see:
https://btrfs.wiki.kernel.org/index.php/Resolving_Extent_Backrefs

As the tree grows and becomes more complex, a COW fs will get slower.
You've hit the >120sec threshold due to one or more of the issues
discussed in this email.  To be clear, a scrub, even during a file
copy/delete, should never cause this timeout; I haven't experienced one
since linux-4.4.x or 4.9.x...

To get a figure that will provide a sense of scale for how many
operations it takes to do anything other than reflink or snapshot, you
can consult the output of:

  filefrag each_live_copy_vm_image

I expect the number of extents will exceed tens of thousands.

BTW, you can use btrfsmaintenance to periodically defrag the source
(and only the source) images.  Note that this will break reflinks
between SOURCE and each of the 45 REFLINKED-COPIES, but not between
REFLINK-COPY1 and REFLINK-COPY2.  Defragging weekly strikes a nice
balance between lost space efficiency (due to fewer shared references
between today's backup and yesterday's) and avoiding the performance
issue you've encountered.  Mounting with autodefrag is the least space
efficient.  (P.S. Also, I don't trust autodefrag.)

IIRC btrfs-debug-tree can accurately count references.

> Unless otherwise indicated
>
> using qgroups: No
>

Whew, thank you for not! :-)  Qgroups make this kind of issue worse.

> using compression: Yes, compress-force=zstd
>

If userspace CPU usage is already high, then compression may introduce
additional latency and contribute to the >120sec warning.

> number of snapshots: Zero
>

But 45 reflinked copies per VM.
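To make the filefrag and defrag suggestions above concrete, here is a
minimal sketch.  The path and image name are invented, so adjust them
to your layout and treat this as an illustration rather than a tested
recipe:

  # Count extents on the live (source) image; tens of thousands means
  # heavy fragmentation:
  filefrag /usr/local/data/datastore2/vm0-master.qcow2

  # Defragment only the source image.  This breaks reflinks between the
  # source and its dated copies, but not between the copies themselves:
  btrfs filesystem defragment -v -t 32M \
      /usr/local/data/datastore2/vm0-master.qcow2

btrfsmaintenance can run the same defrag on a schedule; if I remember
its configuration correctly, the relevant settings are
BTRFS_DEFRAG_PATHS and BTRFS_DEFRAG_PERIOD.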
> number of subvolumes: top level subvolume only
>

I believe it was Chris Murphy who wrote (on linux-btrfs) about how
segmenting different functions/datasets into different,
non-hierarchically structured (eg: flat layout) subvolumes reduces lock
contention during backref walks.  This is a performance tuning tip that
needs to be investigated and integrated into the wiki article.  Eg:

            _____id:5 top-level_____       <- either unmounted, or
           /     |     |     |      \         mounted somewhere like
          /      |     |     |       \        /.btrfs-admin, /.volume, etc.
         /       |     |     |        \
  host_rootfs   VM0   VM1   VM2   data_shared_between_VMs

> raid profile: None
>
> using bcache: No
>

Thanks.

> layered on md or lvm: No
>

But layered on hardware raid?

> vhost002
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1] Thank you for using noatime.

Please explain this configuration; eg: does each VM have three virtual
disks (containing btrfs volumes), backed by qcow2 images, backed by a
btrfs volume on the VM host?  I thought you were using qcow2 images,
but this looks like passthrough of some kind.

If the former: every write causes a COW operation in the inner btrfs,
in the qcow2 image, and in the outer btrfs volume.  The inner btrfs
volume compresses once, and then the outer btrfs volume will try to
compress again.

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

[2] Have you seen any SATA resets in your logs?  The default kernel
timeout is 30sec, and drives without SCT ERC can sometimes take an
undefined (though generally under 180sec) amount of time to reattempt
to read a block...and if it's an SMR drive with writing I/O, the delay
before a successful read can be even worse.

> vhost003
>
> # grep btrfs /etc/mtab
> /dev/sdb4 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1] Also, why aren't you using noatime here too?

> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>

[4] Ok, so sdb is the raid controller on the host?  And sdb is passed
through, and you shut down one VM before mounting the same
btrfs-on-hardware_RAID partition in another VM?

> vhost004
>
> # grep btrfs /etc/mtab
> /dev/sdb4 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [1].  Also, why aren't you using noatime here too?

> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
>
> # btrfs dev stats /usr/local/data/datastore2
> [/dev/sdb4].write_io_errs 0
> [/dev/sdb4].read_io_errs 0
> [/dev/sdb4].flush_io_errs 0
> [/dev/sdb4].corruption_errs 0
> [/dev/sdb4].generation_errs 0
>

[4] So vhost003 and vhost004 mount the same partition from the raid
controller on the host via passthrough?  At the same time?
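A quick aside on [2] before going through the remaining hosts: if any
of these devices are plain SATA drives (rather than virtual disks or
volumes hidden behind a RAID controller), the usual mitigation is
either to cap the drive's internal error recovery with SCT ERC or to
raise the kernel's per-command timeout.  The device name below is only
an example:

  # Limit the drive's error recovery to 7 seconds for reads and writes
  # (values are in tenths of a second), if the firmware supports it:
  smartctl -l scterc,70,70 /dev/sdc

  # Otherwise, raise the kernel's command timeout (in seconds) above
  # the drive's worst-case recovery time:
  echo 180 > /sys/block/sdc/device/timeout

Neither setting survives a reboot or power cycle, so it would have to
be reapplied from a udev rule or boot script.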
> vhost031
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [4]

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

[2]

> vhost032
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [4]

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

Is this sdc a qcow2 image or a passed-through megaraid partition?

> vhost182
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1]

[snip]

> lxc008
>
> number of subvolumes: 1416

That's *way* too many.  This is a major contributing factor to the
timeouts...

[snip]

> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

Is this in a container rather than a VM?

> (RAID controller)
>
> # smartctl -d megaraid,0 -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> Write SCT (Get) Error Recovery Control Command failed: ATA return
> descriptor not supported by controller firmware
> SCT (Get) Error Recovery Control command failed
>

A different raid controller?  Aiie, this is a complex setup...

> lxc009
>
> # grep btrfs /etc/mtab
> /dev/sda3 / btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0
> /dev/sdb1 /usr/local/data2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>
>
> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
>
> # btrfs dev stats /usr/local/data2
> [/dev/sdb1].write_io_errs 0
> [/dev/sdb1].read_io_errs 0
> [/dev/sdb1].flush_io_errs 0
> [/dev/sdb1].corruption_errs 0
> [/dev/sdb1].generation_errs 0
>

Ok, first decide where you want to reflink/snapshot: either inside the
VMs or outside.

If inside:
 * Host your VM images on an ext4 or xfs partition.
 * Use btrfs inside the VM.
   - Use noatime inside the VM.
 * To get backups onto the host, use the network or a passed-through
   partition.

If outside:
 * Host raw VM images on btrfs (noatime).
 * Use ext4 or xfs inside the VM.
   - Everything is already COWed, checksummed, and compressed on the VM
     host, so none of that is needed inside the guest.
   - Also use noatime inside the VM.
 * Periodically defrag the live copy of your VM images.

! Note that many on the linux-btrfs mailing list do not recommend btrfs
  for this type of workload, if performance is important.
? Maybe the partition passthrough is how you're getting around this
  issue?

Hacky "it's too late to rethink this server" option: use chattr +C on
the VM images.  Note that the images will no longer be checksummed (see
wiki).  A rough sketch follows in the P.S. below.

Maybe I've misunderstood, but it looks like you're running btrfs
volumes, on top of qcow2 images, on top of a btrfs host volume.
That's an easy-to-reproduce recipe for problems of this kind.

Sincerely,
Nicholas
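P.S. Regarding the chattr +C workaround above: on btrfs, the C
(nodatacow) attribute only takes effect reliably on new or empty files,
so the usual pattern is to pre-create an empty file with +C and fill it
while the VM is shut down.  The paths and filenames below are invented,
so treat this as a sketch rather than a tested recipe:

  # Create an empty destination file and mark it NOCOW before it
  # receives any data:
  touch /usr/local/data/datastore2/vm0-nodatacow.qcow2
  chattr +C /usr/local/data/datastore2/vm0-nodatacow.qcow2

  # Copy the existing image into it; cp truncates the existing
  # destination rather than replacing the inode, so the +C attribute is
  # preserved (--sparse keeps the image thin):
  cp --sparse=always /usr/local/data/datastore2/vm0.qcow2 \
      /usr/local/data/datastore2/vm0-nodatacow.qcow2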