Control: tags -1 -unreproducible

Hi Russell,
Thank you for providing more info.  Now I see where you're running into
known limitations with btrfs (all versions).  Reply follows inline.
BTW, you're not using SMR and/or USB disks, right?

On Mon, Feb 25, 2019 at 12:33:51PM -0600, Russell Mosemann wrote:
> Steps to reproduce
>
> Simply copying a file into the file system can cause things to lock up. In
> this case, the files will usually be thin-provisioned qcow2 disks for kvm
> vm's. There is no detailed formula to force the lockup to occur, but it
> happens regularly, sometimes multiple times in one day.
>

Have you read https://wiki.debian.org/Btrfs ?  Specifically "COW on COW:
Don't do it!"?  If you did read it, maybe the document needs to be more
firm about this, eg: "take care to use raw images" should be "under no
circumstances use non-raw images".

P.S. Yes, I know that page would benefit from a reorganisation...  Sorry
about its current state.

> Files are often copied from a master by reference (cp --reflink), one per
> day to perform a daily backup for up to 45 days. Removing older files is a
> painfully slow process, even though there are only 45 files in the
> directory. Doing a scrub is almost a sure way to lock up the system,
> especially if a copy or delete operation is in progress. On two systems,
> crashes occur with 4.18 and 4.19 but not 4.17. On the other systems that
> crash, it does not seem to matter if it is 4.17, 4.18 or 4.19.
>

It might be that >4.17 fixed some corner-case corruption issue, for
example by adding an additional check during each step of a backref
walk, and that this makes the timeout more frequent and severe; eg: 4.17
works because it is less strict.

By the way, is it your VM host that locks up, or your VM guests?  Does
it (or do they) recover if you leave it alone for many hours?  I didn't
see any oopses or panics in your kernel logs.

Reflinked files are like snapshots: any I/O on a file must walk every
branch of the backref tree that is relevant to that file.  For more
info, see:
https://btrfs.wiki.kernel.org/index.php/Resolving_Extent_Backrefs

As the tree grows and becomes more complex, a COW fs will get slower.
You've hit the >120sec threshold due to one or more of the issues
discussed in this email.  To be clear, a scrub, even during a file
copy/delete, should never cause this timeout; I haven't experienced one
since linux-4.4.x or 4.9.x...

To get a figure that will provide a sense of scale for how many
operations it takes to do anything other than reflink or snapshot, you
can consult the output of:

  filefrag each_live_copy_vm_image

I expect the number of extents will exceed tens of thousands.

BTW, you can use btrfsmaintenance to periodically defrag the source
(and only the source) images.  Note that this will break reflinks
between SOURCE and each of the 45 REFLINKED-COPIES, but not between
REFLINK-COPY1 and REFLINK-COPY2.  Defragging weekly strikes a nice
balance between lost space efficiency (due to fewer shared references
between today's backup and yesterday's) and avoiding the performance
issue you've encountered.  Mounting with autodefrag is the least space
efficient.  (P.S. Also, I don't trust autodefrag.)

IIRC btrfs-debug-tree can accurately count references.

> Unless otherwise indicated
>
> using qgroups: No
>

Whew, thank you for not! :-)  Qgroups make this kind of issue worse.

> using compression: Yes, compress-force=zstd
>

If userspace CPU usage is already high, then compression may introduce
additional latency and contribute to the >120sec warning.

> number of snapshots: Zero
>

But 45 reflinked copies per VM.
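To make the filefrag and defrag suggestions above concrete, here is a
minimal sketch.  The path and image name are invented, so adjust them
to your layout and treat this as an illustration rather than a tested
recipe:

  # Count extents on the live (source) image; tens of thousands means
  # heavy fragmentation:
  filefrag /usr/local/data/datastore2/vm0-master.qcow2

  # Defragment only the source image.  This breaks reflinks between the
  # source and its dated copies, but not between the copies themselves:
  btrfs filesystem defragment -v -t 32M \
      /usr/local/data/datastore2/vm0-master.qcow2

btrfsmaintenance can run the same defrag on a schedule; if I remember
its configuration correctly, the relevant settings are
BTRFS_DEFRAG_PATHS and BTRFS_DEFRAG_PERIOD.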
> number of subvolumes: top level subvolume only
>

I believe it was Chris Murphy who wrote (on linux-btrfs) about how
segmenting different functions/datasets into different,
non-hierarchically structured (eg: flat layout) subvolumes reduces lock
contention during backref walks.  This is a performance tuning tip that
needs to be investigated and integrated into the wiki article.  Eg:

            _____id:5 top-level_____       <- either unmounted, or
           /     |     |     |      \         mounted somewhere like
          /      |     |     |       \        /.btrfs-admin, /.volume, etc.
         /       |     |     |        \
  host_rootfs   VM0   VM1   VM2   data_shared_between_VMs

> raid profile: None
>
> using bcache: No
>

Thanks.

> layered on md or lvm: No
>

But layered on hardware raid?

> vhost002
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1] Thank you for using noatime.

Please explain this configuration; eg: does each VM have three virtual
disks (containing btrfs volumes), backed by qcow2 images, backed by a
btrfs volume on the VM host?  I thought you were using qcow2 images,
but this looks like passthrough of some kind.

If the former: every write causes a COW operation in the inner btrfs,
in the qcow2 image, and in the outer btrfs volume.  The inner btrfs
volume compresses once, and then the outer btrfs volume will try to
compress again.

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

[2] Have you seen any SATA resets in your logs?  The default kernel
timeout is 30sec, and drives without SCT ERC can sometimes take an
undefined (though generally under 180sec) amount of time to reattempt
to read a block...and if it's an SMR drive with writing I/O, the delay
before a successful read can be even worse.

> vhost003
>
> # grep btrfs /etc/mtab
> /dev/sdb4 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1] Also, why aren't you using noatime here too?

> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>

[4] Ok, so sdb is the raid controller on the host?  And sdb is passed
through, and you shut down one VM before mounting the same
btrfs-on-hardware_RAID partition in another VM?

> vhost004
>
> # grep btrfs /etc/mtab
> /dev/sdb4 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [1].  Also, why aren't you using noatime here too?

> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.17.0-0.bpo.1-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
>
> # btrfs dev stats /usr/local/data/datastore2
> [/dev/sdb4].write_io_errs 0
> [/dev/sdb4].read_io_errs 0
> [/dev/sdb4].flush_io_errs 0
> [/dev/sdb4].corruption_errs 0
> [/dev/sdb4].generation_errs 0
>

[4] So vhost003 and vhost004 mount the same partition from the raid
controller on the host via passthrough?  At the same time?
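A quick aside on [2] before going through the remaining hosts: if any
of these devices are plain SATA drives (rather than virtual disks or
volumes hidden behind a RAID controller), the usual mitigation is
either to cap the drive's internal error recovery with SCT ERC or to
raise the kernel's per-command timeout.  The device name below is only
an example:

  # Limit the drive's error recovery to 7 seconds for reads and writes
  # (values are in tenths of a second), if the firmware supports it:
  smartctl -l scterc,70,70 /dev/sdc

  # Otherwise, raise the kernel's command timeout (in seconds) above
  # the drive's worst-case recovery time:
  echo 180 > /sys/block/sdc/device/timeout

Neither setting survives a reboot or power cycle, so it would have to
be reapplied from a udev rule or boot script.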
> vhost031
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [4]

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

[2]

> vhost032
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,relatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[3] [4]

> # smartctl -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
>

Is this sdc a qcow2 image or a passed-through megaraid partition?

> vhost182
>
> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data/datastore2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

[1]

[snip]

> lxc008
>
> number of subvolumes: 1416

That's *way* too many.  This is a major contributing factor to the
timeouts...

[snip]

> # grep btrfs /etc/mtab
> /dev/sdc1 /usr/local/data2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>

Is this in a container rather than a VM?

> (RAID controller)
>
> # smartctl -d megaraid,0 -l scterc /dev/sdc
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
> Write SCT (Get) Error Recovery Control Command failed: ATA return
> descriptor not supported by controller firmware
> SCT (Get) Error Recovery Control command failed
>

A different raid controller?  Aiie, this is a complex setup...

> lxc009
>
> # grep btrfs /etc/mtab
> /dev/sda3 / btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0
> /dev/sdb1 /usr/local/data2 btrfs
> rw,noatime,compress-force=zstd,space_cache,subvolid=5,subvol=/ 0 0
>
>
> (RAID controller)
>
> # smartctl -l scterc /dev/sdb
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local
> build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke,
> www.smartmontools.org
>
>
> # btrfs dev stats /usr/local/data2
> [/dev/sdb1].write_io_errs 0
> [/dev/sdb1].read_io_errs 0
> [/dev/sdb1].flush_io_errs 0
> [/dev/sdb1].corruption_errs 0
> [/dev/sdb1].generation_errs 0
>

Ok, first decide where you want to reflink/snapshot: either inside the
VMs or outside.

If inside:
 * Host your VM images on an ext4 or xfs partition.
 * Use btrfs inside the VM.
   - Use noatime inside the VM.
 * To get backups onto the host, use the network or a passed-through
   partition.

If outside:
 * Host raw VM images on btrfs (noatime).
 * Use ext4 or xfs inside the VM.
   - Everything is already COWed, checksummed, and compressed on the VM
     host, so none of that is needed inside the guest.
   - Also use noatime inside the VM.
 * Periodically defrag the live copy of your VM images.

! Note that many on the linux-btrfs mailing list do not recommend btrfs
  for this type of workload, if performance is important.
? Maybe the partition passthrough is how you're getting around this
  issue?

Hacky "it's too late to rethink this server" option: use chattr +C on
the VM images.  Note that the images will no longer be checksummed (see
wiki).  A rough sketch follows in the P.S. below.

Maybe I've misunderstood, but it looks like you're running btrfs
volumes, on top of qcow2 images, on top of a btrfs host volume.
That's an easy-to-reproduce recipe for problems of this kind.

Sincerely,
Nicholas
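P.S. Regarding the chattr +C workaround above: on btrfs, the C
(nodatacow) attribute only takes effect reliably on new or empty files,
so the usual pattern is to pre-create an empty file with +C and fill it
while the VM is shut down.  The paths and filenames below are invented,
so treat this as a sketch rather than a tested recipe:

  # Create an empty destination file and mark it NOCOW before it
  # receives any data:
  touch /usr/local/data/datastore2/vm0-nodatacow.qcow2
  chattr +C /usr/local/data/datastore2/vm0-nodatacow.qcow2

  # Copy the existing image into it; cp truncates the existing
  # destination rather than replacing the inode, so the +C attribute is
  # preserved (--sparse keeps the image thin):
  cp --sparse=always /usr/local/data/datastore2/vm0.qcow2 \
      /usr/local/data/datastore2/vm0-nodatacow.qcow2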