I/O Error Test 6 (for the Cosmic kernel)
================

commit: 'Revert "bcache: set CACHE_SET_IO_DISABLE in bch_cached_dev_error()"'

Problem: if one backing device hits I/O errors, the cache device is disabled; if that cache device is shared by other bcache devices, they are stopped too (even though their backing devices are not failing).

Original kernel: all bcache devices that share a cache device with the failing backing device are stopped.
Modified kernel: only the bcache device with the failing backing device is stopped.

Original kernel
---------------

root@guest-bcache:~# uname -rv
4.18.0-23-generic #24-Ubuntu SMP Wed Jun 12 18:17:39 UTC 2019

root@guest-bcache:~# lsblk -e 252
root@guest-bcache:~#

root@guest-bcache:~# ./setup-two-bcache-one-cache.sh >/dev/null 2>&1
[   35.686002] bcache: register_bdev() registered backing device dm-0
[   35.695980] bcache: register_bdev() registered backing device dm-1
[   35.704662] bcache: run_cache_set() invalidating existing data
[   35.719046] bcache: register_cache() registered cache device dm-2
[   36.705686] bcache: bch_cached_dev_attach() Caching dm-0 as bcache0 on set fce8d558-4657-47dc-ab37-226ada14daf5
[   36.711827] bcache: bch_cached_dev_attach() Caching dm-1 as bcache1 on set fce8d558-4657-47dc-ab37-226ada14daf5

root@guest-bcache:~# lsblk -e 252
NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0            7:0    0    1G  0 loop
└─fake-loop0   253:0    0 1024M  0 dm
  └─bcache0    251:0    0 1024M  0 disk
loop1            7:1    0    1G  0 loop
└─fake-loop1   253:1    0 1024M  0 dm
  └─bcache1    251:128  0 1024M  0 disk
loop2            7:2    0    1G  0 loop
└─fake-loop2   253:2    0 1024M  0 dm
  ├─bcache0    251:0    0 1024M  0 disk
  └─bcache1    251:128  0 1024M  0 disk

root@guest-bcache:~# echo writeback | tee /sys/block/dm-*/bcache/cache_mode
writeback

root@guest-bcache:~# cat /sys/block/dm-*/bcache/cache_mode
writethrough [writeback] writearound none
writethrough [writeback] writearound none

root@guest-bcache:~# ./dm_fake_dev.sh /dev/loop0 bad
[   76.875749] Buffer I/O error on dev dm-0, logical block 262128, async page read
[   76.882159] Buffer I/O error on dev dm-0, logical block 262128, async page read
[   76.889453] bcache: register_bcache() error /dev/dm-0: device already registered (emitting change event)
[   76.892183] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[   76.904907] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[   76.907711] Buffer I/O error on dev bcache0, logical block 262112, async page read
[   76.912607] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[   76.916905] Buffer I/O error on dev bcache0, logical block 262112, async page read
[   76.920345] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[   76.924767] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[   76.928404] Buffer I/O error on dev bcache0, logical block 1, async page read

root@guest-bcache:~# dd if=/dev/zero of=/dev/bcache1 bs=4k & dd if=/dev/zero of=/dev/bcache0 bs=4k &
[  175.024811] Buffer I/O error on dev bcache0, logical block 0, lost async page write
[  175.029844] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.034652] Buffer I/O error on dev bcache0, logical block 1, lost async page write
[  175.037465] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.040373] Buffer I/O error on dev bcache0, logical block 2, lost async page write
...
[  175.092196] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.096635] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.101272] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.105829] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
...
[  175.235700] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  175.239457] bcache: bch_cached_dev_error() stop bcache0: too many IO errors on backing device dm-0
[  175.239457]
[  175.324069] bcache: bch_cache_set_error() CACHE_SET_IO_DISABLE already set
[  175.328998] bcache: error on fce8d558-4657-47dc-ab37-226ada14daf5:
[  175.328999] journal io error
[  175.331022] , disabling caching
[  175.334264] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache0 is "auto" and cache is dirty, stop it to avoid potential data corruption.
[  175.338865] bcache: conditional_stop_bcache_device() stop_when_cache_set_failed of bcache1 is "auto" and cache is dirty, stop it to avoid potential data corruption.
[  175.344097] bcache: cached_dev_detach_finish() Caching disabled for dm-1
[  176.080139] bcache: bcache_device_free() bcache0 stopped
[  176.083928] bcache: bch_count_io_errors() dm-2: IO error on writing btree.
[  176.188371] bcache: cache_set_free() Cache set fce8d558-4657-47dc-ab37-226ada14daf5 unregistered
[  176.841497] bcache: bcache_device_free() bcache1 stopped
dd: error writing '/dev/bcache0': No space left on device
262142+0 records in
262141+0 records out
1073729536 bytes (1.1 GB, 1.0 GiB) copied, 1.81834 s, 591 MB/s
dd: error writing '/dev/bcache1': No space left on device
262142+0 records in
262141+0 records out
1073729536 bytes (1.1 GB, 1.0 GiB) copied, 2.5749 s, 417 MB/s

[1]-  Exit 1                  dd if=/dev/zero of=/dev/bcache1 bs=4k
[2]+  Exit 1                  dd if=/dev/zero of=/dev/bcache0 bs=4k

root@guest-bcache:~# lsblk -e 252
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0          7:0    0    1G  0 loop
loop1          7:1    0    1G  0 loop
└─fake-loop1 253:1    0 1024M  0 dm
loop2          7:2    0    1G  0 loop
└─fake-loop2 253:2    0 1024M  0 dm
fake-loop0   253:0    0    1G  0 dm

Notice that bcache0 and bcache1 are missing.

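The check above (which bcache devices survive, and which ones the kernel reported as stopped) can also be scripted rather than read off the listings by eye. The following is only a minimal sketch, assuming a POSIX shell with grep (-o and -E support) and sed; the function names list_bcache_devs and list_stopped_devs are hypothetical helpers, not part of bcache-tools or of the test scripts used here. Both read text on stdin, so they can be fed live output or a saved log.

```shell
# Hypothetical helpers for summarizing the test outcome.
# Both are plain text filters: they read stdin and print one name per line.

# Print the unique bcacheN names that appear in lsblk output.
list_bcache_devs() {
    grep -oE 'bcache[0-9]+' | sort -u
}

# Print the unique bcacheN names that the kernel log reports as stopped,
# keyed on the "bcache_device_free() bcacheN stopped" message seen above.
list_stopped_devs() {
    sed -n 's/.*bcache_device_free() \(bcache[0-9][0-9]*\) stopped.*/\1/p' | sort -u
}
```

Possible usage on the test guest: `lsblk -e 252 | list_bcache_devs` and `dmesg | list_stopped_devs`. With the original kernel log shown above, the second command would list both bcache0 and bcache1.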
Modified kernel
---------------

root@guest-bcache:~# uname -rv
4.18.0-23-generic #24+test20190627b1 SMP Thu Jun 27 13:29:22 UTC 2019

root@guest-bcache:~# lsblk -e 252
root@guest-bcache:~#

root@guest-bcache:~# ./setup-two-bcache-one-cache.sh >/dev/null 2>&1
[  146.600391] bcache: register_bdev() registered backing device dm-0
[  146.608618] bcache: register_bdev() registered backing device dm-1
[  146.617808] bcache: run_cache_set() invalidating existing data
[  146.632355] bcache: register_cache() registered cache device dm-2
[  147.615003] bcache: bch_cached_dev_attach() Caching dm-0 as bcache0 on set 6673bcb3-7a64-4675-a82f-59bb66886d66
[  147.633610] bcache: bch_cached_dev_attach() Caching dm-1 as bcache1 on set 6673bcb3-7a64-4675-a82f-59bb66886d66

root@guest-bcache:~# lsblk -e 252
NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0            7:0    0    1G  0 loop
└─fake-loop0   253:0    0 1024M  0 dm
  └─bcache0    251:0    0 1024M  0 disk
loop1            7:1    0    1G  0 loop
└─fake-loop1   253:1    0 1024M  0 dm
  └─bcache1    251:128  0 1024M  0 disk
loop2            7:2    0    1G  0 loop
└─fake-loop2   253:2    0 1024M  0 dm
  ├─bcache0    251:0    0 1024M  0 disk
  └─bcache1    251:128  0 1024M  0 disk

root@guest-bcache:~# echo writeback | tee /sys/block/dm-*/bcache/cache_mode
writeback

root@guest-bcache:~# cat /sys/block/dm-*/bcache/cache_mode
writethrough [writeback] writearound none
writethrough [writeback] writearound none

root@guest-bcache:~# ./dm_fake_dev.sh /dev/loop0 bad
[  174.138534] Buffer I/O error on dev dm-0, logical block 262128, async page read
[  174.145142] Buffer I/O error on dev dm-0, logical block 262128, async page read
[  174.152728] bcache: register_bcache() error /dev/dm-0: device already registered (emitting change event)
[  174.154780] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  174.159945] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  174.162933] Buffer I/O error on dev bcache0, logical block 262112, async page read
[  174.168696] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  174.172368] Buffer I/O error on dev bcache0, logical block 262112, async page read
[  174.175272] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  174.178593] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  174.181896] Buffer I/O error on dev bcache0, logical block 1, async page read

root@guest-bcache:~# dd if=/dev/zero of=/dev/bcache1 bs=4k & dd if=/dev/zero of=/dev/bcache0 bs=4k &
[1] 1377
[2] 1378
[  183.348428] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  183.354587] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  183.360488] Buffer I/O error on dev bcache0, logical block 0, lost async page write
[  183.364666] Buffer I/O error on dev bcache0, logical block 1, lost async page write
[  183.368326] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
...
[  183.430652] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  183.434399] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  183.438198] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
[  183.441991] bcache: bch_count_backing_io_errors() dm-0: IO error on backing device, unrecoverable
...
[  183.635500] bcache: bch_cached_dev_error() stop bcache0: too many IO errors on backing device dm-0
[  183.635500]
[  184.840023] bcache: bcache_device_free() bcache0 stopped
dd: error writing '/dev/bcache0': No space left on device
dd: error writing '/dev/bcache1': No space left on device
262142+0 records in
262141+0 records out
1073729536 bytes (1.1 GB, 1.0 GiB) copied, 2.18238 s, 492 MB/s
262142+0 records in
262141+0 records out
1073729536 bytes (1.1 GB, 1.0 GiB) copied, 3.69895 s, 290 MB/s

[1]-  Exit 1                  dd if=/dev/zero of=/dev/bcache1 bs=4k
[2]+  Exit 1                  dd if=/dev/zero of=/dev/bcache0 bs=4k

root@guest-bcache:~# lsblk -e 252
NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0            7:0    0    1G  0 loop
loop1            7:1    0    1G  0 loop
└─fake-loop1   253:1    0 1024M  0 dm
  └─bcache1    251:128  0 1024M  0 disk
loop2            7:2    0    1G  0 loop
└─fake-loop2   253:2    0 1024M  0 dm
  └─bcache1    251:128  0 1024M  0 disk
fake-loop0     253:0    0    1G  0 dm

Notice that only bcache0 is stopped; bcache1 is still present.

And after reboot, the bcache devices are reattached.

root@guest-bcache:~# dd if=/dev/zero of=/dev/bcache1 bs=4k
dd: error writing '/dev/bcache1': No space left on device
262142+0 records in
262141+0 records out
1073729536 bytes (1.1 GB, 1.0 GiB) copied, 4.79076 s, 224 MB/s
root@guest-bcache:~#

root@guest-bcache:~# reboot

root@guest-bcache:~# ./setup-two-bcache-one-cache.reboot.sh
[  104.421020] bcache: register_bdev() registered backing device dm-0
[  104.492000] bcache: register_bdev() registered backing device dm-1
[  104.685632] bcache: bch_journal_replay() journal replay done, 97526 keys in 57 entries, seq 359
[  104.695263] bcache: bch_cached_dev_attach() Caching dm-1 as bcache1 on set 6673bcb3-7a64-4675-a82f-59bb66886d66
[  104.704708] bcache: bch_cached_dev_attach() Caching dm-0 as bcache0 on set 6673bcb3-7a64-4675-a82f-59bb66886d66
[  104.709640] bcache: register_cache() registered cache device dm-2

root@guest-bcache:~# lsblk -e 252
NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0            7:0    0    1G  0 loop
└─fake-loop0   253:0    0 1024M  0 dm
  └─bcache0    251:0    0 1024M  0 disk
loop1            7:1    0    1G  0 loop
└─fake-loop1   253:1    0 1024M  0 dm
  └─bcache1    251:128  0 1024M  0 disk
loop2            7:2    0    1G  0 loop
└─fake-loop2   253:2    0 1024M  0 dm
  ├─bcache0    251:0    0 1024M  0 disk
  └─bcache1    251:128  0 1024M  0 disk

--
https://bugs.launchpad.net/bugs/1829563

Title:
  bcache: risk of data loss on I/O errors in backing or caching devices

Status in linux package in Ubuntu:
  Invalid
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  [Impact]

   * The bcache code in Bionic lacks several fixes to handle I/O
     errors in both backing devices and caching devices.

   * Partial or permanent errors in backing or caching devices,
     especially in writeback mode, can lead to data loss and/or to
     the application not being notified about failed I/O requests.

   * The bcache device might remain available for I/O requests even
     if the backing device is offline, so the result of writes is
     undefined.

  [Test Case]

   * Detailed test cases/steps for the behavior of many patches
     with code logic changes are provided in bug comments.

   * The patchset has been tested for regressions on each cache mode
     (writethrough, writeback, writearound, none) with the xfstests
     test suite (on ext4) and fio (sequential + random read-write).

  [Regression Potential]

   * The patchset is relatively large and touches several areas in
     the bcache code; however, synthetic testing of the patches has
     been performed, and extensive regression/stress tests were run
     (as mentioned in the Test Case section).

   * Many patches in the patchset are 'Fixes' patches to other
     patches, and no further 'Fixes' currently exist upstream.

  [Other Info]

   * Canonical Field Eng. deploys bcache+writeback extensively
     (e.g., BootStack, UA cloud, except rare all-flash cases).

  [Original Bug Description]

  This is a request for a backport of the following upstream patch
  from 4.18:

  "bcache: stop bcache device when backing device is offline"
  https://github.com/torvalds/linux/commit/0f0709e6bfc3ce4e8e1c0e8573490c45f76cfeee

  Field engineering uses bcache quite extensively and it would be
  good to have this in the GA/bionic kernel.