Performing verification for Noble.

I started an n2-standard-2 instance on Google Cloud, running Noble.

I installed 6.8.0-39-generic from -updates, rebooted, and followed the
instructions in the testcase.

$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size)

Having a look at dmesg:

unknown: run blktests md/001 at 2024-08-08 04:26:39
root[1982]: run blktests md/001
kernel: brd: module loaded
(udev-worker)[1987]: dm-0: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/dm-0' failed with exit code 1.
kernel: Key type psk registered
kernel: nvmet: adding nsid 1 to subsystem blktests-subsystem-1
kernel: nvmet_tcp: enabling port 0 (127.0.0.1:4420)
kernel: nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for 
NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
kernel: nvme nvme1: creating 2 I/O queues.
kernel: nvme nvme1: mapped 2/0/0 default/read/poll queues.
kernel: nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, 
hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
(udev-worker)[2018]: nvme1n1: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/nvme1n1' failed with exit code 1.
(udev-worker)[2018]: md127: Process '/usr/bin/unshare -m /usr/bin/snap 
auto-import --mount=/dev/md127' failed with exit code 1.
kernel: md/raid1:md127: active with 1 out of 2 mirrors
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 0 PID: 50 at net/core/skbuff.c:6995 
skb_splice_from_iter+0x139/0x370
kernel: Modules linked in: nvme_tcp nvmet_tcp nvmet nvme_keyring brd raid1 
cfg80211 8021q garp mrp stp llc binfmt_misc nls_iso8859_1 intel_rapl_msr 
intel_rapl_common intel_uncore_frequency_common isst_if_common nfit 
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic 
ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl 
pvpanic_mmio pvpanic nvme psmouse i2c_piix4 input_leds mac_hid serio_raw 
dm_multipath nvme_fabrics nvme_core nvme_auth efi_pstore nfnetlink dmi_sysfs 
virtio_rng ip_tables x_tables autofs4
kernel: CPU: 0 PID: 50 Comm: kworker/0:1H Not tainted 6.8.0-39-generic 
#39-Ubuntu
kernel: Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
Google 06/27/2024
kernel: Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
kernel: RIP: 0010:skb_splice_from_iter+0x139/0x370
kernel: Code: 39 e1 48 8b 53 08 49 0f 47 cc 49 89 cd f6 c2 01 0f 85 c0 01 00 00 
66 90 48 89 da 48 8b 12 80 e6 08 0f 84 8e 00 00 00 4d 89 fe <0f> 0b 49 c7 c0 fb 
ff ff ff 48 8b 85 68 ff ff ff 41 01 46 70 41 01
kernel: RSP: 0018:ffffbd92001b3a30 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: fffff5f1c48d9b40 RCX: 0000000000001000
kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
kernel: RBP: ffffbd92001b3ad8 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000020e8
kernel: R13: 0000000000001000 R14: ffff96834b496400 R15: ffff96834b496400
kernel: FS:  0000000000000000(0000) GS:ffff968477c00000(0000) 
knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007507bcfe5f84 CR3: 000000010b49c002 CR4: 00000000003706f0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? show_regs+0x6d/0x80
kernel:  ? __warn+0x89/0x160
kernel:  ? skb_splice_from_iter+0x139/0x370
kernel:  ? report_bug+0x17e/0x1b0
kernel:  ? handle_bug+0x51/0xa0
kernel:  ? exc_invalid_op+0x18/0x80
kernel:  ? asm_exc_invalid_op+0x1b/0x20
kernel:  ? skb_splice_from_iter+0x139/0x370
kernel:  tcp_sendmsg_locked+0x352/0xd70
kernel:  ? tcp_push+0x159/0x190
kernel:  ? tcp_sendmsg_locked+0x9c4/0xd70
kernel:  tcp_sendmsg+0x2c/0x50
kernel:  inet_sendmsg+0x42/0x80
kernel:  sock_sendmsg+0x118/0x150
kernel:  nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
kernel:  nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
kernel:  nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
kernel:  process_one_work+0x16c/0x350
kernel:  worker_thread+0x306/0x440
kernel:  ? _raw_spin_unlock_irqrestore+0x11/0x60
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xef/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x44/0x70
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---
kernel: nvme nvme1: failed to send request -5
kernel: nvme nvme1: I/O tag 111 (106f) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
kernel: nvme nvme1: starting error recovery
kernel: block nvme1n1: no usable path - requeuing I/O
kernel: nvme nvme1: Reconnecting in 10 seconds...

blktests md/001 hangs the system in this particular scenario.

I then restarted the instance, enabled -proposed2, and installed
6.8.0-41-generic:

6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC
2024

I can now run md/001 multiple times, and it passes within a second each time.
The hang no longer occurs.

$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime    ...  0.441s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.441s  ...  0.405s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.405s  ...  0.410s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.410s  ...  0.429s
$ sudo ./check md/001
md/001 (Raid with bitmap on tcp nvmet with opt-io-size over bitmap size) 
[passed]
    runtime  0.429s  ...  0.408s
    
dmesg has:

unknown: run blktests md/001 at 2024-08-08 05:02:40
root[2377]: run blktests md/001
kernel: brd: module loaded
kernel: nvmet: adding nsid 1 to subsystem blktests-subsystem-1
kernel: nvmet_tcp: enabling port 0 (127.0.0.1:4420)
kernel: nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for 
NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
kernel: nvme nvme1: creating 2 I/O queues.
kernel: nvme nvme1: mapped 2/0/0 default/read/poll queues.
kernel: nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, 
hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
kernel: md/raid1:md127: active with 1 out of 2 mirrors
kernel: md127: detected capacity change from 0 to 2093056
kernel: md127: detected capacity change from 2093056 to 0
kernel: md: md127 stopped.
kernel: nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
multipathd[190]: nvme1n1: path already removed
kernel: brd: module unloaded
sudo[2342]: pam_unix(sudo:session): session closed for user root

The 6.8.0-41-generic kernel in -proposed2 fixes the issue. Happy to mark
verified for Noble.

** Tags removed: verification-needed-noble-linux
** Tags added: verification-done-noble-linux

Title:
  md: nvme over tcp with a striped underlying md raid device leads to
  data corruption

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  Fix Committed

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2075110

  [Impact]

  There is a fault in the md subsystem where __write_sb_page() will
  round the io size up to the optimal size, but it doesn't check to see
  if the final io size exceeds the bitmap length.

  This gets us into a situation where, with an optimal io size of 256K, a
  write that should cover a single 4K bitmap page is rounded up to 256K,
  i.e. 64 pages. md_bitmap_storage_alloc() only allocates 1 page, so the
  other 63 pages are whatever happens to be allocated after it in memory.

  When we send md writes over the network, e.g. with nvme over tcp, the
  network subsystem checks that the first page passes sendpage_ok(), but
  not the other 63, which might not be sendpage_ok(). The send then gets
  stuck, causing a hang and data corruption.
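
  To make the arithmetic above concrete, here is a small standalone C
  sketch of the faulty size calculation (illustrative only, not the
  kernel code; the names and constants are made up for the example and
  reuse the 4K page size, 256K optimal io size and 1-page bitmap from
  the paragraph above):

    /* Sketch of the buggy rounding in __write_sb_page(): the io size is
     * rounded up to the optimal io size without being checked against
     * the bitmap length. */
    #include <stdio.h>

    #define PAGE_SZ      4096u
    #define OPT_IO_SZ    (256u * 1024)   /* optimal io size of the device */
    #define BITMAP_PAGES 1u              /* pages from md_bitmap_storage_alloc() */

    int main(void)
    {
        unsigned int size = BITMAP_PAGES * PAGE_SZ;

        /* Buggy behaviour: round up to the optimal io size without
         * checking the bitmap length. */
        unsigned int rounded = ((size + OPT_IO_SZ - 1) / OPT_IO_SZ) * OPT_IO_SZ;

        printf("bitmap pages allocated: %u\n", BITMAP_PAGES);
        printf("io size after rounding: %u bytes (%u pages)\n",
               rounded, rounded / PAGE_SZ);
        /* 64 pages get submitted, but only 1 is a bitmap page; the other
         * 63 are whatever happens to sit after it in memory. */
        return 0;
    }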

  If you trigger the issue, you get the following oops in dmesg:

  WARNING: CPU: 0 PID: 83 at net/core/skbuff.c:6995 
skb_splice_from_iter+0x139/0x370
  CPU: 0 PID: 83 Comm: kworker/0:1H Not tainted 6.8.0-39-generic #39-Ubuntu
  Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
  RIP: 0010:skb_splice_from_iter+0x139/0x370
  CR2: 000072dab83e5f84
  Call Trace:
   <TASK>
   ? show_regs+0x6d/0x80
   ? __warn+0x89/0x160
   ? skb_splice_from_iter+0x139/0x370
   ? report_bug+0x17e/0x1b0
   ? handle_bug+0x51/0xa0
   ? exc_invalid_op+0x18/0x80
   ? asm_exc_invalid_op+0x1b/0x20
   ? skb_splice_from_iter+0x139/0x370
   tcp_sendmsg_locked+0x352/0xd70
   ? tcp_push+0x159/0x190
   ? tcp_sendmsg_locked+0x9c4/0xd70
   tcp_sendmsg+0x2c/0x50
   inet_sendmsg+0x42/0x80
   sock_sendmsg+0x118/0x150
   nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
   ? __tcp_cleanup_rbuf+0xc5/0xe0
   nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
   nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
   process_one_work+0x16c/0x350
   worker_thread+0x306/0x440
   ? _raw_spin_unlock_irqrestore+0x11/0x60
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xef/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x44/0x70
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
   </TASK>
  nvme nvme1: failed to send request -5
  nvme nvme1: I/O tag 125 (307d) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
  nvme nvme1: starting error recovery
  block nvme1n1: no usable path - requeuing I/O
  nvme nvme1: Reconnecting in 10 seconds...

  There is no workaround.

  [Fix]

  This was fixed in the below commit in 6.11-rc1:

  commit ab99a87542f194f28e2364a42afbf9fb48b1c724
  Author: Ofir Gal <ofir....@volumez.com>
  Date:  Fri Jun 7 10:27:44 2024 +0300
  Subject: md/md-bitmap: fix writing non bitmap pages
  Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab99a87542f194f28e2364a42afbf9fb48b1c724

  This is a clean cherry-pick to the Noble tree.

  [Testcase]

  This can be reproduced by running blktests md/001 [1], which the
  author of the fix created to act as a regression test for this issue.

  [1]
  
https://github.com/osandov/blktests/commit/a24a7b462816fbad7dc6c175e53fcc764ad0a822

  Deploy a fresh Noble VM that has a scratch NVMe disk.

  $ sudo apt install build-essential fio
  $ git clone https://github.com/osandov/blktests.git
  $ cd blktests
  $ make
  $ echo "TEST_DEVS=(/dev/nvme0n1)" > config
  $ sudo ./check md/001

  The md/001 test will hang an affected system, and the above oops
  message will be visible in dmesg.

  A test kernel is available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf390669-test

  If you install the test kernel, the md/001 test will complete
  successfully, and the issue will no longer appear.

  [Where problems could occur]

  We are changing how the md subsystem calculates the final IO size,
  taking the smaller value of the size or the bitmap_limit. This makes
  sure we don't write pages beyond the bitmap and corrupt data.
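
  A minimal sketch of that idea, reusing the example values from the
  [Impact] section (illustrative only, not the actual patch; see commit
  ab99a87542f1 for the real change):

    /* Fixed behaviour: clamp the rounded-up io size to the bitmap limit,
     * so the write never extends past the pages the bitmap owns. */
    #include <stdio.h>

    #define PAGE_SZ      4096u
    #define OPT_IO_SZ    (256u * 1024)
    #define BITMAP_PAGES 1u

    static unsigned int min_u(unsigned int a, unsigned int b)
    {
        return a < b ? a : b;
    }

    int main(void)
    {
        unsigned int bitmap_limit = BITMAP_PAGES * PAGE_SZ;
        unsigned int rounded =
            ((bitmap_limit + OPT_IO_SZ - 1) / OPT_IO_SZ) * OPT_IO_SZ;

        /* Take the smaller of the rounded size and the bitmap limit. */
        unsigned int final_size = min_u(rounded, bitmap_limit);

        printf("final io size: %u bytes (%u page(s))\n",
               final_size, final_size / PAGE_SZ);
        return 0;
    }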

  If a regression were to occur, it would likely affect all md users, but
  would be most noticeable for md users whose underlying devices are
  accessed over the network, such as nvme over tcp.

  There is no workaround. Users would have to downgrade their kernels if
  a regression occurs.

  [Other info]

  I checked Jammy 5.15 and it works fine, so the issue must have been
  introduced later on. The fix is not needed for Focal or Jammy.
