Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/2075110

[Impact]

There is a fault in the md subsystem where __write_sb_page() rounds the
io size up to the optimal io size, but never checks whether the final io
size exceeds the bitmap length.

This gets us into a situation where, for example, 256K of io is
submitted, which needs 64 pages (at 4K per page), but
md_bitmap_storage_alloc() only allocated 1 page for the bitmap. The
write then includes 63 pages that merely happen to sit in memory after
the bitmap page.

When md sends these writes over the network, e.g. with nvme over tcp,
the network layer only checks the first page of the io with
sendpage_ok(); the other 63 pages are never checked, and if any of them
is not sendpage_ok(), the transfer gets stuck, causing a hang and data
corruption.
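
For reference, sendpage_ok() is a small helper in include/linux/net.h;
its check is roughly the following (paraphrased from the upstream
kernel for illustration, not quoted from the Noble tree):

static inline bool sendpage_ok(struct page *page)
{
        return !PageSlab(page) && page_count(page) >= 1;
}

The bitmap page itself passes this check, but the 63 trailing pages
swept into the same write can be slab pages or have a zero refcount,
which is what trips the warning below.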

If you trigger the issue, you get the following warning in dmesg:

WARNING: CPU: 0 PID: 83 at net/core/skbuff.c:6995 skb_splice_from_iter+0x139/0x370
CPU: 0 PID: 83 Comm: kworker/0:1H Not tainted 6.8.0-39-generic #39-Ubuntu
Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
RIP: 0010:skb_splice_from_iter+0x139/0x370
CR2: 000072dab83e5f84
Call Trace:
 <TASK>
 ? show_regs+0x6d/0x80
 ? __warn+0x89/0x160
 ? skb_splice_from_iter+0x139/0x370
 ? report_bug+0x17e/0x1b0
 ? handle_bug+0x51/0xa0
 ? exc_invalid_op+0x18/0x80
 ? asm_exc_invalid_op+0x1b/0x20
 ? skb_splice_from_iter+0x139/0x370
 tcp_sendmsg_locked+0x352/0xd70
 ? tcp_push+0x159/0x190
 ? tcp_sendmsg_locked+0x9c4/0xd70
 tcp_sendmsg+0x2c/0x50
 inet_sendmsg+0x42/0x80
 sock_sendmsg+0x118/0x150
 nvme_tcp_try_send_data+0x18b/0x4c0 [nvme_tcp]
 ? __tcp_cleanup_rbuf+0xc5/0xe0
 nvme_tcp_try_send+0x23c/0x300 [nvme_tcp]
 nvme_tcp_io_work+0x40/0xe0 [nvme_tcp]
 process_one_work+0x16c/0x350
 worker_thread+0x306/0x440
 ? _raw_spin_unlock_irqrestore+0x11/0x60
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xef/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x44/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
nvme nvme1: failed to send request -5
nvme nvme1: I/O tag 125 (307d) type 4 opcode 0x0 (I/O Cmd) QID 1 timeout
nvme nvme1: starting error recovery
block nvme1n1: no usable path - requeuing I/O
nvme nvme1: Reconnecting in 10 seconds...

There is no workaround.

[Fix]

This was fixed in the below commit in 6.11-rc1:

commit ab99a87542f194f28e2364a42afbf9fb48b1c724
Author: Ofir Gal <ofir....@volumez.com>
Date:  Fri Jun 7 10:27:44 2024 +0300
Subject: md/md-bitmap: fix writing non bitmap pages
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab99a87542f194f28e2364a42afbf9fb48b1c724

This is a clean cherry-pick to the Noble tree.
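
In essence, the commit clamps the write size in __write_sb_page() to
the remaining bitmap length before it is handed to md_super_write().
Paraphrasing rather than quoting the exact upstream diff, the change
amounts to:

-  md_super_write(mddev, rdev, sboff + ps, (int)size, page);
+  md_super_write(mddev, rdev, sboff + ps, (int)min(size, bitmap_limit), page);

where bitmap_limit is the number of bitmap bytes left from the current
page onwards, so the rounded-up io can never spill into pages that do
not belong to the bitmap.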

[Testcase]

This can be reproduced by running blktests md/001 [1], which the author
of the fix created to act as a regression test for this issue.

[1] https://github.com/osandov/blktests/commit/a24a7b462816fbad7dc6c175e53fcc764ad0a822

Deploy a fresh Noble VM that has a scratch NVMe disk.

$ sudo apt install build-essential fio
$ git clone https://github.com/osandov/blktests.git
$ cd blktests
$ make
$ echo "TEST_DEVS=(/dev/nvme0n1)" > config
$ sudo ./check md/001

The md/001 test will hang an affected system, and the above warning
will be visible in dmesg.

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf390669-test
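
One way to install it, assuming the usual PPA tooling; the kernel
package names below are placeholders, substitute the exact version
published in the PPA:

$ sudo add-apt-repository ppa:mruffell/sf390669-test
$ sudo apt update
$ sudo apt install linux-image-<version>-generic linux-modules-<version>-generic
$ sudo reboot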

If you install the test kernel, the md/001 test will complete
successfully, and the issue will no longer appear.

[Where problems could occur]

We are changing how the md subsystem calculates the final IO size,
taking the smaller of that size and the bitmap_limit. This makes sure
the write never extends past the bitmap into neighbouring pages, which
is what leaked memory into the write and corrupted data.

If a regression were to occur, it would likely affect all md users, but
it would be most visible to those running md over a network transport,
such as nvme over tcp.

There is no workaround. Users would have to downgrade their kernels if a
regression occurs.

[Other info]

I checked Jammy 5.15 and it works fine, so the issue must have been
introduced later on. The fix is not needed for Focal or Jammy.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Noble)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: noble sts

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu Noble)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Noble)
       Status: New => In Progress

** Changed in: linux (Ubuntu Noble)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: noble sts

Title:
  md: nvme over tcp with a striped underlying md raid device leads to
  data corruption
