Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common use cases
which invoke block discard, such as mkfs and fstrim operations, to take
a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe
devices that support block discard, a mkfs.xfs operation on Raid 10
takes between 8 and 11 minutes, whereas the same mkfs.xfs operation on
Raid 0 takes 4 seconds.

The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses this
for the discard_max_bytes value. If we need to discard 1.9TB, the kernel
splits the request into millions of 512k bio requests, even if the
underlying device supports larger requests.
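
As a rough sanity check of the scale (simple arithmetic, taking 1.9 TB
and 512 KiB in bytes), that is about 3.6 million bios:

$ echo $((1900000000000 / 524288))
3623962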

For example, the NVMe devices on the i3.8xlarge support discarding
2.2TB at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

The Raid10 md device, on the other hand, only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288
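
For convenience, the limits of the md device and its members can be
compared in one go (device names as in the [Testcase] section below):

$ grep . /sys/block/{md0,nvme0n1,nvme1n1,nvme2n1,nvme3n1}/queue/discard_max_bytes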

If we perform a mkfs.xfs operation on the /dev/md array, it takes over
11 minutes, and if we examine the stack, we can see it is stuck in
blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
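
For reference, the PID used above (1626 in this example) can be found
with something like:

$ pgrep mkfs.xfs
1626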

[Fix]

Xiao Ni has developed a patchset which resolves the block discard
performance problems. It is currently in the md-next tree [1], and I am
expecting the commits to be merged during the 5.10 merge window.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-next
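
For reference, the tree can be fetched with something like the
following (branch name as per the link above):

$ git clone --single-branch --branch md-next https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git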

commit 5b2374a6c221f28c74913d208bb5376a7ee3bf70
Author: Xiao Ni <x...@redhat.com>
Date: Wed Sep 2 20:00:23 2020 +0800
Subject: md/raid10: improve discard request for far layout
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=5b2374a6c221f28c74913d208bb5376a7ee3bf70

commit 8f694215ae4c7abf1e6c985803a1aad0db748d07
Author: Xiao Ni <x...@redhat.com>
Date: Wed Sep 2 20:00:22 2020 +0800
Subject: md/raid10: improve raid10 discard request
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=8f694215ae4c7abf1e6c985803a1aad0db748d07

commit 6fcfa8732a8cfea7828a9444c855691c481ee557
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:43:01 2020 +0800
Subject: md/raid10: pull codes that wait for blocked dev into one function
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6fcfa8732a8cfea7828a9444c855691c481ee557

commit 6f4fed152a5e483af2227156ce7b6263aeeb5c84
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:43:00 2020 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=6f4fed152a5e483af2227156ce7b6263aeeb5c84

commit 7197f1a616caf85508d81c7f5c9f065ffaebf027
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:42:59 2020 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?h=md-next&id=7197f1a616caf85508d81c7f5c9f065ffaebf027

It follows a strategy similar to the one implemented for Raid0 in the
commit below, which was merged in 4.12-rc2:

commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
Author: Shaohua Li <s...@fb.com>
Date: Sun May 7 17:36:24 2017 -0700
Subject: md/md0: optimize raid0 discard handling
Link: 
https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
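
As a rough back-of-envelope illustration (my own numbers, not taken
from the patchset itself): with the patches, each member device should
receive on the order of one large discard bio, bounded only by its own
discard_max_bytes, instead of one bio per 512k chunk:

$ echo "per-device bios: before ~$((1900000000000 / 524288)), after ~1"
per-device bios: before ~3623962, after ~1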

[Testcase]

You will need a machine with at least 4x NVMe drives which support block
discard. I use an i3.8xlarge instance on AWS, since it meets these
requirements.

$ lsblk
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /
nvme0n1 259:2    0  1.7T  0 disk
nvme1n1 259:0    0  1.7T  0 disk
nvme2n1 259:1    0  1.7T  0 disk
nvme3n1 259:3    0  1.7T  0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
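
Once created, the array state and initial resync progress can be
checked with:

$ cat /proc/mdstat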

Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, do an fstrim:

$ time sudo fstrim /mnt/disk

real    11m37.643s
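
Before repeating the test on a patched kernel, tear down the array and
wipe the member superblocks (a typical cleanup; adjust device and array
names as needed):

$ sudo umount /mnt/disk
$ sudo mdadm --stop /dev/md0
$ sudo mdadm --zero-superblock /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1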

I built a test kernel based on 5.9-rc6 with the above patches, and we
can see that performance improves dramatically:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

$ time sudo mkfs.xfs /dev/md0
real    0m4.226s
user    0m0.020s
sys     0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real    0m1.991s
user    0m0.020s
sys     0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
from 11 minutes to 2 seconds.

[Regression Potential]

If a regression were to occur, it would only affect operations that
trigger block discard, such as mkfs and fstrim, and only on Raid10.

Other Raid levels would not be affected, although I should note there
is a small risk of regression for Raid0, since one of its functions has
been refactored and split out for use by both Raid0 and Raid10.

The changes only affect block discard, so only Raid10 arrays backed by
SSD or NVMe devices that support block discard are affected.
Traditional hard disks, and SSDs that do not support block discard, are
not affected.

If a regression were to occur, users could work around the issue by
running "mkfs.xfs -K <device>", which skips block discard entirely.

** Affects: linux (Ubuntu)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress

** Affects: linux (Ubuntu Bionic)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress

** Affects: linux (Ubuntu Focal)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress

** Affects: linux (Ubuntu Groovy)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress


** Tags: sts

** Also affects: linux (Ubuntu Groovy)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Focal)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Bionic)
       Status: New => In Progress

** Changed in: linux (Ubuntu Focal)
       Status: New => In Progress

** Changed in: linux (Ubuntu Groovy)
       Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Focal)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Groovy)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Focal)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Changed in: linux (Ubuntu Groovy)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Tags added: sts
