Resending, since somehow, I didn't reply-to-all last time.

On Thu, Dec 11, 2025 at 03:03:12PM +0800, Yongpeng Yang wrote:
> On 12/10/25 14:07, Benjamin Marzinski wrote:
> > On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote:
> >> From: Yongpeng Yang <[email protected]>
> >>
> >> Currently, the max_hw_discard_sectors of a stripe target is set to the
> >> minimum max_hw_discard_sectors among all sub devices. When the discard
> >> bio is larger than max_hw_discard_sectors, this may cause the stripe
> >> device to split discard bios unnecessarily, because the value of
> >> max_hw_discard_sectors affects max_discard_sectors, which is equal to
> >> min(max_hw_discard_sectors, max_user_discard_sectors).
> >>
> >> For example:
> >> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup create stripe_dev
> >> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes
> >> 536870912
> >> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes
> >> 536870912
> >> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev
> >>
> >> dm-1 is the stripe device, and its discard_max_bytes is equal to
> >> each sub device’s discard_max_bytes. Since the requested discard
> >> length exceeds discard_max_bytes, the block layer splits the discard bio:
> >>
> >> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard]
> >> block_split: 252,1 DS 0 / 1048576 [blkdiscard]
> >> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard]
> >> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard]
> >>
> >> However, both vdd and vde can actually handle a discard bio of 536870912
> >> bytes, so this split is not necessary.
> >>
> >> This patch updates the stripe target’s q->limits.max_hw_discard_sectors
> >> to be the minimum max_hw_discard_sectors of the sub devices multiplied
> >> by the number of stripes. This enables the stripe device to handle
> >> larger discard bios without incurring unnecessary splitting.
> >>
> >> Signed-off-by: Yongpeng Yang <[email protected]>
> >> ---
> >>  drivers/md/dm-stripe.c | 16 +++++++++++++++-
> >>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
> >> index 1461dc740dae..799d6def699b 100644
> >> --- a/drivers/md/dm-stripe.c
> >> +++ b/drivers/md/dm-stripe.c
> >> @@ -38,6 +38,9 @@ struct stripe_c {
> >>    uint32_t chunk_size;
> >>    int chunk_size_shift;
> >>  
> >> +  /* The minimum max_hw_discard_sectors of all sub devices. */
> >> +  unsigned int max_hw_discard_sectors;
> >> +
> >>    /* Needed for handling events */
> >>    struct dm_target *ti;
> >>  
> >> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >>     * Get the stripe destinations.
> >>     */
> >>    for (i = 0; i < stripes; i++) {
> >> +          struct request_queue *q;
> >> +
> >>            argv += 2;
> >>  
> >>            r = get_stripe(ti, sc, i, argv);
> >> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >>                    return r;
> >>            }
> >>            atomic_set(&(sc->stripe[i].error_count), 0);
> >> +
> >> +          q = bdev_get_queue(sc->stripe[i].dev->bdev);
> >> +          if (i == 0)
> >> +                  sc->max_hw_discard_sectors = q->limits.max_hw_discard_sectors;
> >> +          else
> >> +                  sc->max_hw_discard_sectors = min_not_zero(sc->max_hw_discard_sectors,
> >> +                                  q->limits.max_hw_discard_sectors);
> > 
> > I don't think any of the above is necessary. When stripe_io_hints() is
> > called, dm_set_device_limits() will already have been called on all the
> > underlying stripe devices to combine the limits. So
> > limits->max_hw_discard_sectors should already be set to the same value
> > as you're computing for sc->max_hw_discard_sectors. Right?
> 
> dm_set_device_limits() initializes the stack-local ti_limits defined in
> dm_calculate_queue_limits(), and stripe_io_hints() modifies this same
> variable as well.

Yep. That's why it's o.k. to use limits->max_hw_discard_sectors in
stripe_io_hints(). dm_set_device_limits() -> blk_stack_limits()
is called on all of the underlying stripe devices before
stripe_io_hints() is called. blk_stack_limits() includes the line

t->max_hw_discard_sectors = min_not_zero(t->max_hw_discard_sectors,
                                         b->max_hw_discard_sectors);

which updates ti_limits.max_hw_discard_sectors. ti_limits is then passed
into stripe_io_hints(), and has the same value that you compute in
stripe_ctr().
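
To be concrete about the stacking semantics: min_not_zero() treats a
value of 0 as "unset" (e.g. a device with discards disabled), so an
unset limit never clobbers a real one. Roughly:

	min_not_zero(0, 1048576);       /* 1048576: the zero is ignored */
	min_not_zero(1048576, 524288);  /* 524288: otherwise a plain min() */

So stacking your identical vdd and vde limits leaves
ti_limits.max_hw_discard_sectors at exactly the per-device value.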

> These updates do not affect the stripe device’s
> q->limits until dm_calculate_queue_limits() later calls
> blk_stack_limits().

Actually, we don't update q->limits for the stripe device itself until
later, via __bind() -> dm_table_set_restrictions() ->
queue_limits_commit_update(). But we're not checking the queue limits of
the stripe device. We are checking the queue limits of the underlying
devices. Those devices are completely set up before we create the dm
device on top of them.

> Therefore, the q->limits.max_hw_discard_sectors of
> each sub device in a stripe target is never modified and it is necessary
> to record the minimum q->limits.max_hw_discard_sectors
> across all sub devices.

You are checking the same underlying q->limits.max_hw_discard_sectors of
each sub device in stripe_ctr(). stripe_ctr() is called when we set up
the table, well before we call dm_calculate_queue_limits() as part
of making the table active in dm_swap_table(). So if the
q->limits.max_hw_discard_sectors of the underlying devices is correct in
stripe_ctr(), it surely must be correct when dm_set_device_limits()
is called on them.

If you don't believe me, try adding
DMERR("%u vs %u", sc->max_hw_discard_sectors, limits->max_hw_discard_sectors);

to stripe_io_hints(). It should show that they have the same number. If
it doesn't, I apologize for all this noise.
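
Something like this at the top of stripe_io_hints(), just a throwaway
debugging sketch against your patch (not for merging):

static void stripe_io_hints(struct dm_target *ti,
			    struct queue_limits *limits)
{
	struct stripe_c *sc = ti->private;

	/* these two should already agree by the time we get here */
	DMERR("%u vs %u", sc->max_hw_discard_sectors,
	      limits->max_hw_discard_sectors);
	...
}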
 
> > 
> >>    }
> >>  
> >>    ti->private = sc;
> >> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti,
> >>                        struct queue_limits *limits)
> >>  {
> >>    struct stripe_c *sc = ti->private;
> >> -  unsigned int io_min, io_opt;
> >> +  unsigned int io_min, io_opt, max_hw_discard_sectors;
> >>  
> >>    limits->chunk_sectors = sc->chunk_size;
> >>  
> >> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti,
> >>            limits->io_min = io_min;
> >>            limits->io_opt = io_opt;
> >>    }
> >> +  if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes, &max_hw_discard_sectors))
> > 
> > sc->max_hw_discard_sectors should be the same as
> > limits->max_hw_discard_sectors here 
> > 
> >> +          limits->max_hw_discard_sectors = max_hw_discard_sectors;
> > 
> > I see a couple of issues with this calculation. First, this only works
> > if max_hw_discard_sectors is greater than sc->chunk_size. But more than
> > that, before you multiply the original limit by the number of stripes,
> > you must round it down to a multiple of chunk_size. Otherwise, you can
> > end up discarding too much from some of the underlying devices. To show
> > this with simple numbers, imagine you had 3 stripes with a chunk_size of
> > 4 and all the underlying devices had a max_hw_discard_sectors of 5. You
> > would set max_hw_discard_sectors to 5 * 3 = 15. But if you discarded 15
> > sectors from the beginning of the device, you would end up discarding
> > the first 4 sectors of each underlying device, and then loop around and
> > discard 3 more sectors of the first device. This means that you would
> > discard 7 from the first device, instead of the 5 that it could handle.
> > Rounding max_hw_discard_sectors down to a multiple of chunk_size will
> > fix that.
> 
> The splitting of discard bios is based on limits->max_discard_sectors,
> which is independent of the stripe size. As a result, the discard bio
> size passed to stripe_map->stripe_map_range() may exceed
> `sc->chunk_size * sc->stripes`.

Of course this is true, but my example still holds if you had a
max_hw_discard_sectors of 9 instead. 9 * 3 = 27, so you would discard
the first 8 sectors (two chunks) of each stripe device, and then 3 more
sectors of the first stripe device. This means that you would send a
discard of 11 sectors to the first device, instead of the 9 that it can
handle at once.
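
In other words, I'd expect v2 to want something along these lines in
stripe_io_hints(). This is just an untested sketch: note rounddown()
rather than round_down(), since IIRC chunk_size isn't required to be a
power of two here, plus the UINT_MAX >> SECTOR_SHIFT clamp from my
earlier comment:

	unsigned int max_hw_discard_sectors;

	/*
	 * Round the per-device limit down to a whole number of chunks
	 * before scaling by the stripe count, so that a maximal discard
	 * can never wrap around the stripes and send more than the
	 * per-device limit to any one underlying device.
	 */
	max_hw_discard_sectors = rounddown(limits->max_hw_discard_sectors,
					   sc->chunk_size);
	if (check_mul_overflow(max_hw_discard_sectors, sc->stripes,
			       &max_hw_discard_sectors))
		max_hw_discard_sectors = UINT_MAX >> SECTOR_SHIFT;
	/* if the limit was smaller than one chunk, leave it alone */
	if (max_hw_discard_sectors)
		limits->max_hw_discard_sectors = max_hw_discard_sectors;

With my 3-stripe example that gives rounddown(9, 4) * 3 = 24, and a 24
sector discard sends exactly 8 sectors to each underlying device.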

> In the example above, stripe_map_range()
> is invoked three times, and each invocation reduces bio->bi_iter.bi_size
> to 5 sectors for each sub device.

If it did that, that would be a bug.  You were intending to discard
the first 15 sectors of the dm device, but instead, you would be
discarding the first 13 sectors (0-12), plus sector 16 and sector 20.

It's easy to test my original thought experiment using a scsi debug
device with a discard granularity of 1 and a max_hw_discard_sectors of
5, and a stripe device with a 4 sector stripe size. Here's the test:

# modprobe scsi_debug dev_size_mb=128 max_luns=3 lbpu=1 unmap_max_blocks=5
# echo '0 786432 striped 3 4 /dev/sda 0 /dev/sdb 0 /dev/sdc 0' | dmsetup create stripe_dev
# blkdiscard -o 0 -l 7680 /dev/mapper/stripe_dev

And here are the perf results:

block:block_bio_queue: 253,1 DS 0 + 15 [blkdiscard]
block:block_bio_remap: 8,0 DS 0 + 7 <- (253,1) 0
block:block_bio_queue: 8,0 DS 0 + 7 [blkdiscard]
block:block_bio_remap: 8,16 DS 0 + 4 <- (253,1) 0
block:block_bio_queue: 8,16 DS 0 + 4 [blkdiscard]
block:block_bio_remap: 8,32 DS 0 + 4 <- (253,1) 0
block:block_bio_queue: 8,32 DS 0 + 4 [blkdiscard]
    block:block_split: 8,0 DS 0 / 5 [blkdiscard] 
 block:block_rq_issue: 8,0 DS 2560 () 0 + 5 0x2,0,4 [kworker/5:1H]
 block:block_rq_issue: 8,0 DS 1024 () 5 + 2 0x2,0,4 [kworker/5:1H]
 block:block_rq_issue: 8,16 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] 
 block:block_rq_issue: 8,32 DS 2048 () 0 + 4 0x2,0,4 [kworker/5:1H] 

As you can see, it sends 7 sectors down to sda (8:0): sectors 0-3 from
the first chunk, plus sectors 4-6 from the discard wrapping back around
to the first stripe. That is too large, so it needs to get split by the
scsi device.

-Ben

> > 
> > Lastly, if you do overflow when multiplying max_hw_discard_sectors by
> > the number of stripes, you should probably just set
> > limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of
> > leaving it at what it was.
> 
> Yes, I overlooked that. I’ll fix it in v2.
> 
> Thanks
> Yongpeng,
> 
> > 
> > -Ben
> > 
> >>  }
> >>  
> >>  static struct target_type stripe_target = {
> >> -- 
> >> 2.43.0
> > 

