On 12/10/25 14:07, Benjamin Marzinski wrote:
> On Mon, Dec 08, 2025 at 06:07:14PM +0800, Yongpeng Yang wrote:
>> From: Yongpeng Yang <[email protected]>
>>
>> Currently, the max_hw_discard_sectors of a stripe target is set to the
>> minimum max_hw_discard_sectors among all sub devices. When the discard
>> bio is larger than max_hw_discard_sectors, this may cause the stripe
>> device to split discard bios unnecessarily, because the value of
>> max_hw_discard_sectors affects max_discard_sectors, which equal to
>> min(max_hw_discard_sectors, max_user_discard_sectors).
>>
>> For example:
>> root@vm:~# echo '0 33554432 striped 2 256 /dev/vdd 0 /dev/vde 0' | dmsetup
>> create stripe_dev
>> root@vm:~# cat /sys/block/dm-1/queue/discard_max_bytes
>> 536870912
>> root@vm:~# cat /sys/block/dm-1/slaves/vdd/queue/discard_max_bytes
>> 536870912
>> root@vm:~# blkdiscard -o 0 -l 1073741824 -p 1073741824 /dev/mapper/stripe_dev
>>
>> dm-1 is the stripe device, and its discard_max_bytes is equal to
>> each sub device’s discard_max_bytes. Since the requested discard
>> length exceeds discard_max_bytes, the block layer splits the discard bio:
>>
>> block_bio_queue: 252,1 DS 0 + 2097152 [blkdiscard]
>> block_split: 252,1 DS 0 / 1048576 [blkdiscard]
>> block_rq_issue: 253,48 DS 268435456 () 0 + 524288 be,0,4 [blkdiscard]
>> block_bio_queue: 253,64 DS 524288 + 524288 [blkdiscard]
>>
>> However, both vdd and vde can actually handle a discard bio of 536870912
>> bytes, so this split is not necessary.
>>
>> This patch updates the stripe target’s q->limits.max_hw_discard_sectors
>> to be the minimum max_hw_discard_sectors of the sub devices multiplied
>> by the # of stripe devices. This enables the stripe device to handle
>> larger discard bios without incurring unnecessary splitting.
>>
>> Signed-off-by: Yongpeng Yang <[email protected]>
>> ---
>> drivers/md/dm-stripe.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c
>> index 1461dc740dae..799d6def699b 100644
>> --- a/drivers/md/dm-stripe.c
>> +++ b/drivers/md/dm-stripe.c
>> @@ -38,6 +38,9 @@ struct stripe_c {
>> uint32_t chunk_size;
>> int chunk_size_shift;
>>
>> + /* The minimum max_hw_discard_sectors of all sub devices. */
>> + unsigned int max_hw_discard_sectors;
>> +
>> /* Needed for handling events */
>> struct dm_target *ti;
>>
>> @@ -169,6 +172,8 @@ static int stripe_ctr(struct dm_target *ti, unsigned int
>> argc, char **argv)
>> * Get the stripe destinations.
>> */
>> for (i = 0; i < stripes; i++) {
>> + struct request_queue *q;
>> +
>> argv += 2;
>>
>> r = get_stripe(ti, sc, i, argv);
>> @@ -180,6 +185,13 @@ static int stripe_ctr(struct dm_target *ti, unsigned
>> int argc, char **argv)
>> return r;
>> }
>> atomic_set(&(sc->stripe[i].error_count), 0);
>> +
>> + q = bdev_get_queue(sc->stripe[i].dev->bdev);
>> + if (i == 0)
>> + sc->max_hw_discard_sectors =
>> q->limits.max_hw_discard_sectors;
>> + else
>> + sc->max_hw_discard_sectors =
>> min_not_zero(sc->max_hw_discard_sectors,
>> + q->limits.max_hw_discard_sectors);
>
> I don't think any of the above is necessary. When stripe_io_hints() is
> called, dm_set_device_limits() will already have been called on all the
> underlying stripe devices to combine the limits. So
> limits->max_hw_discard_sectors should already be set to the same value
> as you're computing for sc->max_hw_discard_sectors. Right?
dm_set_device_limits() initializes the stack-local ti_limits defined in
dm_calculate_queue_limits(), and stripe_io_hints() modifies this same
variable as well. These updates do not affect the stripe device’s
q->limits until dm_calculate_queue_limits() later calls
blk_stack_limits(). Therefore, the q->limits.max_hw_discard_sectors of
each sub device in a stripe target is never modified and it is necessary
to record the minimum q->limits.max_hw_discard_sectors
across all sub devices.
>
>> }
>>
>> ti->private = sc;
>> @@ -456,7 +468,7 @@ static void stripe_io_hints(struct dm_target *ti,
>> struct queue_limits *limits)
>> {
>> struct stripe_c *sc = ti->private;
>> - unsigned int io_min, io_opt;
>> + unsigned int io_min, io_opt, max_hw_discard_sectors;
>>
>> limits->chunk_sectors = sc->chunk_size;
>>
>> @@ -465,6 +477,8 @@ static void stripe_io_hints(struct dm_target *ti,
>> limits->io_min = io_min;
>> limits->io_opt = io_opt;
>> }
>> + if (!check_mul_overflow(sc->max_hw_discard_sectors, sc->stripes,
>> &max_hw_discard_sectors))
>
> sc->max_hw_discard_sectors should be the same as
> limits->max_hw_discard_sectors here
>
>> + limits->max_hw_discard_sectors = max_hw_discard_sectors;
>
> I see a couple of issues with this calculation. First, this only works
> if max_hw_discard_sectors is greater than sc->chunk_size. But more than
> that, before you multiply the original limit by the number of stripes,
> you must round it down to a multiple of chunk_size. Otherwise, you can
> end up writing too much to some of the underlying devices. To show this
> with simple numbers, imagine you had 3 stripes that with a chunk_size of
> 4 and all the underlying devices had a max_hw_discard_sectors of 5. You
> would set max_hw_discard_sectors to 5 * 3 = 15. But if you discared 15
> sectors from the beginning of the device, you would end up discarding
> the first 4 sectors of each underlying device, and then loop around
> discard 3 more sectors of the first device. This means that you would
> discard 7 from the first device, instead of the 5 that it could handle.
> Rounding max_hw_discard_sectors down to a multiple of chunk_size will
> fix that.
The splitting of discard bios is based on limits->max_discard_sectors,
which is independent of the stripe size. As a result, the discard bio
size passed to stripe_map->stripe_map_range() may exceed
`sc->chunk_size * sc->stripes`. In the example above, stripe_map_range()
is invoked three times, and each invocation reduces bio->bi_iter.bi_size
to 5 sectors for each sub devices.
>
> Lastly, if you do overflow when multiplying max_hw_discard_sectors by
> the number of stripes, you should probably just set
> limits->max_hw_discard_sectors to UINT_MAX >> SECTOR_SHIFT, instead of
> leaving it at what it was.
Yes, I overlooked that. I’ll fix it in v2.
Thanks
Yongpeng,
>
> -Ben
>
>> }
>>
>> static struct target_type stripe_target = {
>> --
>> 2.43.0
>