On Fri, Apr 11, 2025 at 09:04:35AM +0900, Damien Le Moal wrote:
> When a dm-delay device is being suspended, the .presuspend() operation
> is first executed (delay_presuspend()) to immediately issue all the BIOs
> present in the delayed list of the device and also sets the device
> may_delay boolean to false. At the same time, if any new BIO is issued
> to the device will not be delayed and immediately issued with
> delay_bio() returning DM_MAPIO_REMAPPED. This creates a situation where
> potentially 2 different contexts may be issuing write BIOs to the same
> zone of a zone device without respecting the issuing order from the
> user, that is, a newly issued write BIO may be issued before other write
> BIOs for the same target zone that are in the device delayed list. If
> such situation occurs, write BIOs may be failed by the underlying zoned
> device due to an unaligned write error.
> 
> Prevent this situation from happening by always handling newly issued
> write BIOs using the delayed list of BIOs, even when the device is being
> suspended. This is done by forcing the use of the worker kthread for
> zoned devices, and by modifying flush_worker_fn() to always flush all
> delayed BIOs if the device may_delay boolean is false.
> 
> Reported-by: Benjamin Marzinski <bmarz...@redhat.com>
> Fixes: d43929ef65a6 ("dm-delay: support zoned devices")
> Signed-off-by: Damien Le Moal <dlem...@kernel.org>

This looks fine, other than the issue that flush_worker_fn() busy-waits
for an IO to be ready, and now it could do that for much longer. But
I just submitted a patch to fix that, so:

Reviewed-by: Benjamin Marzinski <bmarz...@redhat.com>

> ---
> Changes from v1:
>  - Fixed typo in commit message
>  - Added reported-by tag
> 
>  drivers/md/dm-delay.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
> index d4cf0ac2a7aa..c665b2ab1115 100644
> --- a/drivers/md/dm-delay.c
> +++ b/drivers/md/dm-delay.c
> @@ -128,7 +128,7 @@ static int flush_worker_fn(void *data)
>       struct delay_c *dc = data;
>  
>       while (!kthread_should_stop()) {
> -             flush_delayed_bios(dc, false);
> +             flush_delayed_bios(dc, !dc->may_delay);
>               spin_lock(&dc->delayed_bios_lock);
>               if (unlikely(list_empty(&dc->delayed_bios))) {
>                       set_current_state(TASK_INTERRUPTIBLE);
> @@ -213,6 +213,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>       struct delay_c *dc;
>       int ret;
>       unsigned int max_delay;
> +     bool is_zoned = false;
>  
>       if (argc != 3 && argc != 6 && argc != 9) {
>               ti->error = "Requires exactly 3, 6 or 9 arguments";
> @@ -236,6 +237,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>       if (ret)
>               goto bad;
>       max_delay = dc->read.delay;
> +     is_zoned = bdev_is_zoned(dc->read.dev->bdev);
>  
>       if (argc == 3) {
>               ret = delay_class_ctr(ti, &dc->write, argv);
> @@ -251,6 +253,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>       if (ret)
>               goto bad;
>       max_delay = max(max_delay, dc->write.delay);
> +     is_zoned = is_zoned || bdev_is_zoned(dc->write.dev->bdev);
>  
>       if (argc == 6) {
>               ret = delay_class_ctr(ti, &dc->flush, argv + 3);
> @@ -263,13 +266,16 @@ static int delay_ctr(struct dm_target *ti, unsigned int 
> argc, char **argv)
>       if (ret)
>               goto bad;
>       max_delay = max(max_delay, dc->flush.delay);
> +     is_zoned = is_zoned || bdev_is_zoned(dc->flush.dev->bdev);
>  
>  out:
> -     if (max_delay < 50) {
> -             /*
> -              * In case of small requested delays, use kthread instead of
> -              * timers and workqueue to achieve better latency.
> -              */
> +     /*
> +      * In case of small requested delays, use kthread instead of timers and
> +      * workqueue to achieve better latency. Also use a kthread for a zoned
> +      * device so that we can preserve the order of write operations during
> +      * suspend.
> +      */
> +     if (max_delay < 50 || is_zoned) {
>               dc->worker = kthread_run(&flush_worker_fn, dc, 
> "dm-delay-flush-worker");
>               if (IS_ERR(dc->worker)) {
>                       ret = PTR_ERR(dc->worker);
> @@ -313,8 +319,15 @@ static int delay_bio(struct delay_c *dc, struct 
> delay_class *c, struct bio *bio)
>  
>       spin_lock(&dc->delayed_bios_lock);
>       if (unlikely(!dc->may_delay)) {
> -             spin_unlock(&dc->delayed_bios_lock);
> -             return DM_MAPIO_REMAPPED;
> +             /*
> +              * Issue the BIO immediately if the device is not zoned. FOr a
> +              * zoned device, preserver the ordering of write operations by
> +              * using the delay list.
> +              */
> +             if (!bdev_is_zoned(c->dev->bdev) || c != &dc->write) {
> +                     spin_unlock(&dc->delayed_bios_lock);
> +                     return DM_MAPIO_REMAPPED;
> +             }
>       }
>       c->ops++;
>       list_add_tail(&delayed->list, &dc->delayed_bios);
> -- 
> 2.49.0


Reply via email to