On Fri, Apr 11, 2025 at 09:04:35AM +0900, Damien Le Moal wrote: > When a dm-delay device is being suspended, the .presuspend() operation > is first executed (delay_presuspend()) to immediately issue all the BIOs > present in the delayed list of the device and also sets the device > may_delay boolean to false. At the same time, if any new BIO is issued > to the device will not be delayed and immediately issued with > delay_bio() returning DM_MAPIO_REMAPPED. This creates a situation where > potentially 2 different contexts may be issuing write BIOs to the same > zone of a zone device without respecting the issuing order from the > user, that is, a newly issued write BIO may be issued before other write > BIOs for the same target zone that are in the device delayed list. If > such situation occurs, write BIOs may be failed by the underlying zoned > device due to an unaligned write error. > > Prevent this situation from happening by always handling newly issued > write BIOs using the delayed list of BIOs, even when the device is being > suspended. This is done by forcing the use of the worker kthread for > zoned devices, and by modifying flush_worker_fn() to always flush all > delayed BIOs if the device may_delay boolean is false. > > Reported-by: Benjamin Marzinski <bmarz...@redhat.com> > Fixes: d43929ef65a6 ("dm-delay: support zoned devices") > Signed-off-by: Damien Le Moal <dlem...@kernel.org>
This looks fine, other than the issue that flush_worker_fn() busy-waits for an IO to be ready, and now it could do that for much longer. But I just submitted a patch to fix that, so: Reviewed-by: Benjamin Marzinski <bmarz...@redhat.com> > --- > Changes from v1: > - Fixed typo in commit message > - Added reported-by tag > > drivers/md/dm-delay.c | 29 +++++++++++++++++++++-------- > 1 file changed, 21 insertions(+), 8 deletions(-) > > diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c > index d4cf0ac2a7aa..c665b2ab1115 100644 > --- a/drivers/md/dm-delay.c > +++ b/drivers/md/dm-delay.c > @@ -128,7 +128,7 @@ static int flush_worker_fn(void *data) > struct delay_c *dc = data; > > while (!kthread_should_stop()) { > - flush_delayed_bios(dc, false); > + flush_delayed_bios(dc, !dc->may_delay); > spin_lock(&dc->delayed_bios_lock); > if (unlikely(list_empty(&dc->delayed_bios))) { > set_current_state(TASK_INTERRUPTIBLE); > @@ -213,6 +213,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > struct delay_c *dc; > int ret; > unsigned int max_delay; > + bool is_zoned = false; > > if (argc != 3 && argc != 6 && argc != 9) { > ti->error = "Requires exactly 3, 6 or 9 arguments"; > @@ -236,6 +237,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = dc->read.delay; > + is_zoned = bdev_is_zoned(dc->read.dev->bdev); > > if (argc == 3) { > ret = delay_class_ctr(ti, &dc->write, argv); > @@ -251,6 +253,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = max(max_delay, dc->write.delay); > + is_zoned = is_zoned || bdev_is_zoned(dc->write.dev->bdev); > > if (argc == 6) { > ret = delay_class_ctr(ti, &dc->flush, argv + 3); > @@ -263,13 +266,16 @@ static int delay_ctr(struct dm_target *ti, unsigned int > argc, char **argv) > if (ret) > goto bad; > max_delay = max(max_delay, dc->flush.delay); > + is_zoned = is_zoned || bdev_is_zoned(dc->flush.dev->bdev); > > out: > - if (max_delay < 50) { > - /* > - * In case of small requested delays, use kthread instead of > - * timers and workqueue to achieve better latency. > - */ > + /* > + * In case of small requested delays, use kthread instead of timers and > + * workqueue to achieve better latency. Also use a kthread for a zoned > + * device so that we can preserve the order of write operations during > + * suspend. > + */ > + if (max_delay < 50 || is_zoned) { > dc->worker = kthread_run(&flush_worker_fn, dc, > "dm-delay-flush-worker"); > if (IS_ERR(dc->worker)) { > ret = PTR_ERR(dc->worker); > @@ -313,8 +319,15 @@ static int delay_bio(struct delay_c *dc, struct > delay_class *c, struct bio *bio) > > spin_lock(&dc->delayed_bios_lock); > if (unlikely(!dc->may_delay)) { > - spin_unlock(&dc->delayed_bios_lock); > - return DM_MAPIO_REMAPPED; > + /* > + * Issue the BIO immediately if the device is not zoned. FOr a > + * zoned device, preserver the ordering of write operations by > + * using the delay list. > + */ > + if (!bdev_is_zoned(c->dev->bdev) || c != &dc->write) { > + spin_unlock(&dc->delayed_bios_lock); > + return DM_MAPIO_REMAPPED; > + } > } > c->ops++; > list_add_tail(&delayed->list, &dc->delayed_bios); > -- > 2.49.0