When a dm-delay device is being suspended, the .presuspend() operation is first executed (delay_presuspend()) to immediately issue all the BIOs present in the delayed list of the device and also sets the device may_delay boolean to false. At the same time, if any new BIO is issued to the device will not be delayed and immediately issued with delay_bio() returning DM_MAPIO_REMAPPED. This creates a situation where potentially 2 different contexts may be issuing write BIOs to the same zone of a zone device without respecting the issuing order from the user, that is, a newly issued write BIO may be issued before other write BIOs for the same target zone that are in the device delayed list. If such situation occurs, write BIOs may be failed by the underlying zoned device due to an unaligned write error.
Prevent this situation from happening by always handling newly issued write BIOs using the delayed list of BIOs, even when the device is being suspended. This is done by forcing the use of the worker kthread for zoned devices, and by modifying flush_worker_fn() to always flush all delayed BIOs if the device may_delay boolean is false. Reported-by: Benjamin Marzinski <bmarz...@redhat.com> Fixes: d43929ef65a6 ("dm-delay: support zoned devices") Signed-off-by: Damien Le Moal <dlem...@kernel.org> --- Changes from v1: - Fixed typo in commit message - Added reported-by tag drivers/md/dm-delay.c | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c index d4cf0ac2a7aa..c665b2ab1115 100644 --- a/drivers/md/dm-delay.c +++ b/drivers/md/dm-delay.c @@ -128,7 +128,7 @@ static int flush_worker_fn(void *data) struct delay_c *dc = data; while (!kthread_should_stop()) { - flush_delayed_bios(dc, false); + flush_delayed_bios(dc, !dc->may_delay); spin_lock(&dc->delayed_bios_lock); if (unlikely(list_empty(&dc->delayed_bios))) { set_current_state(TASK_INTERRUPTIBLE); @@ -213,6 +213,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv) struct delay_c *dc; int ret; unsigned int max_delay; + bool is_zoned = false; if (argc != 3 && argc != 6 && argc != 9) { ti->error = "Requires exactly 3, 6 or 9 arguments"; @@ -236,6 +237,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv) if (ret) goto bad; max_delay = dc->read.delay; + is_zoned = bdev_is_zoned(dc->read.dev->bdev); if (argc == 3) { ret = delay_class_ctr(ti, &dc->write, argv); @@ -251,6 +253,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv) if (ret) goto bad; max_delay = max(max_delay, dc->write.delay); + is_zoned = is_zoned || bdev_is_zoned(dc->write.dev->bdev); if (argc == 6) { ret = delay_class_ctr(ti, &dc->flush, argv + 3); @@ -263,13 +266,16 @@ static int delay_ctr(struct dm_target *ti, unsigned int argc, char **argv) if (ret) goto bad; max_delay = max(max_delay, dc->flush.delay); + is_zoned = is_zoned || bdev_is_zoned(dc->flush.dev->bdev); out: - if (max_delay < 50) { - /* - * In case of small requested delays, use kthread instead of - * timers and workqueue to achieve better latency. - */ + /* + * In case of small requested delays, use kthread instead of timers and + * workqueue to achieve better latency. Also use a kthread for a zoned + * device so that we can preserve the order of write operations during + * suspend. + */ + if (max_delay < 50 || is_zoned) { dc->worker = kthread_run(&flush_worker_fn, dc, "dm-delay-flush-worker"); if (IS_ERR(dc->worker)) { ret = PTR_ERR(dc->worker); @@ -313,8 +319,15 @@ static int delay_bio(struct delay_c *dc, struct delay_class *c, struct bio *bio) spin_lock(&dc->delayed_bios_lock); if (unlikely(!dc->may_delay)) { - spin_unlock(&dc->delayed_bios_lock); - return DM_MAPIO_REMAPPED; + /* + * Issue the BIO immediately if the device is not zoned. FOr a + * zoned device, preserver the ordering of write operations by + * using the delay list. + */ + if (!bdev_is_zoned(c->dev->bdev) || c != &dc->write) { + spin_unlock(&dc->delayed_bios_lock); + return DM_MAPIO_REMAPPED; + } } c->ops++; list_add_tail(&delayed->list, &dc->delayed_bios); -- 2.49.0