Read and write operations issued to a dm-crypt target may be split according to the dm-crypt internal limits defined by the max_read_size and max_write_size module parameters (default is 128 KB). The intent is to improve processing time of large BIOs by splitting them into smaller operations that can be parallelized on different CPUs.
For zoned dm-crypt targets, this BIO splitting is still done but without the parallel execution to ensure that the issuing order of write operations to the underlying devices remains sequential. However, the splitting itself causes other problems: 1) Since dm-crypt relies on the block layer zone write plugging to handle zone append emulation using regular write operations, the reminder of a split write BIO will always be plugged into the target zone write plugged. Once the on-going write BIO finishes, this reminder BIO is unplugged and issued from the zone write plug work. If this reminder BIO itself needs to be split, the reminder will be re-issued and plugged again, but that causes a call to a blk_queue_enter(), which may block if a queue freeze operation was initiated. This results in a deadlock as DM submission still holds BIOs that the queue freeze side is waiting for. 2) dm-crypt relies on the emulation done by the block layer using regular write operations for processing zone append operations. This still requires to properly return the written sector as the BIO sector of the original BIO. However, this can be done correctly only and only if there is a single clone BIO used for processing the original zone append operation issued by the user. If the size of a zone append operation is larger than dm-crypt max_write_size, then the orginal BIO will be split and processed as a chain of regular write operations. Such chaining result in an incorrect written sector being returned to the zone append issuer using the original BIO sector. This in turn results in file system data corruptions using xfs or btrfs. Fix this by modifying get_max_request_size() to always return the size of the BIO to avoid it being split with dm_accpet_partial_bio() in crypt_map(). get_max_request_size() is renamed to get_max_request_sectors() to clarify the unit of the value returned and its interface is changed to take a struct dm_target pointer and a pointer to the struct bio being processed. In addition to this change, to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt device max_hw_sectors limit to be at most BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture). This forces DM core to split write BIOs before passing them to crypt_map(), and thus guaranteeing that dm-crypt can always accept an entire write BIO without needing to split it. This change does not have any effect on the read path of dm-crypt. Read operations can still be split and the BIO fragments processed in parallel. There is also no impact on the performance of the write path given that all zone write BIOs were already processed inline instead of in parallel. This change also does not affect in any way regular dm-crypt block devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: sta...@vger.kernel.org Signed-off-by: Damien Le Moal <dlem...@kernel.org> --- drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++--------- 1 file changed, 39 insertions(+), 10 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index 17157c4216a5..4e80784d1734 100644 --- a/drivers/md/dm-crypt.c +++ b/drivers/md/dm-crypt.c @@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request"); static unsigned int max_write_size = 0; module_param(max_write_size, uint, 0644); MODULE_PARM_DESC(max_write_size, "Maximum size of a write request"); -static unsigned get_max_request_size(struct crypt_config *cc, bool wrt) + +static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio) { + struct crypt_config *cc = ti->private; unsigned val, sector_align; - val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size); - if (likely(!val)) - val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE; - if (wrt || cc->used_tag_size) { - if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT)) - val = BIO_MAX_VECS << PAGE_SHIFT; - } - sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size); + bool wrt = op_is_write(bio_op(bio)); + + if (wrt) { + /* + * For zoned devices, splitting write operations creates the + * risk of deadlocking queue freeze operations with zone write + * plugging BIO work when the reminder of a split BIO is + * issued. So always allow the entire BIO to proceed. + */ + if (ti->emulate_zone_append) + return bio_sectors(bio); + + val = min_not_zero(READ_ONCE(max_write_size), + DM_CRYPT_DEFAULT_MAX_WRITE_SIZE); + } else { + val = min_not_zero(READ_ONCE(max_read_size), + DM_CRYPT_DEFAULT_MAX_READ_SIZE); + } + + if (wrt || cc->used_tag_size) + val = min(val, BIO_MAX_VECS << PAGE_SHIFT); + + sector_align = max(bdev_logical_block_size(cc->dev->bdev), + (unsigned)cc->sector_size); val = round_down(val, sector_align); if (unlikely(!val)) val = sector_align; @@ -3496,7 +3514,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio) /* * Check if bio is too large, split as needed. */ - max_sectors = get_max_request_size(cc, bio_data_dir(bio) == WRITE); + max_sectors = get_max_request_sectors(ti, bio); if (unlikely(bio_sectors(bio) > max_sectors)) dm_accept_partial_bio(bio, max_sectors); @@ -3733,6 +3751,17 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits) max_t(unsigned int, limits->physical_block_size, cc->sector_size); limits->io_min = max_t(unsigned int, limits->io_min, cc->sector_size); limits->dma_alignment = limits->logical_block_size - 1; + + /* + * For zoned dm-crypt targets, there will be no internal splitting of + * write BIOs to avoid exceeding BIO_MAX_VECS vectors per BIO. But + * without respecting this limit, crypt_alloc_buffer() will trigger a + * BUG(). Avoid this by forcing DM core to split write BIOs to this + * limit. + */ + if (ti->emulate_zone_append) + limits->max_hw_sectors = min(limits->max_hw_sectors, + BIO_MAX_VECS << PAGE_SECTORS_SHIFT); } static struct target_type crypt_target = { -- 2.49.0