On 5/15/26 18:52, Vjaceslavs Klimovs wrote:
> Summary
> -------
> On v6.18, starting a libvirt/QEMU guest with virtio-blk backed by an
> LVM "--type raid1" LV (drivers/md/dm-raid.c stacked on
> drivers/md/raid1.c) makes md/raid1 register read failures at LV
> sector 0 within seconds of "virsh start" and mark rimage_0 Faulty
> once max_corrected_read_errors (default 20) is exceeded. Reads
> succeed via the redirect path so guests boot, but every guest disk
> ends up degraded on every VM start. Same workload on legacy
> "--type mirror" (drivers/md/dm-raid1.c) crashes the host: a
> zero-length READ reaches the NVMe controller, is rejected with
> "Invalid Field in Command", and the dm-mirror recovery path oopses.

That sounds somewhat like
https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/

Have you tried latest 7.1-rc? It contains a fix for the problem
mentioned in said thread:

f7b24c7b41f23b ("md/raid1,raid10: don't fail devices for invalid IO
errors") [v7.1-rc2]

Ciao, Thorsten

> Symptom on dm-raid raid1 (post --type raid1)
> --------------------------------------------
> Per LV, at virsh start, in host dmesg:
>
>   kernel: raid1_end_read_request: 95 callbacks suppressed
>   kernel: raid1_read_request: 95 callbacks suppressed
>   kernel: md/raid1:mdX: dm-58: rescheduling sector 0
>   kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
>   kernel: md/raid1:mdX: dm-58: rescheduling sector 0
>   kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
>   [... 10 rescheduling/redirecting pairs ...]
>   kernel: md/raid1:mdX: dm-58: Raid device exceeded read_error
>   threshold [cur 21:max 20]
>   kernel: md/raid1:mdX: dm-58: Failing raid device
>   kernel: md/raid1:mdX: Disk failure on dm-58, disabling device.
>   kernel: md/raid1:mdX: Operation continuing on 1 devices.
>
>   dmeventd: WARNING: Device #0 of raid1 array, vg0-iris_boot, has failed.
>   dmeventd: WARNING: Waiting for resynchronization to finish before
>   initiating repair on RAID device vg0-iris_boot.
>   dmeventd: Use 'lvconvert --repair vg0/iris_boot' to replace failed device.
>
> Subsequent "lvs -a":
>
>   WARNING: RaidLV vg0/iris_boot needs to be refreshed!
>   See character 'r' at position 9 in the RaidLV's attributes and its SubLV(s).
>
> dmesg | grep nvme is EMPTY on this path. The NVMe driver is not
> involved in producing the error; the failure originates between the
> virtio-blk bio submission and raid1_end_read_request().
>
> Symptom on legacy dm-mirror (pre-conversion --type mirror)
> ----------------------------------------------------------
> Same workload on drivers/md/dm-raid1.c reaches the NVMe controller
> as a zero-length READ and panics the host through dm-mirror's
> recovery path:
>
>   kernel: operation not supported error, dev nvme1n1, sector 935446535
>   op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
>   kernel: nvme1n1: I/O Cmd(0x2) @ LBA 935446535, 0 blocks, I/O Error
>   (sct 0x0 / sc 0x2)
>   [... 10+ identical bursts at same timestamp ...]
>   dmeventd: Primary mirror device 252:58 read failed.
>   dmeventd: vg0-iris_boot is now in-sync.
>   [kernel oops in dm_mirror recovery path, full trace lost to console flash]
>
> The "phys_seg 0", "0 blocks", "sct 0x0/sc 0x2" trio (NVMe Generic,
> Invalid Field in Command, NVMe spec 4.1.1.2) is unambiguous: a bio
> with bi_iter.bi_size == 0 and bi_vcnt == 0 left the block layer and
> hit the controller. dm-raid raid1 hides this by retrying on the
> surviving leg, but the upstream-of-md trigger is identical.
>
> Bisect
> ------
> git bisect, v6.12..v6.18, 16 deterministic GOOD/BAD steps, no skips,
> ~104 minutes:
>
>   5ff3f74e145adc79b49668adb8de276446acf6be is the first bad commit
>   block: simplify direct io validity check
>
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
>  static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
>                                 struct iov_iter *iter)
>  {
> -       return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
> -               !bdev_iter_is_aligned(bdev, iter);
> +       return (iocb->ki_pos | iov_iter_count(iter)) &
> +               (bdev_logical_block_size(bdev) - 1);
>  }
>
> The dropped bdev_iter_is_aligned() used to walk the iov_iter and
> reject per-segment misaligned/degenerate vectors at the blkdev fops
> entry point. The replacement only validates ki_pos and total length
> against the logical block size.
> Cases that now pass that no longer
> get rejected:
>
>   - iter with iov_iter_count(iter) == 0 (degenerate; total length is
>     "sector-aligned" since 0 % 512 == 0)
>   - iter where total length is sector-aligned but a segment isn't
>
> The commit message justifies the removal with "The block layer
> checks all the segments for validity later". This is true for the
> io_uring submit path (which enters __blkdev_direct_IO directly and
> does its own validation) but not for the libaio aio_read/write_iter
> or the worker-pool sync read/write_iter paths that enter via
> blkdev_{read,write}_iter() -> blkdev_dio_invalid(). For those paths,
> the segment check has no replacement.
>
> Reproducing
> ----------------------------------------------------------
>
> The trigger requires QEMU virtio-blk's specific submission shape AND
> a non-io_uring submit. Userspace libaio alone, userspace
> preadv-in-a-thread alone, and QEMU's raw-driver open probes (which
> qemu-img info exercises identically) are all insufficient. The
> combination that hits the bug is "guest-driven I/O through
> virtio-blk-pci with cache.direct=on and aio in {native, threads}".
>
> #regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be
>
> Thanks,
> Vjaceslavs Klimovs
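
For anyone following along, the two cases listed in the report can be
modelled in userspace. This is only a rough sketch under stated
assumptions, not kernel code: struct model_iter and all helper names are
hypothetical, only segment lengths are tracked (address/DMA alignment is
not modelled), and per_segment_invalid() is a simplified reading of the
per-segment check described in the report rather than the kernel's
bdev_iter_is_aligned().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an iov_iter: only segment lengths are kept. */
struct model_iter {
	const size_t *seg_len;	/* length of each segment, in bytes */
	size_t nr_segs;		/* number of segments */
};

/* Total byte count; plays the role of iov_iter_count() in this model. */
static size_t model_count(const struct model_iter *it)
{
	size_t total = 0;

	for (size_t i = 0; i < it->nr_segs; i++)
		total += it->seg_len[i];
	return total;
}

/*
 * Model of the post-commit predicate from the quoted diff:
 * only the file position and the total length are tested.
 * lbs is the logical block size and must be a power of two.
 */
static bool new_check_invalid(uint64_t pos, const struct model_iter *it,
			      unsigned int lbs)
{
	assert(lbs && (lbs & (lbs - 1)) == 0);
	return ((pos | model_count(it)) & (lbs - 1)) != 0;
}

/*
 * Simplified model of a per-segment check (assumption: lengths only):
 * the position and every individual segment length must be aligned.
 */
static bool per_segment_invalid(uint64_t pos, const struct model_iter *it,
				unsigned int lbs)
{
	assert(lbs && (lbs & (lbs - 1)) == 0);
	if (pos & (lbs - 1))
		return true;
	for (size_t i = 0; i < it->nr_segs; i++)
		if (it->seg_len[i] & (lbs - 1))
			return true;
	return false;
}
```

With a 512-byte logical block size, an iter made of two 256-byte
segments at offset 0 is accepted by new_check_invalid() but rejected by
per_segment_invalid(), and an empty iter at offset 0 is likewise
accepted by new_check_invalid(), since (0 | 0) & 511 == 0.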

