On Thu, May 22, 2025 at 03:08:03PM +0200, Kevin Wolf wrote: > When scsi-block is used on a host multipath device, it runs into the > problem that the kernel dm-mpath doesn't know anything about SCSI or > SG_IO and therefore can't decide if a SG_IO request returned an error > and needs to be retried on a different path. Instead of getting working > failover, an error is returned to scsi-block and handled according to > the configured error policy. Obviously, this is not what users want, > they want working failover. > > QEMU can parse the SG_IO result and determine whether this could have > been a path error, but just retrying the same request could just send it > to the same failing path again and result in the same error. > > With a kernel that supports the DM_MPATH_PROBE_PATHS ioctl on dm-mpath > block devices (queued in the device mapper tree for Linux 6.16), we can > tell the kernel to probe all paths and tell us if any usable paths > remained. If so, we can now retry the SG_IO ioctl and expect it to be > sent to a working path. > > Signed-off-by: Kevin Wolf <kw...@redhat.com> > --- > v2: > - Add a comment to explain retry scenarios [Stefan] > - Handle -EAGAIN returned for suspended devices [Ben] > > block/file-posix.c | 115 ++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 114 insertions(+), 1 deletion(-) > > diff --git a/block/file-posix.c b/block/file-posix.c > index ec95b74869..569f4ca472 100644 > --- a/block/file-posix.c > +++ b/block/file-posix.c > @@ -41,6 +41,7 @@ > > #include "scsi/pr-manager.h" > #include "scsi/constants.h" > +#include "scsi/utils.h" > > #if defined(__APPLE__) && (__MACH__) > #include <sys/ioctl.h> > @@ -72,6 +73,7 @@ > #include <linux/blkzoned.h> > #endif > #include <linux/cdrom.h> > +#include <linux/dm-ioctl.h> > #include <linux/fd.h> > #include <linux/fs.h> > #include <linux/hdreg.h> > @@ -138,6 +140,22 @@ > #define RAW_LOCK_PERM_BASE 100 > #define RAW_LOCK_SHARED_BASE 200 > > +/* > + * Multiple retries are mostly meant for two separate scenarios: > + * > + * - DM_MPATH_PROBE_PATHS returns success, but before SG_IO completes, > another > + * path goes down. > + * > + * - DM_MPATH_PROBE_PATHS failed all paths in the current path group, so we > have > + * to send another SG_IO to switch to another path group to probe the > paths in > + * it. > + * > + * Even if each path is in a separate path group (path_grouping_policy set to > + * failover), it's rare to have more than eight path groups - and even then > + * pretty unlikely that only bad path groups would be chosen in eight > retries. > + */ > +#define SG_IO_MAX_RETRIES 8 > + > typedef struct BDRVRawState { > int fd; > bool use_lock; > @@ -165,6 +183,7 @@ typedef struct BDRVRawState { > bool use_linux_aio:1; > bool has_laio_fdsync:1; > bool use_linux_io_uring:1; > + bool use_mpath:1; > int page_cache_inconsistent; /* errno from fdatasync failure */ > bool has_fallocate; > bool needs_alignment; > @@ -4264,15 +4283,105 @@ hdev_open_Mac_error: > /* Since this does ioctl the device must be already opened */ > bs->sg = hdev_is_sg(bs); > > + /* sg devices aren't even block devices and can't use dm-mpath */ > + s->use_mpath = !bs->sg; > + > return ret; > } > > #if defined(__linux__) > +#if defined(DM_MPATH_PROBE_PATHS) > +static bool sgio_path_error(int ret, sg_io_hdr_t *io_hdr) > +{ > + if (ret < 0) { > + switch (ret) { > + case -ENODEV: > + return true; > + case -EAGAIN: > + /* > + * The device is probably suspended. This happens while the dm > table > + * is reloaded, e.g. because a path is added or removed. This is > an > + * operation that should complete within 1ms, so just wait a bit > and > + * retry. > + * > + * If the device was suspended for another reason, we'll wait and > + * retry SG_IO_MAX_RETRIES times. This is a tolerable delay > before > + * we return an error and potentially stop the VM. > + */ > + qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, 1000000);
sgio_path_error() is missing coroutine_fn. Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
signature.asc
Description: PGP signature