On 5/12/25 17:18, Kevin Wolf wrote:
Am 08.05.2025 um 15:51 hat Martin Wilck geschrieben:
Hello Kevin,

[I'm sorry for the previous email. It seems that I clicked "send" in
the wrong window].

On Tue, 2025-04-29 at 18:50 +0200, Kevin Wolf wrote:
Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it
know
which path the ioctl took nor does it necessarily have the privileges
to
fail a path using the control device.

Thanks for this effort.

I have some remarks about your approach. The most important one is that
the DM_MPATH_PROBE_PATHS_CMD ioctl appears to be a dangerous command.
It sends IO to possibly broken paths and waits for it to complete. In
the common error case of a device not responding, this might cause the
code to hang for a long time in the kernel ioctl code path, in
uninterruptible sleep (note that these probing commands will be queued
after other possibly hanging IO). In the past we have put a lot of
effort into other code paths in multipath-tools and elsewhere to avoid
this kind of hang to the extent possible. It seems to me that your set
re-introduces this risk.

That's a fair point. I don't expect this code to be fully final, another
thing that isn't really optimal (as mentioned in the comment message) is
that we're not probing paths in parallel, potentially adding up timeouts
from multiple paths.

I don't think this is a problem of the interface, though, but we can
improve the implementation keeping the same interface.

Maybe I'm missing something, but I think the reason why it has to be
uninterruptible is just that submit_bio_wait() has the completion on the
stack? So I suppose we could just kmalloc() some state, submit all the
bios, let the completion callback free it, and then we can just stop
waiting early (i.e. wait_for_completion_interruptible/killable) and let
the requests complete on their own in the background.

Would this address your concerns or is there another part to it?

Not really. There are two issues which ultimately made us move away
from read I/O as path checkers:

- Spurious timeouts due to blocked or congested submission
- Error handling delays and stalls

When using normal read/write functions for path checking you are
subjected to normal I/O processing rules, ie I/O is inserted
at the end of the submission queue. If the system is busy the
path checker will time out prematurely without ever having reached
the target. One could avoid that by extending the timeout, but that
would make the path checker unreliable.

But the real issue is error handling. Path checker are precisely there
to check if the path is healthy, and as such _will_ be subjected
to error handling (or might even trigger error handling itself).
The thing about error handling is that you can only return affected
commands once error handling is complete, as only then you can be
sure that no DMA is pending on the buffers and one can free/re-use
the associated pages.
On SCSI error handling can be an _extremely_ lengthy affair
(we had reports for over one hour), and the last thing you want is
to delay your path checker for that time.

(And besides, I didn't even mention the third problem with I/O-based
path checkers, namely that the check entirely the wrong thing; we
are not interested whether we can do I/O, we are interested in whether
we can send commands. In the light of eg ALUA these are two very
different things.)

Apart from that, minor observations are that in patch 2/2 you don't
check whether the map is in queueing state, and I don't quite
understand why you only check paths in the map's active path group
without attempting a PG failover in the case where all paths in the
current PG fail.

Ben already fixed up the missing check.

Not attempting PG failover was also his suggestion, on the basis that it
would be additional work for no real benefit when you can only submit
requests for the current PG anyway. If userspace retries the SG_IO, it
will already pick a different PG, so having the kernel do the same
doesn't really improve anything.

Precisely.

I think the best way forward here is to have a 'post_ioctl' handler
(much like we have a pre_ioctl handler nowadays).
This would check the return code from the ioctl, and invalidate the
current path if there was an I/O error.
That way the user can resubmit the ioctl, until all paths are exhausted.
For that to work we need to agree on two error codes, one for
'path failed, retry' and one for 'path failed, no retry possible'.
Resetting path status will be delegated to multipathd, but then that's
its main task anyway.

Cheers,

Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
h...@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

Reply via email to