Hello Milan,

On Mon, 2023-02-20 at 09:51 +0100, Milan Oravec wrote:
> > So did you trigger a cluster takeover? My guess is that it is your
> > target initiating the connection drops while taking over to the
> > other node.
> > 
> > 
> 
> 
> There was no intentional cluster takeover. This happens during normal
> operation. 
>  
> > How a target behaves during the transition is left to the target.
> > The
> > initiator will keep querying for recovery, until either it times
> > out or
> > recovers.
> > 
> 
> 
> Recovery seems to work fine. I've tested it by disconnecting one
> path to the target and all was OK: the second path was used, and when
> the first one recovered, it switched back over to that one.
>  


> I can understand this is a very complex problem; what do you suggest
> I do to debug these issues? We have 5 hosts connected to the Fujitsu
> target, but errors occur only on the two which are running KVM
> guests. Other servers running mail hosts and camera surveillance do
> not seem to be affected. They are Debian systems but not running
> multipath, so I've made a test:
> 
> So far I've disabled all KVM guests on one of the two KVM hosts used
> for virtualization, mounted one KVM guest file system locally (/mnt),
> and performed a stress test on that mount:
> 

By KVM, do you mean just pure KVM? Or the management suite too,
libvirt?
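
Also, in case you want to make the stress test comparable between the
hosts, it may help to drive the mount with a defined load. Something
roughly like the fio run below is what I have in mind; the parameters
are only an assumed example of a mixed random workload, not what you
actually ran:

    # assumed example: 4 jobs of 4k mixed random I/O, direct, 5 minutes
    fio --name=kvmtest --directory=/mnt --rw=randrw --bs=4k --size=1G \
        --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
        --time_based --runtime=300 --group_reporting

If the rejected PDUs only show up under this kind of load and not under
a plain sequential copy, that already narrows things down a bit.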


> 
> When I let one KVM guest (Debian) run on this host system, the
> following errors are in dmesg:
> 
> [Mon Feb 20 09:49:39 2023]  connection1:0: pdu (op 0x5 itt 0x53)
> rejected. Reason code 0x9
> [Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x7c)
> rejected. Reason code 0x9
> [Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x51)
> rejected. Reason code 0x9
> [Mon Feb 20 09:50:07 2023]  connection1:0: pdu (op 0x5 itt 0x33)
> rejected. Reason code 0x9
> 
> The guest is using Virtio disk drivers (vda), so I've switched this
> guest to generic SATA (sda), but the errors in the log remain.
> 
> It seems that KVM is triggering these errors and making our NAS
> unstable.
> 
> What could KVM be doing differently? It is using /dev/mapper/dm-XX
> as disk devices, so there is no direct iSCSI access.
> 
> Any ideas what should I try next? 
> 
> 

I'm afraid I do not have many direct hints for you here. Given that
this issue does not happen when KVM is not involved, that would imply
that the erratic behavior originates from that software's integration.
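
What may still help is to correlate the /dev/mapper devices the guests
sit on with the underlying iSCSI sessions and SCSI disks, so that the
rejected PDUs can be tied to a specific path and portal. Roughly
something like this (output details vary with versions):

    multipath -ll              # each dm-XX map and the sdX paths behind it
    iscsiadm -m session -P 3   # sessions, portals and the sdX devices they back
    dmsetup ls --tree          # /dev/mapper names mapped to dm-XX minors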


> > There will be errors in your system journal for this particular
> > setup.
> > 
> > Errors like:
> > 
> > * connection drops
> > * iscsi session drops/terminations
> > * SCSI errors
> > * multipath path checker errors
> > 
> > All these are errors which will eventually be recovered. That is
> > why we need close integration between these layers when building a
> > storage solution on top.
> > 
> 
> 
> This is a very complex ecosystem. I know that error reporting is a
> good thing :) and helps to troubleshoot problems. But when everything
> is all right there should be no errors, right?
>  

In an ideal scenario, yes, there would be no errors. But in a SAN
setup, cluster failovers are a feature of the SAN target, and as such,
during that transition, some errors are expected on the initiator,
which are eventually recovered.

Recovery is the critical part here. When states do not recover to
normal, that is an error on either the target or the initiator, or at
times even the middleman (the network).
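
A quick way to check, after such an event, whether everything has in
fact recovered is something along these lines (exact output differs
between versions; the grep pattern is just a starting point):

    iscsiadm -m session -P 1    # session state should read LOGGED_IN
    multipathd show paths       # all paths back to active/ready
                                # (multipathd -k'show paths' on older versions)
    journalctl -k --since "1 hour ago" | grep -iE 'iscsi|multipath|sd[a-z]'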

> > 
> > Note: These days I only have a software LIO target to test/play
> > with,
> > where I have not seen any real issues/errors. How each SAN Target
> > behaves is something highly specific to the target, in your case
> > the
> > Fujitsu target.
> > 
> > 
> 
> 
> Are you running KVM virtualization atop of your SAN target?
>  

My LIO target runs in a KVM guest, and so does the iSCSI initiator.
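
If you want to rule the Fujitsu target in or out, you could point one
of the affected hosts at a throwaway LIO target and see whether the
reject PDUs follow. A minimal sketch with targetcli; the backing file
and the IQNs below are made up and need to be adapted to your setup:

    targetcli /backstores/fileio create disk0 /var/lib/target/disk0.img 10G
    targetcli /iscsi create iqn.2023-02.org.example:testtarget
    targetcli /iscsi/iqn.2023-02.org.example:testtarget/tpg1/luns \
        create /backstores/fileio/disk0
    targetcli /iscsi/iqn.2023-02.org.example:testtarget/tpg1/acls \
        create iqn.1993-08.org.debian:01:clienthost
    targetcli saveconfig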


-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
