Bug#1031131: open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

Milan Oravec Mon, 27 Feb 2023 11:27:23 -0800

Hello Sarraf,

On Thu, Feb 23, 2023 at 2:44 PM Ritesh Raj Sarraf <r...@debian.org> wrote:


> On Thu, 2023-02-23 at 14:19 +0100, Milan Oravec wrote:
> > > >
> > >
> > > By KVM, do you mean just pure KVM ? Or the management suite too,
> > > libvirt ?
> > >
> >
> >
> > Yes, libvirt is used to manage KVM, and virt-manager on my desktop to
> > connect to it. It is simple setup without clustering.
> >
>
> Thanks for confirming.
>
> You should start with the hypervisor where libvirtd is running. Check
> if that hypervisor host is running normal.
>

No unusual errors reported.


>
> From all the symptoms you've shared, my wild arrow in the dark is that
> your hypervisor/guest network may be running into issues.
>
>
Everything is OK with the network.


> >
> >
> > Do you know someone who can help with this? Here is example of KVM
> > guest running configuration:
> >
> > libvirt+ 244533      1  5 Feb03 ?        1-03:54:37 qemu-system-
> > x86_64 -enable-kvm -name guest=me_test,process=qemu:me_test,debug-
> > threads=on -S -object
> > secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-
> > 108-me_test/master-key.aes -machine pc-1.1,accel=kvm,usb=off,dump-
> > guest-core=off -cpu host -m 8096 -realtime mlock=off -smp
> > 4,sockets=1,cores=4,threads=1 -uuid 1591f345-96b5-4077-9d32-
> > b2991003753d -no-user-config -nodefaults -chardev
> > socket,id=charmonitor,fd=57,server,nowait -mon
> > chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-
> > shutdown -boot strict=on -device piix3-usb-
> > uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-
> > pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,id=drive-
> > ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-
> > ide0-1-0,id=ide0-1-0,bootindex=1 -drive
> > file=/dev/mapper/me_test,format=raw,if=none,id=drive-virtio-
> > disk0,cache=none,discard=unmap,aio=native -device virtio-blk-
> > pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-
> > disk0,bootindex=2,write-cache=on -netdev
> > tap,fd=60,id=hostnet0,vhost=on,vhostfd=61 -device virtio-net-
> > pci,netdev=hostnet0,id=net0,mac=aa:bb:cc:00:10:31,bus=pci.0,addr=0x3
> > -chardev pty,id=charserial0 -device isa-
> > serial,chardev=charserial0,id=serial0 -chardev
> > spicevmc,id=charchannel0,name=vdagent -device
> > virtserialport,bus=virtio-
> > serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice
> > .0 -spice port=5929,addr=0.0.0.0,disable-ticketing,seamless-
> > migration=on -device qxl-
> > vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,v
> > gamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-
> > pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox
> > on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=de
> > ny -msg timestamp=on
> >
> > Maybe there is something undesirable for the iscsi target.
> >
>
> Seems like you are using host networking ? Have you validated that that
> part is working reliably ?
>

Yes, this is working fine too.

I've run some experiments with the storage parameters passed to KVM. A
standard disk created with virt-manager runs with the following parameters:

format=raw,cache=none,discard=unmap,aio=native

I've removed the cache parameter (letting the hypervisor use its default),
along with aio=native, and am now running the disk with:

format=raw,discard=unmap
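
To make the difference between the two option strings above explicit, here
is a minimal Python sketch that splits them into key/value pairs and reports
what was dropped. The helper name parse_drive_opts is only illustrative, and
the split on "," is a simplification that ignores QEMU's escaped commas.

```python
def parse_drive_opts(opts):
    """Split a 'key=value,key=value' option string into a dict.

    Simplification: values containing escaped commas are not handled.
    """
    result = {}
    for item in opts.strip().split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


# The two option strings quoted in this mail.
before = parse_drive_opts("format=raw,cache=none,discard=unmap,aio=native")
after = parse_drive_opts("format=raw,discard=unmap")

# Options present before but missing afterwards.
removed = {k: v for k, v in before.items() if k not in after}
# Options present in both but with a different value.
changed = {
    k: (before[k], after[k])
    for k in before.keys() & after.keys()
    if before[k] != after[k]
}

print("removed:", removed)
print("changed:", changed)
```

So the only difference between the failing and the working setup is that
cache=none and aio=native are gone; format and discard are unchanged.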

With this setup, no errors of the form "connection1:0: pdu (op 0x5 itt 0x48)
rejected. Reason code 0x9" appear in the logs. The virtual host was stressed
with:

ns2:/home/migo# stress-ng -v -d 1
stress-ng: debug: [858] 2 processors online, 2 processors configured
stress-ng: info:  [858] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info:  [858] dispatching hogs: 1 hdd
stress-ng: debug: [858] cache allocate: default cache size: 262144K
stress-ng: debug: [858] starting stressors
stress-ng: debug: [858] 1 stressor started
stress-ng: debug: [859] stress-ng-hdd: started [859] (instance 0)
^Cstress-ng: debug: [859] stress-ng-hdd: exited [859] (instance 0)
stress-ng: debug: [858] process [859] terminated
stress-ng: info:  [858] successful run completed in 30377.75s (8 hours, 26 mins, 17.75 secs)
stress-ng: debug: [858] metrics-check: all stressor metrics validated and sane
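
For checking the logs after such a run, a small Python sketch like the one
below can pick out the rejected-PDU messages and decode their fields. The
sample line is the message quoted earlier in this mail. The two lookup
tables are my own transcription of the opcode and Reject reason values from
the iSCSI RFC (RFC 7143), so please treat them as an assumption and verify
them against the spec; decode_rejects is just an illustrative name.

```python
import re

# Assumption: values transcribed from the iSCSI RFC, not taken from this thread.
INITIATOR_OPCODES = {
    0x00: "NOP-Out",
    0x01: "SCSI Command",
    0x02: "Task Management Function Request",
    0x03: "Login Request",
    0x04: "Text Request",
    0x05: "SCSI Data-Out",
    0x06: "Logout Request",
    0x10: "SNACK Request",
}
REJECT_REASONS = {
    0x02: "Data (payload) digest error",
    0x03: "SNACK reject",
    0x04: "Protocol error",
    0x05: "Command not supported",
    0x06: "Immediate command reject",
    0x07: "Task in progress",
    0x08: "Invalid data ACK",
    0x09: "Invalid PDU field",
    0x0A: "Long operation reject",
    0x0C: "Waiting for logout",
}

# Matches the kernel message as quoted in this mail; whitespace is flexible
# so that a line-wrapped copy of the message still matches.
REJECT_RE = re.compile(
    r"pdu\s+\(op\s+(0x[0-9a-fA-F]+)\s+itt\s+(0x[0-9a-fA-F]+)\)\s+rejected\.\s+"
    r"Reason\s+code\s+(0x[0-9a-fA-F]+)"
)


def decode_rejects(log_text):
    """Return one dict per rejected-PDU message found in log_text."""
    decoded = []
    for match in REJECT_RE.finditer(log_text):
        op, itt, reason = (int(value, 16) for value in match.groups())
        decoded.append({
            "opcode": INITIATOR_OPCODES.get(op, "unknown opcode 0x%x" % op),
            "itt": itt,
            "reason": REJECT_REASONS.get(reason, "unknown reason 0x%x" % reason),
        })
    return decoded


sample = "connection1:0: pdu (op 0x5 itt 0x48) rejected. Reason code 0x9"
for entry in decode_rejects(sample):
    print(entry)
```

If the tables are right, the message would mean that the target rejected a
SCSI Data-Out PDU because of an invalid PDU field, which may help to narrow
down what the target dislikes.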

It seems that the hypervisor's cache filters out the bogus commands (the
ones causing trouble for the target) before they are sent to the target.
What do you think?

But running an iSCSI target with the cache enabled is not a good idea, so
this is not a real solution. :(



>
> > >
> > > > > There will be errors in your system journal for this particular
> > > > > setup.
> > > > >
> > > > > Errors like:
> > > > >
> > > > > * connection drops
> > > > > * iscsi session drops/terminations
> > > > > * SCSI errors
> > > > > * multipath path checker errors
> > > > >
> > > > > All these will be errors which will be recovered eventually. That is
> > > > > why we have the need for close integration in between these layers,
> > > > > when building a storage solution on top.
> > > > >
> > > >
> > > >
> > > > This is a very complex ecosystem. I know that error reporting is a good
> > > > thing :) and helps to troubleshoot problems. But when everything is
> > > > all right there should be no errors, right?
> > > >
> > >
> > > In an ideal scenario, yes, there will be no errors. But on a SAN setup,
> > > cluster failovers are a feature of the SAN target, and as such during
> > > that transition some errors are expected on the initiator, which are
> > > eventually recovered.
> > >
> > > Recovery is the critical part here. When states do not recover to
> > > normal, it is an error; either the target or the initiator. Or even the
> > > middleman (network) at times.
> > >
> >
> >
> > This part seems to work OK, so far no data loss was detected.
> >
>
> From your initial logs, mpath reported 'io pending' on one of the
> paths. I'm not sure if I'd call that fully recovered.
>

It recovers within 10-15 s, and the path reporting 'io pending' returns to
normal operation.


>
> What if you down the other (healthy) path ? Does IO progress on the
> initial (io pending) path ?
>

I've tested disconnecting the link of the first path, and the second path
immediately took over all operations. After reconnecting that path, all
operations moved back to the original path, which has the higher priority.

test-app (3600000e00d2c0000002c1772001f0000) dm-97 FUJITSU,ETERNUS_DXL
size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 1:0:0:28 sdar 66:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:28 sdbj 67:208 active ready running
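
To repeat this check after each failover test without reading the topology
by eye, a minimal Python sketch along these lines could parse the multipath
output shown above. The regular expressions only target the layout of this
particular listing, and the names parse_path_groups and all_healthy are
illustrative, not part of any multipath tooling.

```python
import re

# Path-group line, e.g.: |-+- policy='round-robin 0' prio=50 status=active
GROUP_RE = re.compile(r"policy='([^']*)' prio=(\d+) status=(\w+)")
# Path line, e.g.: | `- 1:0:0:28 sdar 66:176 active ready running
PATH_RE = re.compile(
    r"(\d+:\d+:\d+:\d+)\s+(\S+)\s+(\d+:\d+)\s+(\w+)\s+(\w+)\s+(\w+)"
)


def parse_path_groups(text):
    """Return a list of path groups, each with its priority and paths."""
    groups = []
    for line in text.splitlines():
        group = GROUP_RE.search(line)
        if group:
            groups.append({
                "policy": group.group(1),
                "prio": int(group.group(2)),
                "status": group.group(3),
                "paths": [],
            })
            continue
        path = PATH_RE.search(line)
        if path and groups:
            groups[-1]["paths"].append({
                "hctl": path.group(1),
                "dev": path.group(2),
                "state": " ".join(path.group(4, 5, 6)),
            })
    return groups


# The listing from this mail, copied verbatim.
sample = """test-app (3600000e00d2c0000002c1772001f0000) dm-97 FUJITSU,ETERNUS_DXL
size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 1:0:0:28 sdar 66:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:28 sdbj 67:208 active ready running
"""

groups = parse_path_groups(sample)
# The group with the highest priority is the one I/O should return to.
preferred = max(groups, key=lambda g: g["prio"])
all_healthy = all(
    p["state"] == "active ready running" for g in groups for p in g["paths"]
)
print("preferred device:", preferred["paths"][0]["dev"])
print("all paths healthy:", all_healthy)
```

For the listing above, the preferred group is the prio=50 one on sdar, and
both paths report "active ready running".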



>
> > > >
> > > > Are you running KVM virtualization atop of your SAN target?
> > > >
> > >
> > > My LIO target runs in a KVM guest. So does the iSCSI initiator too.
> > >
> >
> >
> > Pure KVM or libvirtd too?
> >
>
> libvirtd it is. The same hypervisor with libvirtd running, hosts the
> initiator and the target, in different VMs.
>
>
> --
> Ritesh Raj Sarraf | http://people.debian.org/~rrs
> Debian - The Universal Operating System
>

Thank you, kind regards,

Milan
