Re: Kernel panic w/ message request_threaded_irq -> qla2x00_request_irqs -> qla2x00_probe_one -> mod_timer

2019-05-03 Thread TomK

On 5/2/2019 10:00 PM, Laurence Oberman wrote:

On Sun, 2019-04-28 at 12:11 -0400, TomK wrote:

On 4/15/2019 10:26 PM, TomK wrote:

On 4/15/2019 3:35 PM, Laurence Oberman wrote:

On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote:

On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote:

On Sun, 2019-04-14 at 23:25 -0400, TomK wrote:

Hey All,

I'm getting a kernel panic on a Gigabyte GA-890XA-UD3 motherboard that
I've got a QLE2464 card in as a target (FC).  The kernel has been
crashing / panicking in the last 1-2 months about once a week.  Before
that, it was rock solid for 4-5 years.  I've upgraded to kernel 4.18.19
but that hasn't made much of a difference.  Since the message includes
qla2x00_request_irqs I thought I would try here first.

Tried to get more info on this but:

1) Keyboard doesn't work and locks up when the panic occurs.  No USB
ports work.  Tried the PS/2 port but nothing.

2) Unable to capture a kdump.  Can't get to the kdump vmcore due to 1).

The two screenshots are pretty much all I can capture.  Tried things
like clocksource=rtc in the kernel parms and disabling hpet1 but
apparently I haven't disabled it everywhere since it still shows up.

Wondering if anyone recognizes these messages or has any idea what
could be the issue here?  Even a hint would be appreciated.

  
Hello Tom

I have had similar issues and reported them to Himanshu@Cavium.
I have kept all my target servers at kernel 4.5 as it has been the only
version that has always been stable.
If your motherboard has an NMI (virtual or physical), set all of these
in /etc/sysctl.conf.
Run sysctl -a;dracut -f and reboot

kernel.nmi_watchdog = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi =
kernel.unknown_nmi_panic = 1

When the issue shows up press the virtual/physical NMI

This is with the assumption that generic kdump is properly set up and
dmesg | grep crash shows memory reserved by the crashkernel and that
you have tested kdump manually.

Another option is to use a USB serial port to capture the full log if
you cannot get kdump to work.
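
The preconditions listed above (the four NMI sysctls and a crashkernel reservation) can be checked with a small script. This is only a minimal sketch: the helper names are hypothetical and not part of any existing tool, and it inspects text you hand it rather than touching the system.

```python
# Hypothetical helpers; the names are illustrative, not from any real tool.
NMI_KEYS = (
    "kernel.nmi_watchdog",
    "kernel.panic_on_io_nmi",
    "kernel.panic_on_unrecovered_nmi",
    "kernel.unknown_nmi_panic",
)


def parse_sysctl_conf(text):
    """Return {key: value} for 'key = value' lines, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_nmi_settings(text):
    """List the NMI keys that are absent or have an empty value."""
    settings = parse_sysctl_conf(text)
    return [key for key in NMI_KEYS if not settings.get(key)]


def crashkernel_reserved(cmdline):
    """True if the kernel command line carries a crashkernel= option."""
    return any(tok.startswith("crashkernel=") for tok in cmdline.split())
```

On a live system the inputs would typically be the contents of /etc/sysctl.conf and /proc/cmdline. A key with nothing after the "=" is reported as missing rather than guessed at.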
  
That approach may provide further evidence about kernel bugs but it is
not guaranteed that that approach will lead to a solution. It would
help if either or both of you could do the following on a test system:
* Check out branch qla2xxx-for-next of my kernel repo on github
  (https://github.com/bvanassche/linux/tree/qla2xxx-for-next).
* Enable lockdep and KASAN in the kernel config (CONFIG_PROVE_LOCKING
  and CONFIG_KASAN).
* Build and install that kernel.
* Run your favorite workload.

Please note that the qla2xxx-for-next branch is based on the v5.1-rc1
kernel and hence should not be installed on any production system.
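
Before building, it may help to confirm that both debug options really ended up in the generated config. The following is a rough sketch, assuming the usual "CONFIG_FOO=y" / "# CONFIG_FOO is not set" layout of a kernel .config file; the function names are hypothetical.

```python
# Hypothetical helpers for checking a kernel .config; names are illustrative.
REQUIRED = ("CONFIG_PROVE_LOCKING", "CONFIG_KASAN")


def enabled_options(config_text):
    """Collect the options that are set to 'y' in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return enabled


def missing_debug_options(config_text, required=REQUIRED):
    """Return the required options that are not built in."""
    enabled = enabled_options(config_text)
    return [opt for opt in required if opt not in enabled]
```

Passing in the text of the .config from the build tree returns an empty list when both lockdep and KASAN are enabled.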

Thanks,

Bart.
  
Hello Bart

OK, I will get to this by Thursday, won't be able to change the
target server kernel until then.
Regards
Laurence

  
Same.  I'll try this out closer to the weekend.


Not an NMI motherboard.  This is a 9-10 year old AMD board meant as
a desktop or home server.

I'll have to read more about the USB Serial port to capture further
info.  That's interesting.

For the time being, I've disabled HPET in BIOS.  ( Appears the
kernel boot parameter method wasn't enough. )




Hey Guys,
Did some of what you suggested, including the USB serial setup:
1) One of DB9 RS232 Serial Null Modem Cable F/F
2) Two of USB to RS232 Serial Port DB9 9 Pin Male
however, when the kernel came down it took the USB support with it
and so minicom went offline:
  CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2  | VT102 | Offline
But I did enable full logging for the QLA module:
echo 0x7fff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging
Did all that, minus the Kernel v5.1-rc1 implementation, and this is
what was picked up from the minicom USB to Serial capture before
things went south:
1235905 ^Mqla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[9c9ea758] qp 0
1235906 ^Mqla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[000096ae11b7] qp 0
1235907 ^Mqla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direc

Re: [PATCH 0/4] lpfc updated for 12.2.0.2

2019-05-03 Thread Bart Van Assche
On 5/1/19 10:59 AM, James Smart wrote:
> Update lpfc to revision 12.2.0.2
> 
> A quick patch set that resolves lockdep checking issues and
> addresses a couple of bugs found when inspecting the paths
> for the lockdeps.
> 
> The patches were cut against Martin's 5.2/scsi-queue tree

Hi James,

While testing this patch series I hit the kernel warning shown below. Is
this kernel warning perhaps a regression due to patch 1/4 in this series?

Thanks,

Bart.


lpfc :00:0a.0: 1:(0):2753 PLOGI failure DID:01 Status:x3/x103
WARNING: CPU: 0 PID: 178 at drivers/scsi/lpfc/lpfc_sli.c:2994 lpfc_sli_iocbq_lookup+0x1aa/0x1c0 [lpfc]
CPU: 0 PID: 178 Comm: lpfc_worker_1 Tainted: G   O 5.1.0-rc7-dbg+ #3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
RIP: 0010:lpfc_sli_iocbq_lookup+0x1aa/0x1c0 [lpfc]
Call Trace:
 lpfc_sli_sp_handle_rspiocb+0x43b/0xa30 [lpfc]
 ? check_chain_key+0x13f/0x200
 ? lpfc_sli_handle_fast_ring_event+0x7d0/0x7d0 [lpfc]
 ? memcpy+0x45/0x50
 ? lpfc_sli4_iocb_param_transfer+0xf7/0x420 [lpfc]
 lpfc_sli_handle_slow_ring_event_s4+0x252/0x320 [lpfc]
 lpfc_sli_handle_slow_ring_event+0x32/0x40 [lpfc]
 lpfc_do_work+0x1050/0x19f0 [lpfc]
 ? mark_held_locks+0x34/0xb0
 ? lpfc_unregister_unused_fcf+0xb0/0xb0 [lpfc]
 ? finish_wait+0x110/0x110
 ? _raw_spin_unlock_irqrestore+0x42/0x70
 ? __kthread_parkme+0xb5/0xd0
 kthread+0x1d2/0x1f0
 ? lpfc_unregister_unused_fcf+0xb0/0xb0 [lpfc]
 ? kthread_create_on_node+0xc0/0xc0
 ret_from_fork+0x24/0x50
irq event stamp: 50416
hardirqs last  enabled at (50415): [] _raw_spin_unlock_irqrestore+0x57/0x70
hardirqs last disabled at (50416): [] _raw_spin_lock_irqsave+0x18/0x60
softirqs last  enabled at (44458): [] __do_softirq+0x451/0x5b7
softirqs last disabled at (7): [] irq_exit+0xdd/0x100


Re: [PATCH 24/24] osst: add a SPDX tag to osst.c

2019-05-03 Thread Hannes Reinecke

On 5/2/19 9:55 PM, Willem Riede wrote:
On Thu, May 2, 2019 at 7:19 AM Hannes Reinecke wrote:


On 5/2/19 2:53 PM, Christoph Hellwig wrote:
 > On Thu, May 02, 2019 at 08:06:38AM +0200, Hannes Reinecke wrote:
 >> On 5/1/19 6:14 PM, Christoph Hellwig wrote:
 >>> osst.c is the only osst file missing licensing information.  Add a
 >>> GPLv2 tag for the default kernel license.
 >>>
 >>> Signed-off-by: Chriosstoph Hellwig <h...@losst.de>
 >
 > FYI, my s/st/osst/ on the commit message messed up my signoff, this
 > should be:
 >
 > Signed-off-by: Christoph Hellwig <h...@lst.de>
 >
Maybe it's time to kill osst.c for good ...


Yes. I've been thinking about doing just that. The devices it supports 
are now thoroughly obsolete. The manufacturer has gone out of business. 
All my test drives have broken down over time, so I can't even test any 
changes any more.



Just when I thought to reach out to you :-)

Thing is, we've done numerous changes to the 'st' driver in the course 
of the years, most of which seem to have avoided osst :-(


So what's your suggestion here?
Just drop it completely?
Or can we somehow fold the OnStream-specific things back into st.c?

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


Re: [PATCH 0/5] block/target queue/LUN reset support

2019-05-03 Thread Hannes Reinecke

On 5/2/19 11:29 PM, Brian King wrote:

On 6/1/16 1:05 AM, Hannes Reinecke wrote:

On 05/31/2016 09:56 PM, Mike Christie wrote:

On 05/30/2016 01:37 AM, Hannes Reinecke wrote:

On 05/25/2016 09:54 AM, mchri...@redhat.com wrote:

Currently, for SCSI LUN_RESETs the target layer can only wait
on bio/requests it has sent. This normally results in the
LUN_RESET timing out on the initiator side and that SCSI error
handler escalating to something more disruptive.

To fix this, the following patches add a block layer helper and
callout to reset a request queue which the target layer can use
to force drivers to complete/fail executing requests.

Patches were made over Jens's block tree's for-next branch.


In general I like the approach, it just looks as if the main aim
(ie running a LUN RESET concurrent with normal I/O on other
devices) is not quite reached.

The general concept of eh_async_device_reset() is quite nice, and
renaming existing functions for doing so is okay, too.

It's just the integration with SCSI EH which is somewhat
deficient (as outlined in the comment on patch 3). For the async
device reset to work we'd need to call it _before_ SCSI EH is
started, ie after the asynchronous command abort failed.


Yes that is my plan.

However, these first patches are only to allow LIO to be able to do
resets. I need the same infrastructure for both though.



The easiest way would be to add per-device reset workqueue item,
  which would be called whenever command abort failed.


If you want to do this without stopping the entire host, you need
the patches like in this set where we stop and flush a queue.


Sure.


As it's being per device we'd be getting an implicit
serialisation, and we could skip the lun reset from EH.


To build on my patches for a new async based scsi eh what we want
to do is:

0. Add eh_async_target_reset callout which works like async device
reset one. For iscsi this maps to iscsi_eh_session_reset. FC
drivers have something similar in the code paths that call
fc_remote_port_delete and the terminate_rport_io paths. We just
need wrappers.


Actually, I was wondering whether we could layer the new async EH
infrastructure besides the original EH.

And the current 'target_reset' is completely wrong. SAM-2 did away
with the TARGET RESET TMF, so it's anyone's guess if a target reset
is actually _implemented_. What we really need, though, is a new
'eh_async_transport_reset' function, which would reset the
_transport_. A transport failure is currently the main (and I'm even
tempted to say the only) reason why EH is invoked.


1. scsi_times_out would kick off abort if needed and return
   BLK_EH_RESET_TIMEOUT.
2. If abort fails, cancel queued aborts and call new async device
   reset callout in these patches.
3. If device reset fails call new async target reset callout.
4. If target reset fails, let fail the block timeout timer and do the
   old style scsi eh host reset.


I would suggest to replace 3. and 4. with:

3. If device reset fails call the new async transport reset callout
4. If transport reset fails fallback to the original SCSI EH (which
would have abort and device reset callouts unset, so it'll start
with a target reset)

That way we keep the existing behaviour (so we don't need to touch
the zillions of SCSI parallel drivers) _and_ will be able to model a
  reasonably modern error handling.
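
The ladder proposed here (abort, then async device reset, then async transport reset, then fall back to the original SCSI EH) can be modelled as plain control flow. This is a sketch of the ordering only, not kernel code: the function and argument names are hypothetical, and each step is reduced to a callable that reports success or failure.

```python
# Illustrative model of the proposed escalation order; all names are
# hypothetical and this does not reflect the actual kernel interfaces.
def escalate(abort, device_reset, transport_reset, legacy_eh):
    """Try each async recovery step in order; stop at the first success.

    Each of the first three arguments is a zero-argument callable that
    returns True on success. Returns the name of the step that resolved
    the timeout, or "legacy_eh" if the original SCSI EH had to take over.
    """
    steps = (
        ("abort", abort),
        ("device_reset", device_reset),
        ("transport_reset", transport_reset),
    )
    for name, step in steps:
        if step():
            return name
    # Every async step failed: hand over to the original SCSI EH, which
    # in the proposal has its abort and device reset callouts unset.
    legacy_eh()
    return "legacy_eh"
```

Because each step runs only after the previous one has failed, a later and more disruptive step is never reached while a cheaper one can still succeed.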


It is really simple for newer drivers/classes like FC and iSCSI
because they handle the device and target/port level reset clean
up already. The difficult (not really difficult but messy) part is
trying to support old and new style EHs in functions like
scsi_times_out and scsi_abort_command.


And indeed, that's the challenge. But your patchset is a step into
the right direction. I'll see if I can make progress with it, although
I'm currently busy doing the next release so it might take some
time.



Recently I've been looking at some issues we are seeing in the field
with customers that have very large storage configurations with lots
and lots of SAS drives. We are seeing scenarios where drive head
failures and other issues are resulting in command aborts that then
ultimately fail and we then quiesce the HBA in order to do the LUN
reset. Since this configuration has hundreds of SAS disks under a
single HBA, that results in a very noticeable I/O service time problem
for all the other disks under that HBA due to one misbehaving drive.
We've so far focused most of our efforts on getting other components in
the stack to behave differently in order to mitigate the issue.
However, that doesn't mean we can't do better in the kernel.

The direction this patch set was headed was to implement async LUN
reset, something we've discussed for years, but never fully
implemented.  Is this something anyone else is still seeing as an issue
for them in other environments? Given that the last attempt at
implementing this, from what I can tell, happened now three years ago
and then stalled, I'm afraid I know the answer, but