RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-05-28 Thread Kashyap Desai
Bjorn/Robin,

Apologies for the delay. Here is one quick suggestion, as we have seen a
similar issue (not identical, but very likely the same root cause) when the
controller is configured as a pass-through device on a VM and the VM
reboots abruptly. In that particular case the driver interacts with
firmware that may require a chip reset to bring the controller back to an
operational state.

The relevant patch was submitted only for older controllers, as the problem
was seen on only a few MegaRAID controllers. The patch below already
attempts a chip reset, but only for a limited set of controllers. I have
attached one more patch which performs a chip reset at driver load time for
Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt
controller, so the attached patch is required.)

http://www.spinics.net/lists/linux-scsi/msg67288.html

Please post the results with the attached patch.

Thanks, Kashyap

> -Original Message-
> From: Bjorn Helgaas [mailto:bhelg...@google.com]
> Sent: Thursday, May 28, 2015 5:55 PM
> To: Robin H. Johnson
> Cc: Adam Radford; Neela Syam Kolli; linux-scsi@vger.kernel.org;
> arkadiusz.bub...@open-e.com; Matthew Garrett; Kashyap Desai; Sumit Saxena;
> Uday Lingala; megaraidlinux@avagotech.com; linux-...@vger.kernel.org;
> linux-ker...@vger.kernel.org; Jean Delvare; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> [+cc Jean, Myron]
>
> Hello megaraid maintainers,
>
> Have you been able to take a look at this at all?  People have been
> reporting this issue since 2012 on upstream, Debian, and Ubuntu, and now
> we're getting reports on SLES.
>
> My theory is that the Linux driver relies on some MegaRAID
> initialization done by the option ROM, and the bug happens when the BIOS
> doesn't execute the option ROM.
>
> If that's correct, you should be able to reproduce it on any system by
> booting Linux (v3.3 or later) without running the MegaRAID SAS 2208
> option ROM (either by enabling a BIOS "fast boot" switch, or modifying
> the BIOS to skip it).  If the Linux driver doesn't rely on the option
> ROM, you might even be able to reproduce it by physically removing the
> option ROM from the MegaRAID.
>
> Bjorn
>
> On Wed, Apr 29, 2015 at 12:28:32PM -0500, Bjorn Helgaas wrote:
> > [+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl]
> >
> > On Sun, Jul 13, 2014 at 01:35:51AM +, Robin H. Johnson wrote:
> > > On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote:
> > > > Thanks for the report, Robin.
> > > >
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the
> > > > problem to 3c076351c402 ("PCI: Rework ASPM disable code"), which
> > > > appeared in v3.3.  For starters, can you verify that, e.g., by
> > > > building
> > > > 69166fbf02c7 (the parent of 3c076351c402) to make sure that it
> > > > works, and building 3c076351c402 itself to make sure it fails?
> > > >
> > > > Assuming that's the case, please attach the complete dmesg and
> > > > "lspci -vvxxx" output for both kernels to the bugzilla.  ASPM is a
> > > > feature that is configured on both ends of a PCIe link, so I want
> > > > to see the lspci info for the whole system, not just the SAS
> > > > adapters.
> > > >
> > > > It's not practical to revert 3c076351c402 now, so I'd also like to
> > > > see the same information for the newest possible kernel (if this
> > > > is possible; I'm not clear on whether you can boot your system or
> > > > not) so we can figure out what needs to be changed.
> > > TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it
> > > fails to start; Commit 3c076351c402 did make it worse, but I think
> > > we're right that the bug lies in the SAS code.
> > >
> > > Ok, I have done more testing on it (40+ boots), and I think we can
> > > show the problem is somewhere in how the BIOS/EFI/ROM brings up the
> > > card in FastBoot mode, or how it leaves the card.
> >
> > I attached your dmesg and lspci logs to
> > https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you!  You did
> > a huge amount of excellent testing and analysis, and I'm sorry that we
> > haven't made progress using the results.
> >
> > I still think this is a megaraid_sas driver bug, but I don't have
> > enough evidence to really point fingers.
> >
> > Based on your testing, before 3c076351c402 ("PCI: Rework ASPM disable
> > code"), megaraid_sas worked reliably.  After 3c076351c402,
> > megaraid_sas does not work

RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-06-30 Thread Kashyap Desai
Jean,

The patch is available in the repo below -

git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git, branch for-next

Commit id - 6431f5d7c6025f8b007af06ea090de308f7e6881

If you share the megaraid_sas driver code of your tree, I can provide a
patch for you.

` Kashyap

> -Original Message-
> From: Jean Delvare [mailto:jdelv...@suse.de]
> Sent: Monday, June 29, 2015 6:55 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett;
> Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org;
> linux-ker...@vger.kernel.org; Myron Stowe
> Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> Thanks for the patch. May I ask what tree it was based on? Linus'
> latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
> trees (based on kernel v3.0 + a bunch of backports, and v3.12
> respectively) but your patch fails to apply in both cases. I'll try
> harder, but I don't know anything about the megaraid_sas code so I really
> don't know where I'm going.
>
> Does your patch depend on any other that may not be present in the SLES
> 11 SP3 and SLES 12 kernels?
>
> Thanks,
> Jean
>
> On Thursday, 28 May 2015 at 19:05 +0530, Kashyap Desai wrote:
> > Bjorn/Robin,
> >
> > Apologies for the delay. Here is one quick suggestion, as we have seen
> > a similar issue (not identical, but very likely the same root cause)
> > when the controller is configured as a pass-through device on a VM and
> > the VM reboots abruptly. In that particular case the driver interacts
> > with firmware that may require a chip reset to bring the controller
> > back to an operational state.
> >
> > The relevant patch was submitted only for older controllers, as the
> > problem was seen on only a few MegaRAID controllers. The patch below
> > already attempts a chip reset, but only for a limited set of
> > controllers. I have attached one more patch which performs a chip reset
> > at driver load time for Thunderbolt/Invader/Fury etc. (In your case you
> > have a Thunderbolt controller, so the attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the results with the attached patch.
> >
> > Thanks, Kashyap
>
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-07-07 Thread Kashyap Desai
> -Original Message-
> From: Jean Delvare [mailto:jdelv...@suse.de]
> Sent: Tuesday, July 07, 2015 2:14 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett;
> Sumit Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org;
> linux-ker...@vger.kernel.org; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote:
> > Bjorn/Robin,
> >
> > Apologies for the delay. Here is one quick suggestion, as we have seen
> > a similar issue (not identical, but very likely the same root cause)
> > when the controller is configured as a pass-through device on a VM and
> > the VM reboots abruptly. In that particular case the driver interacts
> > with firmware that may require a chip reset to bring the controller
> > back to an operational state.
> >
> > The relevant patch was submitted only for older controllers, as the
> > problem was seen on only a few MegaRAID controllers. The patch below
> > already attempts a chip reset, but only for a limited set of
> > controllers. I have attached one more patch which performs a chip reset
> > at driver load time for Thunderbolt/Invader/Fury etc. (In your case you
> > have a Thunderbolt controller, so the attached patch is required.)
> >
> > http://www.spinics.net/lists/linux-scsi/msg67288.html
> >
> > Please post the results with the attached patch.
>
> Good news! The customer tested your patch and said it fixed the problem :-)
>
> I am now in the process of backporting the patch to the SLES 11 SP3
> kernel for further testing. I'll let you know how it goes. Thank you very
> much for your assistance.

Thanks for the confirmation. The patch I sent you is something we added
recently (as part of a common interface approach to do a chip reset at load
time). We will be submitting that patch to mainline soon.

~ Kashyap
>
> --
> Jean Delvare
> SUSE L3 Support


RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]

2015-07-10 Thread Kashyap Desai
>
> I am about to commit the patch that was successfully tested by the
> customer on SLES 12, but I'm a bit confused. The upstream patch you
> referred to is:
>
> https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
> [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel
>
> But the patch I used is the one you sent by e-mail on May 28th. It is
> completely different!
>
> So what am I supposed to do? Use the patch you sent (and that was tested
> by the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing,
> and the proper way of fixing the problem would be to backport the
> upstream commit?

You can use that patch as a valid candidate for upstream submission. One of
the MR maintainers (Sumit Saxena) will send that patch; we are just
organizing the rest of the patch series.
Since SLES has already ported the patch without a commit id, we are fine. I
am just noting that the patch sent via email will be sent upstream very
soon along with another patch set.

Thanks, Kashyap
>
> Please advise,
> --
> Jean Delvare
> SUSE L3 Support


RE: [PATCH 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware

2016-10-17 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:j...@linux.vnet.ibm.com]
> Sent: Monday, October 17, 2016 10:50 PM
> To: Ric Wheeler; Hannes Reinecke; Sumit Saxena; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; the...@redhat.com;
> kashyap.de...@broadcom.com; Christoph Hellwig; Martin K. Petersen; Jeff
> Moyer; Gris Ge; Ewan Milne; Jens Axboe
> Subject: Re: [PATCH 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command
> to firmware
>
> On Mon, 2016-10-17 at 12:27 -0400, Ric Wheeler wrote:
> > On 10/17/2016 12:20 PM, James Bottomley wrote:
> > > On Mon, 2016-10-17 at 09:01 -0400, Ric Wheeler wrote:
> > > > On 10/17/2016 07:34 AM, Hannes Reinecke wrote:
> > > > > On 10/17/2016 12:24 PM, Sumit Saxena wrote:
> > > > > > The megaraid_sas driver returns the SYNCHRONIZE_CACHE command
> > > > > > back to the SCSI layer without sending it to firmware, as
> > > > > > firmware takes care of flushing the cache.  This patch changes
> > > > > > the driver's behavior wrt SYNCHRONIZE_CACHE handling: if the
> > > > > > underlying firmware has support for handling SYNCHRONIZE_CACHE,
> > > > > > the driver sends it to firmware; otherwise it completes it back
> > > > > > to the SCSI layer with SUCCESS immediately.  If firmware handles
> > > > > > SYNCHRONIZE_CACHE for both VD and JBOD, the "canHandleSyncCache"
> > > > > > bit in the scratch pad register (offset 0x00B4) will be set.
> > > > > >
> > > > > > This behavior can be controlled via a module parameter, so the
> > > > > > user can fall back to the old behavior of completing
> > > > > > SYNCHRONIZE_CACHE in the driver without sending it to firmware.
> > > > > >
> > > > > > Signed-off-by: Sumit Saxena 
> > > > > > Signed-off-by: Kashyap Desai 
> > > > > > ---
> > > > > >drivers/scsi/megaraid/megaraid_sas.h|  3 +++
> > > > > >drivers/scsi/megaraid/megaraid_sas_base.c   | 14 ++----
> > > > > >drivers/scsi/megaraid/megaraid_sas_fusion.c |  3 +++
> > > > > >drivers/scsi/megaraid/megaraid_sas_fusion.h |  2 ++
> > > > > >4 files changed, 14 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > b/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > index ca86c88..43fd14f 100644
> > > > > > --- a/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > +++ b/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
> > > > > >#define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT   14
> > > > > >#define MR_MAX_MSIX_REG_ARRAY   16
> > > > > >#define MR_RDPQ_MODE_OFFSET  0X00800000
> > > > > > +#define MR_CAN_HANDLE_SYNC_CACHE_OFFSET0X01000000
> > > > > > +
> > > > > >/*
> > > > > >* register set for both 1068 and 1078 controllers
> > > > > >* structure extended for 1078 registers
> > > > > > @@ -2140,6 +2142,7 @@ struct megasas_instance {
> > > > > > u8 is_imr;
> > > > > > u8 is_rdpq;
> > > > > > bool dev_handle;
> > > > > > +   bool fw_sync_cache_support;
> > > > > >};
> > > > > >struct MR_LD_VF_MAP {
> > > > > > u32 size;
> > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > index 092387f..a4a8e2f 100644
> > > > > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > @@ -104,6 +104,10 @@ unsigned int scmd_timeout = MEGASAS_DEFAULT_CMD_TIMEOUT;
> > > > > >module_param(scmd_timeout, int, S_IRUGO);
> > > > > >MODULE_PARM_DESC(scmd_timeout, "scsi command timeout (10-90s), default 90s. See megasas_reset_timer.");
> > > > > >
> > > > > > +bool block_sync_cache;
> >

Device or HBA QD throttling creates holes in Sequential work load

2016-10-19 Thread Kashyap Desai
Hi,

I am doing some performance tuning in the MR driver to understand how the
sdev queue depth and HBA queue depth play a role in IO submission from the
layer above.

I have 24 JBODs connected to a 12Gb/s MR controller, and I see the
following performance for a 4K sequential workload.

The HBA QD for the MR controller is 4065.
The per-device QD is set to 32.

A queue depth of 256 reports 300K IOPS
A queue depth of 128 reports 330K IOPS
A queue depth of 64 reports 360K IOPS
A queue depth of 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO arrives
at the driver as random IO whenever the queue depth is more than 32.
I have debugged using the scsi logging level and blktrace as well. Below is
a snippet of logs using the scsi logging level. In summary, if the SML does
flow control of IO due to the device QD or HBA QD, the IO coming to the LLD
has a more random pattern.

I see that IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0
3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0
3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0
5b 00 00 01 00 <- After 3c it jumps to 5b; the two sequential streams
overlap, due to sdev QD throttling.
[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0
5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0
3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0
5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0
3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0
3f 00 00 01 00


If scsi_request_fn() breaks due to unavailability of the device queue
(because of the check below), will there be any side effect such as the
one I observe?

    if (!scsi_dev_queue_ready(q, sdev))
            break;

If I reduce the HBA QD and make sure IO from the layer above is throttled
by the HBA QD, there is the same impact. The MR driver uses a host-wide
shared tag map.

Can someone tell me whether this can be made tunable in the LLD through
additional settings, or whether it is expected behavior? The problem I am
facing is that I cannot figure out the optimal device queue depth for
different configurations and workloads.

` Kashyap


Re: Device or HBA QD throttling creates holes in Sequential work load

2016-10-19 Thread Kashyap Desai
Replying to see whether my email reached linux-scsi@vger.kernel.org.


On Thu, Oct 20, 2016 at 12:07 PM, Kashyap Desai
 wrote:
> Hi,
>
> I am doing some performance tuning in MR driver to understand how sdev queue
> depth and hba queue depth play role in IO submission from above layer.
>
>  I have 24 JBOD connected to MR 12GB controller and I can see performance
> for 4K Sequential work load as below.
>
>  HBA QD for MR controller is 4065 and Per device QD is set to 32
>
>
>
> queue depth from  256 reports 300K IOPS
>
> queue depth from  128 reports 330K IOPS
>
> queue depth from  64 reports 360K IOPS
>
> queue depth from  32 reports 510K IOPS
>
>
>
> In MR driver I added debug print and confirm that more IO come to driver as
> random IO whenever I have  queue depth more than 32.
>
> I have debug using scsi logging level and blktrace as well. Below is snippet
> of logs using scsi logging level.  In summary, if SML do flow control of IO
> due to Device QD or HBA QD,  IO coming to LLD is more random pattern.
>
>
>
> I see IO coming to driver is not sequential.
>
>
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b
> 00 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00
> 00 03 c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB:
> Write(10) 2a 00 00 03 c0 5b 00 00 01 00 <- After 3c it jumps to 5b. Sequence
> are overlapped. Due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c
> 00 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00
> 00 03 c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB:
> Write(10) 2a 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy]
> tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd
> 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
>
>
> If scsi_request_fn() breaks due to unavailability of device queue (due to
> below check), will there be any side defect as I observe ?
>
> if (!scsi_dev_queue_ready(q, sdev))
>
>  break;
>
>  If I reduce HBA QD and make sure IO from above layer is throttled due to
> HBA QD, there is a same impact.  MR driver use host wide shared tag map.
>
>  Can someone help me if this can be tunable in LLD providing additional
> settings or it is expected behavior ? Problem I am facing is, I am not able
> to figure out optimal device queue depth for different configuration and
> work load.
>
>  ` Kashyap
>
>
>
>
>
>


Device or HBA QD throttling creates holes in Sequential work load

2016-10-20 Thread Kashyap Desai
[ Apologies if this thread reached you multiple times ]

Hi,

I am doing some performance tuning in the MR driver to understand how the
sdev queue depth and HBA queue depth play a role in IO submission from the
layer above. I have 24 JBODs connected to a 12Gb/s MR controller, and I see
the following performance for a 4K sequential workload.

The HBA QD for the MR controller is 4065; the per-device QD is set to 32.

A queue depth of 256 reports 300K IOPS
A queue depth of 128 reports 330K IOPS
A queue depth of 64 reports 360K IOPS
A queue depth of 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO arrives
at the driver as random IO whenever the queue depth is more than 32.

I have debugged using the scsi logging level and blktrace as well. Below is
a snippet of logs using the scsi logging level. In summary, if the SML does
flow control of IO due to the device QD or HBA QD, the IO coming to the LLD
has a more random pattern.

I see that IO coming to the driver is not sequential.


[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03
c0 3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03
c0 3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03
c0 5b 00 00 01 00 <- After 3c it jumps to 5b; the two sequential streams
overlap, due to sdev QD throttling.
[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03
c0 5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03
c0 3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03
c0 5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03
c0 3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03
c0 3f 00 00 01 00

If scsi_request_fn() breaks due to unavailability of the device queue
(because of the check below), will there be any side effect such as the
one I observe?

    if (!scsi_dev_queue_ready(q, sdev))
            break;

If I reduce the HBA QD and make sure IO from the layer above is throttled
by the HBA QD, there is the same impact. The MR driver uses a host-wide
shared tag map.

Can someone tell me whether this can be made tunable in the LLD through
additional settings, or whether it is expected behavior? The problem I am
facing is that I cannot figure out the optimal device queue depth for
different configurations and workloads.

Thanks, Kashyap


RE: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to avoid double lock

2016-10-20 Thread Kashyap Desai
> -Original Message-
> From: i...@itu.dk [mailto:i...@itu.dk]
> Sent: Monday, October 17, 2016 1:00 PM
> To: Jiri Kosina
> Cc: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley;
> Martin K. Petersen; megaraidlinux@avagotech.com;
> linux-scsi@vger.kernel.org; Iago Abal
> Subject: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list
> lock to avoid double lock
>
> From: Iago Abal 
>
> The EBA code analyzer (https://github.com/models-team/eba) reported the
> following double lock:
>
> 1. In function `megaraid_reset_handler' at 2571;
> 2. take `&adapter->pend_list_lock' for the first time at 2602:
>
>// FIRST
>spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> 3. enter the `list_for_each_entry_safe' loop at 2603;
> 4. call `megaraid_mbox_mm_done' at 2616;
> 5. call `megaraid_mbox_runpendq' at 3782;
> 6. take `&adapter->pend_list_lock' for the second time at 1892:
>
>// SECOND: DOUBLE LOCK !!!
>spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> From my shallow understanding of the code (so please review carefully), I
> think that it is not necessary to hold `PENDING_LIST_LOCK(adapter)' while
> executing the body of the `list_for_each_entry_safe' loop. I assume this
> because both `megaraid_mbox_mm_done' and `megaraid_dealloc_scb' are
> called from several places where, as far as I can tell, this lock is not
> held. In fact, as reported by EBA, at some point `megaraid_mbox_mm_done'
> will acquire this lock again.
>
> Fixes: c005fb4fb2d2 ("[SCSI] megaraid_{mm,mbox}: fix a bug in reset handler")
> Signed-off-by: Iago Abal 
> ---
>  drivers/scsi/megaraid/megaraid_mbox.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_mbox.c
> b/drivers/scsi/megaraid/megaraid_mbox.c
> index f0987f2..7f11898 100644
> --- a/drivers/scsi/megaraid/megaraid_mbox.c
> +++ b/drivers/scsi/megaraid/megaraid_mbox.c
> @@ -2603,6 +2603,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>   list_for_each_entry_safe(scb, tmp, &adapter->pend_list, list) {
>   list_del_init(&scb->list);  // from pending list
>
> + spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);
>   if (scb->sno >= MBOX_MAX_SCSI_CMDS) {
>   con_log(CL_ANN, (KERN_WARNING
>   "megaraid: IOCTL packet with %d[%d:%d] being reset\n",
> @@ -2630,6 +2631,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>
>   megaraid_dealloc_scb(adapter, scb);
>   }
> + spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>   }
>   spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);

Looks correct, but please note that MEGARAID_MAILBOX and MEGARAID_MM are
not supported by LSI/Broadcom. We will get back to you shortly on whether
we can safely remove those two modules.

>
> --
> 1.9.1


Device or HBA level QD throttling creates randomness in sequential workload

2016-10-20 Thread Kashyap Desai
[ Apologies if you receive more than one instance of my email. My
web-based email client has some issues, so I am now trying git send-email. ]

Hi,

I am doing some performance tuning in the MR driver to understand how the
sdev queue depth and HBA queue depth play a role in IO submission from the
layer above. I have 24 JBODs connected to a 12Gb/s MR controller, and I see
the following performance for a 4K sequential workload.

The HBA QD for the MR controller is 4065; the per-device QD is set to 32.

A queue depth of 256 reports 300K IOPS
A queue depth of 128 reports 330K IOPS
A queue depth of 64 reports 360K IOPS
A queue depth of 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO arrives
at the driver as random IO whenever the queue depth is more than 32.

I have debugged using the scsi logging level and blktrace as well. Below is
a snippet of logs using the scsi logging level. In summary, if the SML does
flow control of IO due to the device QD or HBA QD, the IO coming to the LLD
has a more random pattern.

I see that IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 
00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 
00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 
00 01 00 

After LBA "00 03 c0 3c" the next command is for LBA "00 03 c0 5b"; the two
sequential streams overlap, due to sdev QD throttling.

[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 
00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 
00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 
00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 
00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 
00 01 00

If scsi_request_fn() breaks due to unavailability of the device queue
(because of the check below), will there be any side effect such as the
one I observe?

    if (!scsi_dev_queue_ready(q, sdev))
            break;

If I reduce the HBA QD and make sure IO from the layer above is throttled
by the HBA QD, there is the same impact. The MR driver uses a host-wide
shared tag map.

Can someone tell me whether this can be made tunable in the LLD through
additional settings, or whether it is expected behavior? The problem I am
facing is that I cannot figure out the optimal device queue depth for
different configurations and workloads.

Thanks, Kashyap



RE: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware

2016-10-21 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:j...@linux.vnet.ibm.com]
> Sent: Friday, October 21, 2016 4:32 PM
> To: Sumit Saxena; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; the...@redhat.com;
> kashyap.de...@broadcom.com; sta...@vger.kernel.org
> Subject: Re: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE
> command to firmware
>
> On Thu, 2016-10-20 at 02:05 -0700, Sumit Saxena wrote:
> > From previous patch we have below changes in v2 - 1.  Updated change
> > log.  Provided more detail in change log.
> > 2.  Agreed to remove module parameter. If we remove module parameter,
> > we
> > can ask customer to disable WCE on drive to get similar impact.
> > 3.  Always Send SYNCHRONIZE_CACHE  for JBOD (non Raid) Device to
> > Firmware.
> >
> > The current megaraid_sas driver returns the SYNCHRONIZE_CACHE command
> > (related to the drive cache) back to the SCSI layer without sending it
> > down to firmware, as firmware is supposed to take care of flushing the
> > disk cache for all drives belonging to a Virtual Disk at system
> > reboot/shutdown time.
> >
> > We evaluated why the driver should not send the SYNC_CACHE command to
> > firmware for a RAID volume. There may have been some reason in the
> > past, but it no longer appears to be required, and we have fixed this
> > issue as part of this patch.
> >
> > Commit 02b01e0 ("[SCSI] megaraid_sas: return sync cache call with
> > success") added code to the driver back in 2007 to return
> > SYNCHRONIZE_CACHE without sending it to firmware. MR was mainly for
> > Virtual Disks at the time, so the same code remained in place when JBOD
> > support was added, introducing a bug: SYNCHRONIZE_CACHE is not passed
> > to the FW for JBOD (non-RAID) disks.
> >
> > But our recent analysis indicates that, from day one, MR firmware has
> > always expected the driver to forward SYNCHRONIZE_CACHE for JBOD
> > (non-RAID) disks to the firmware. We have fixed this as part of this
> > patch.
> >
> >  - Additional background -
> > There are some instance of MegaRaid FW (E.a Gen2 and Gen2.5 FW) set
> > WCE bit for Virtual Disk but firmware does not reply correct status
> > for SYNCH_CACHE.
> > It is very difficult to find out which Device ID and firmware has
> > capability to manage SYNC_CACHE, so we managed to figure out which are
> > the new firmware (canHandleSyncCache is set in scratch pad register at
> > 0xB4) and use that interface for correct behavior as explained above.
> >
> > E.g Liberator/Thunderbolt MegaRaid FW returns SYNC_CACHE as
> > Unsupported command (this is a firmware bug) and eventually command
> > will be failed to SML.
> > This will cause File system to go Read-only.
> >
> >  - New behavior described -
> >
> > IF application requests SYNCH_CACHE
> >     IF 'JBOD'
> >         Driver sends SYNCH_CACHE command to the FW
> >         FW sends SYNCH_CACHE to drive
> >         FW obtains status from drive and returns same status back to driver
> >     ELSEIF 'VirtualDisk'
> >         IF FW supports the new API bit canHandleSyncCache
> >             Driver sends SYNCH_CACHE command to the FW
> >             FW does not send SYNCH_CACHE to drives
> >             FW returns SUCCESS
> >         ELSE
> >             Driver does not send SYNCH_CACHE command to the FW
> >             Driver returns SUCCESS for that command
> >         ENDIF
> >     ENDIF
> > ENDIF
> >
> > CC: sta...@vger.kernel.org
>
> Can you please split this into two, so we can backport the bug fix without
> any of
> the feature addition junk?

James - I am sending a new patch set splitting this into two logical fixes
as you requested. The JBOD fix will be marked CC: stable, and the Virtual
Disk change I will post only for head (not for stable). Let me know whether
resending the complete series or just this patch is better.

>
> > Signed-off-by: Kashyap Desai 
> > Signed-off-by: Sumit Saxena 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas.h|  3 +++
> >  drivers/scsi/megaraid/megaraid_sas_base.c   | 10 ++
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c |  5 +
> >  3 files changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> > b/drivers/scsi/megaraid/megaraid_sas.h
> > index ca86c88..43fd14f 100644
> > --- a/drivers/scsi/me

RE: Device or HBA level QD throttling creates randomness in sequential workload

2016-10-21 Thread Kashyap Desai
Hi -

I found the below conversation, and it is along the same lines as the input
I wanted from the mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item, as Omar mentioned in the above discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel, if a user
wants to see a limited penalty for sequential workloads on HDD?

` Kashyap

> -Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-scsi@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> [ Apologies if you find more than one instance of my email.
> The web-based email client has some issues, so I am now trying git send-email. ]
>
> Hi,
>
> I am doing some performance tuning in the MR driver to understand how sdev
> queue depth and HBA queue depth play a role in IO submission from the
> upper layer. I have 24 JBODs connected to a MR 12Gb controller and I see
> the following performance for a 4K sequential workload.
>
> HBA QD for the MR controller is 4065 and the per-device QD is set to 32:
>
> queue depth 256 reports 300K IOPS
> queue depth 128 reports 330K IOPS
> queue depth 64 reports 360K IOPS
> queue depth 32 reports 510K IOPS
>
> In the MR driver I added a debug print and confirmed that more IO comes to
> the driver as random IO whenever I have a queue depth of more than 32.
>
> I debugged using the scsi logging level and blktrace as well. Below is a
> snippet of logs using the scsi logging level. In summary, if the SML does
> flow control of IO due to the device QD or HBA QD, the IO coming to the
> LLD has a more random pattern.
>
> I see IO coming to the driver is not sequential.
>
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
>
> After LBA "00 03 c0 3c" the next command is with LBA "00 03 c0 5b".
> The two sequences are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
> If scsi_request_fn() breaks out due to unavailability of the device queue
> (because of the check below), will there be any side effect like the one I
> observe?
> if (!scsi_dev_queue_ready(q, sdev))
>  break;
>
> If I instead reduce the HBA QD so that IO from the upper layer is
> throttled by the HBA QD, the impact is the same.
> The MR driver uses a host-wide shared tag map.
>
> Can someone tell me whether this can be made tunable in the LLD via
> additional settings, or whether it is expected behavior? The problem I am
> facing is that I am not able to figure out an optimal device queue depth
> for different configurations and workloads.
>
> Thanks, Kashyap


[PATCH v3 0/8] megaraid_sas: Updates for scsi-next

2016-10-21 Thread Kashyap Desai
Changes in v3:
Two logical patches created for "Send SYNCHRONIZE_CACHE command to firmware":
- Send SYNCHRONIZE_CACHE for JBOD to firmware
- Send SYNCHRONIZE_CACHE for VD to firmware

Changes in v2:
1.  Removed unconditional msleep and moved calls to atomic_read into 
megasas_wait_for_adapter_operational
2.  Updated change log for patch #4.  Provided more detail in change log.
3.  Agreed to remove module parameter. If we remove module parameter,
we can ask customer to disable WCE on drive to get similar impact.
4.  Always Send SYNCHRONIZE_CACHE for JBOD (non Raid) Device to Firmware.
5. Add log message printing the state of FW sync cache support
6. Moved version update patch to end of series

Sumit Saxena (8):
  megaraid_sas: For SRIOV enabled firmware, ensure VF driver waits for
30secs before reset
  megaraid_sas: Send correct PhysArm to FW for R1 VD downgrade
  megaraid_sas: Do not fire DCMDs during PCI shutdown/detach
  megaraid_sas: Send SYNCHRONIZE_CACHE for JBOD to firmware
  megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware
  MAINTAINERS: Update megaraid maintainers list
  megaraid_sas: Do not set MPI2_TYPE_CUDA for JBOD FP path for FW which
does not support JBOD sequence map
  megaraid_sas: driver version upgrade

 MAINTAINERS | 10 +++---
 drivers/scsi/megaraid/megaraid_sas.h|  7 +++--
 drivers/scsi/megaraid/megaraid_sas_base.c   | 49 -
 drivers/scsi/megaraid/megaraid_sas_fp.c |  6 ++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 23 +-
 5 files changed, 71 insertions(+), 24 deletions(-)

-- 
1.8.3.1



[PATCH v3 2/8] megaraid_sas: Send correct PhysArm to FW for R1 VD downgrade

2016-10-21 Thread Kashyap Desai
This patch fixes the issue of the wrong PhysArm being sent to firmware for
an R1 VD downgrade.

Signed-off-by: Kiran Kumar Kasturi 
Signed-off-by: Sumit Saxena 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Tomas Henzl 
---
 drivers/scsi/megaraid/megaraid_sas_fp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fp.c 
b/drivers/scsi/megaraid/megaraid_sas_fp.c
index e413113..f237d00 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fp.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fp.c
@@ -782,7 +782,8 @@ static u8 mr_spanset_get_phy_params(struct megasas_instance 
*instance, u32 ld,
(raid->regTypeReqOnRead != REGION_TYPE_UNUSED
pRAID_Context->regLockFlags = REGION_TYPE_EXCLUSIVE;
else if (raid->level == 1) {
-   pd = MR_ArPdGet(arRef, physArm + 1, map);
+   physArm = physArm + 1;
+   pd = MR_ArPdGet(arRef, physArm, map);
if (pd != MR_PD_INVALID)
*pDevHandle = MR_PdDevHandleGet(pd, map);
}
@@ -879,7 +880,8 @@ u8 MR_GetPhyParams(struct megasas_instance *instance, u32 
ld, u64 stripRow,
pRAID_Context->regLockFlags = REGION_TYPE_EXCLUSIVE;
else if (raid->level == 1) {
/* Get alternate Pd. */
-   pd = MR_ArPdGet(arRef, physArm + 1, map);
+   physArm = physArm + 1;
+   pd = MR_ArPdGet(arRef, physArm, map);
if (pd != MR_PD_INVALID)
/* Get dev handle from Pd */
*pDevHandle = MR_PdDevHandleGet(pd, map);
-- 
2.4.11



[PATCH v3 3/8] megaraid_sas: Do not fire DCMDs during PCI shutdown/detach

2016-10-21 Thread Kashyap Desai
This patch addresses the issue of the driver firing DCMDs in the PCI
shutdown/detach path irrespective of firmware state.
The driver will now check whether firmware is in an operational state
before firing DCMDs. If firmware is in an unrecoverable state or does not
become operational within the specified time, the driver will skip firing
DCMDs.

Signed-off-by: Sumit Saxena 
Signed-off-by: Shivasharan Srikanteshwara 

---
 drivers/scsi/megaraid/megaraid_sas_base.c   | 39 +
 drivers/scsi/megaraid/megaraid_sas_fusion.c |  9 ---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
b/drivers/scsi/megaraid/megaraid_sas_base.c
index c3efcc7..ba57be6 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -6248,6 +6248,34 @@ fail_reenable_msix:
 #define megasas_resume NULL
 #endif
 
+static inline int
+megasas_wait_for_adapter_operational(struct megasas_instance *instance)
+{
+   int wait_time = MEGASAS_RESET_WAIT_TIME * 2;
+   int i;
+
+   if (atomic_read(&instance->adprecovery) == MEGASAS_HW_CRITICAL_ERROR)
+   return 1;
+
+   for (i = 0; i < wait_time; i++) {
+   if (atomic_read(&instance->adprecovery) == 
MEGASAS_HBA_OPERATIONAL)
+   break;
+
+   if (!(i % MEGASAS_RESET_NOTICE_INTERVAL))
+   dev_notice(&instance->pdev->dev, "waiting for 
controller reset to finish\n");
+
+   msleep(1000);
+   }
+
+   if (atomic_read(&instance->adprecovery) != MEGASAS_HBA_OPERATIONAL) {
+   dev_info(&instance->pdev->dev, "%s timed out while waiting for 
HBA to recover.\n",
+   __func__);
+   return 1;
+   }
+
+   return 0;
+}
+
 /**
  * megasas_detach_one -PCI hot"un"plug entry point
  * @pdev:  PCI device structure
@@ -6272,9 +6300,14 @@ static void megasas_detach_one(struct pci_dev *pdev)
if (instance->fw_crash_state != UNAVAILABLE)
megasas_free_host_crash_buffer(instance);
scsi_remove_host(instance->host);
+
+   if (megasas_wait_for_adapter_operational(instance))
+   goto skip_firing_dcmds;
+
megasas_flush_cache(instance);
megasas_shutdown_controller(instance, MR_DCMD_CTRL_SHUTDOWN);
 
+skip_firing_dcmds:
/* cancel the delayed work if this work still in queue*/
if (instance->ev != NULL) {
struct megasas_aen_event *ev = instance->ev;
@@ -6388,8 +6421,14 @@ static void megasas_shutdown(struct pci_dev *pdev)
struct megasas_instance *instance = pci_get_drvdata(pdev);
 
instance->unload = 1;
+
+   if (megasas_wait_for_adapter_operational(instance))
+   goto skip_firing_dcmds;
+
megasas_flush_cache(instance);
megasas_shutdown_controller(instance, MR_DCMD_CTRL_SHUTDOWN);
+
+skip_firing_dcmds:
instance->instancet->disable_intr(instance);
megasas_destroy_irqs(instance);
 
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0de7de3..ec626fc 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2463,12 +2463,15 @@ irqreturn_t megasas_isr_fusion(int irq, void *devp)
/* Start collecting crash, if DMA bit is done */
if ((fw_state == MFI_STATE_FAULT) && dma_state)
schedule_work(&instance->crash_init);
-   else if (fw_state == MFI_STATE_FAULT)
-   schedule_work(&instance->work_init);
+   else if (fw_state == MFI_STATE_FAULT) {
+   if (instance->unload == 0)
+   schedule_work(&instance->work_init);
+   }
} else if (fw_state == MFI_STATE_FAULT) {
dev_warn(&instance->pdev->dev, "Iop2SysDoorbellInt"
   "for scsi%d\n", instance->host->host_no);
-   schedule_work(&instance->work_init);
+   if (instance->unload == 0)
+   schedule_work(&instance->work_init);
}
}
 
-- 
2.4.11



[PATCH v3 1/8] megaraid_sas: For SRIOV enabled firmware, ensure VF driver waits for 30secs before reset

2016-10-21 Thread Kashyap Desai
For SRIOV enabled firmware, if there is an OCR (online controller reset)
possibility, the driver sets the convert flag to 1; this does not happen if
there are outstanding commands even after 180 seconds.
As the driver does not set the convert flag to 1 but still lets the OCR
run, the VF (virtual function) driver writes directly to the register
instead of waiting for 30 seconds. Setting the convert flag to 1 makes the
VF driver wait for 30 secs before going for reset.

CC: sta...@vger.kernel.org
Signed-off-by: Kiran Kumar Kasturi 
Signed-off-by: Sumit Saxena 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Tomas Henzl 
---
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index ec83754..0de7de3 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2823,6 +2823,7 @@ int megasas_wait_for_outstanding_fusion(struct 
megasas_instance *instance,
dev_err(&instance->pdev->dev, "pending commands remain after 
waiting, "
   "will reset adapter scsi%d.\n",
   instance->host->host_no);
+   *convert = 1;
retval = 1;
}
 out:
-- 
2.4.11



[PATCH v3 4/8] megaraid_sas: Send SYNCHRONIZE_CACHE for non-raid to firmware

2016-10-21 Thread Kashyap Desai
Commit- " 02b01e0 [SCSI] megaraid_sas: return sync cache call with success"
added the code in driver to return SYNCHRONIZE_CACHE without sending it to
firmware back in 2007. Earlier MR was mainly for Virtual Disk,
so same code continue for JBOD as well whenever JBOD support was added and it 
introduced bug that
SYNCHRONIZE_CACHE is not passed to FW for JBOD (non Raid disk).

But our recent analysis indicates that, From Day-1 MR Firmware always
expect Driver to forward SYNCHRONIZE_CACHE for JBOD (non Raid disk) to the
Firmware.
We have fixed this as part of this patch.

CC: sta...@vger.kernel.org
Signed-off-by: Kashyap Desai 
Signed-off-by: Sumit Saxena 
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
b/drivers/scsi/megaraid/megaraid_sas_base.c
index ba57be6..c98d4f9 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host *shost, struct 
scsi_cmnd *scmd)
goto out_done;
}
 
-   switch (scmd->cmnd[0]) {
-   case SYNCHRONIZE_CACHE:
-   /*
-* FW takes care of flush cache on its own
-* No need to send it down
-*/
+   /*
+* FW takes care of flush cache on its own for Virtual Disk.
+* No need to send it down for VD. For JBOD send SYNCHRONIZE_CACHE to 
FW.
+*/
+   if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd)) {
scmd->result = DID_OK << 16;
goto out_done;
-   default:
-   break;
}
 
return instance->instancet->build_and_issue_cmd(instance, scmd);
-- 
2.4.11



[PATCH v3 5/8] megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware

2016-10-21 Thread Kashyap Desai
From the previous patch we have the below changes in v2 (only for Virtual Disk):
1.  Updated change log.  Provided more detail in the change log.
2.  Agreed to remove the module parameter. If we remove the module
parameter, we can ask the customer to disable WCE on the drive to get a
similar impact.
3.  Always send SYNCHRONIZE_CACHE for JBOD (non-RAID) devices to firmware.

The current megaraid_sas driver returns the SYNCHRONIZE_CACHE (related to
drive cache) command back to the SCSI layer without sending it down to
firmware, as firmware is supposed to take care of flushing the disk cache
for all drives belonging to a Virtual Disk at the time of system
reboot/shutdown.

We evaluated and understood why, for a RAID volume, the driver should not
send the SYNC_CACHE command to the firmware.
There may have been some reason in the past, but now it looks to be not
required, and we have fixed this issue as part of this patch.

 - Additional background -
There are some instances of MegaRAID FW (e.g. Gen2 and Gen2.5 FW) that set
the WCE bit for a Virtual Disk but do not reply with correct status for
SYNCH_CACHE.
It is very difficult to find out which device ID and firmware have the
capability to manage SYNC_CACHE, so we managed to figure out which are the
new firmware (canHandleSyncCache is set in the scratch pad register at
0xB4) and use that interface for correct behavior as explained above.

E.g. Liberator/Thunderbolt MegaRAID FW returns SYNC_CACHE as an unsupported
command (this is a firmware bug) and eventually the command will be failed
to the SML.
This will cause the file system to go read-only.

 - New behavior described -

IF application requests SYNCH_CACHE
    IF 'JBOD'
        Driver sends SYNCH_CACHE command to the FW
        FW sends SYNCH_CACHE to drive
        FW obtains status from drive and returns same status back to driver
    ELSEIF 'VirtualDisk'
        IF FW supports the new API bit called canHandleSyncCache
            Driver sends SYNCH_CACHE command to the FW
            FW does not send SYNCH_CACHE to drives
            FW returns SUCCESS
        ELSE
            Driver does not send SYNCH_CACHE command to the FW
            Driver returns SUCCESS for that command
        ENDIF
    ENDIF
ENDIF

Signed-off-by: Kashyap Desai 
Signed-off-by: Sumit Saxena 
---
 drivers/scsi/megaraid/megaraid_sas.h| 3 +++
 drivers/scsi/megaraid/megaraid_sas_base.c   | 7 ++-
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 5 +
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
b/drivers/scsi/megaraid/megaraid_sas.h
index ca86c88..43fd14f 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
 #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT14
 #define MR_MAX_MSIX_REG_ARRAY   16
 #define MR_RDPQ_MODE_OFFSET0X0080
+#define MR_CAN_HANDLE_SYNC_CACHE_OFFSET0X0100
+
 /*
 * register set for both 1068 and 1078 controllers
 * structure extended for 1078 registers
@@ -2140,6 +2142,7 @@ struct megasas_instance {
u8 is_imr;
u8 is_rdpq;
bool dev_handle;
+   bool fw_sync_cache_support;
 };
 struct MR_LD_VF_MAP {
u32 size;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
b/drivers/scsi/megaraid/megaraid_sas_base.c
index c98d4f9..236b8ed 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1700,11 +1700,8 @@ megasas_queue_command(struct Scsi_Host *shost, struct 
scsi_cmnd *scmd)
goto out_done;
}
 
-   /*
-* FW takes care of flush cache on its own for Virtual Disk.
-* No need to send it down for VD. For JBOD send SYNCHRONIZE_CACHE to 
FW.
-*/
-   if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd)) {
+   if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd) &&
+   (!instance->fw_sync_cache_support)) {
scmd->result = DID_OK << 16;
goto out_done;
}
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index ec626fc..0edeeb1 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -748,6 +748,11 @@ megasas_ioc_init_fusion(struct megasas_instance *instance)
goto fail_fw_init;
}
 
+   instance->fw_sync_cache_support = (scratch_pad_2 &
+   MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
+   dev_info(&instance->pdev->dev, "FW supports sync cache\t: %s\n",
+instance->fw_sync_cache_support ? "Yes" : "No");
+
IOCInitMessage =
  dma_alloc_coherent(&instance->pdev-&

[PATCH v3 6/8] MAINTAINERS: Update megaraid maintainers list

2016-10-21 Thread Kashyap Desai
Update MEGARAID drivers maintainers list.

Signed-off-by: Sumit Saxena 
Reviewed-by: Hannes Reinecke 
---
 MAINTAINERS | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f0ee7a6..8b9117f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7612,12 +7612,12 @@ S:  Maintained
 F: drivers/net/wireless/mediatek/mt7601u/
 
 MEGARAID SCSI/SAS DRIVERS
-M: Kashyap Desai 
-M: Sumit Saxena 
-M: Uday Lingala 
-L: megaraidlinux@avagotech.com
+M: Kashyap Desai 
+M: Sumit Saxena 
+M: Shivasharan S 
+L: megaraidlinux@broadcom.com
 L: linux-scsi@vger.kernel.org
-W: http://www.lsi.com
+W: http://www.avagotech.com/support/
 S: Maintained
 F: Documentation/scsi/megaraid.txt
 F: drivers/scsi/megaraid.*
-- 
2.4.11



[PATCH v3 8/8] megaraid_sas: driver version upgrade

2016-10-21 Thread Kashyap Desai
Signed-off-by: Sumit Saxena 
---
 drivers/scsi/megaraid/megaraid_sas.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
b/drivers/scsi/megaraid/megaraid_sas.h
index 43fd14f..74c7b44 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -35,8 +35,8 @@
 /*
  * MegaRAID SAS Driver meta data
  */
-#define MEGASAS_VERSION"06.811.02.00-rc1"
-#define MEGASAS_RELDATE"April 12, 2016"
+#define MEGASAS_VERSION"06.812.07.00-rc1"
+#define MEGASAS_RELDATE"August 22, 2016"
 
 /*
  * Device IDs
-- 
2.4.11



[PATCH v3 7/8] megaraid_sas: Do not set MPI2_TYPE_CUDA for JBOD FP path for FW which does not support JBOD sequence map

2016-10-21 Thread Kashyap Desai
CC: sta...@vger.kernel.org
Signed-off-by: Sumit Saxena 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Tomas Henzl 
---
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c 
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0edeeb1..1cf0e49 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2005,6 +2005,8 @@ megasas_build_syspd_fusion(struct megasas_instance 
*instance,
io_request->DevHandle = pd_sync->seq[pd_index].devHandle;
pRAID_Context->regLockFlags |=

(MR_RL_FLAGS_SEQ_NUM_ENABLE|MR_RL_FLAGS_GRANT_DESTINATION_CUDA);
+   pRAID_Context->Type = MPI2_TYPE_CUDA;
+   pRAID_Context->nseg = 0x1;
} else if (fusion->fast_path_io) {
pRAID_Context->VirtualDiskTgtId = cpu_to_le16(device_id);
pRAID_Context->configSeqNum = 0;
@@ -2040,12 +2042,10 @@ megasas_build_syspd_fusion(struct megasas_instance 
*instance,
pRAID_Context->timeoutValue =
cpu_to_le16((os_timeout_value > timeout_limit) ?
timeout_limit : os_timeout_value);
-   if (fusion->adapter_type == INVADER_SERIES) {
-   pRAID_Context->Type = MPI2_TYPE_CUDA;
-   pRAID_Context->nseg = 0x1;
+   if (fusion->adapter_type == INVADER_SERIES)
io_request->IoFlags |=

cpu_to_le16(MPI25_SAS_DEVICE0_FLAGS_ENABLED_FAST_PATH);
-   }
+
cmd->request_desc->SCSIIO.RequestFlags =
(MPI2_REQ_DESCRIPT_FLAGS_FP_IO <<
MEGASAS_REQ_DESCRIPT_FLAGS_TYPE_SHIFT);
-- 
2.4.11



RE: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware

2016-10-21 Thread Kashyap Desai
> > >
> > > Can you please split this into two, so we can backport the bug fix
> > > without any of the feature addition junk?
> >
> > James  - I am sending new patch making two logical fix as you
> > requested. JBOD fix will be marked for CC:stable and for Virtual disk
> > I will post only for head (not for stable). Let me know if resending
> > complete series is good or resending this patch is better.
>
> This is up to Martin, because he has to sort it out, but I think we can
> cope with
> just the two patches.  The bug fix one has to go into the fixes branch for
> immediate application and the rest of the series will go into the misc
> branch for
> the 4.10 merge window.

I resent the patch set (v3 series). Martin, please use the latest resent v3
for the 4.10 merge.

I need Reviewed-by tags for the below patches, as they were reworked from
the original submission:
[PATCH v3 3/8] megaraid_sas: Do not fire DCMDs during PCI shutdown/detach
[PATCH v3 4/8] megaraid_sas: Send SYNCHRONIZE_CACHE for non-raid to firmware
[PATCH v3 5/8] megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware

>
> James
>


RE: Device or HBA level QD throttling creates randomness in sequential workload

2016-10-24 Thread Kashyap Desai
>
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found below conversation and it is on the same line as I wanted some
> > input from mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in above
discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried building a kernel using this repo, but it looks like it fails to
boot due to some changes in the block layer.

>
> Are you using blk-mq for this disk? If not, then the work there won't
affect you.

YES. I am using blk-mq for this test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and cfq scheduling
works well.

>
> > Is there any workaround/alternative in latest upstream kernel, if user
> > wants to see limited penalty  for Sequential Work load on HDD ?
> >
> > ` Kashyap
> >


RE: Device or HBA level QD throttling creates randomness in sequential workload

2016-10-24 Thread Kashyap Desai
> -Original Message-
> From: Omar Sandoval [mailto:osan...@osandov.com]
> Sent: Monday, October 24, 2016 9:11 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig;
> paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote:
> > >
> > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > > > Hi -
> > > >
> > > > I found below conversation and it is on the same line as I wanted
> > > > some input from mailing list.
> > > >
> > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> > > >
> > > > I can do testing on any WIP item as Omar mentioned in above
> > discussion.
> > > > https://github.com/osandov/linux/tree/blk-mq-iosched
> >
> > I tried building a kernel using this repo, but it looks like it fails
> > to boot due to some changes in the block layer.
>
> Did you build the most up-to-date version of that branch? I've been
force
> pushing to it, so the commit id that you built would be useful.
> What boot failure are you seeing?

Below is the latest commit in the repo.
commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda
Author: Omar Sandoval 
Date:   Tue Sep 20 11:20:03 2016 -0700

[WIP] blk-mq: limit bio queue depth

I have the latest repo from 4.9/scsi-next maintained by Martin, which boots
fine. The only delta is that "CONFIG_SBITMAP" is enabled in the WIP
blk-mq-iosched branch. I could not see any meaningful data on the boot
hang, so I am going to try one more time tomorrow.


>
> > >
> > > Are you using blk-mq for this disk? If not, then the work there
> > > won't
> > affect you.
> >
> > YES. I am using blk-mq for my test. I also confirm if use_blk_mq is
> > disable, Sequential work load issue is not seen and  scheduling
> > works well.
>
> Ah, okay, perfect. Can you send the fio job file you're using? Hard to
> tell exactly what's going on without the details. A sequential workload
> with just one submitter is about as easy as it gets, so this _should_ be
> behaving nicely.



; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa
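The elided [job2]..[job23] sections follow the same two-line pattern; a small script can generate them. The device list here is a placeholder, since only /dev/sdd and /dev/sdaa are given above:

```shell
#!/bin/sh
# Emit per-device fio job sections matching the [global] section above.
# DEVICES is a placeholder; the real run used 24 JBODs (sdd..sdaa).
DEVICES="sdd sde sdf"
i=1
for dev in $DEVICES; do
    printf '[job%d]\nfilename=/dev/%s\n\n' "$i" "$dev"
    i=$((i + 1))
done
```

Appending this output to the [global] section reproduces a 24-way version of the job file once the real device names are substituted.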

When I tune /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
ioscheduler detail (it is in blk-mq mode):
/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the ioscheduler
picked by the SML is cfq:
/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

I see that blk-mq performance is very low for the sequential write
workload, and I confirm that blk-mq converts the sequential workload into a
random stream due to the io-scheduler change in blk-mq vs the legacy block
layer.

>
> > >
> > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > user wants to see limited penalty  for Sequential Work load on HDD
?
> > > >
> > > > ` Kashyap
> > > >
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to mark just about everything I get from Broadcom as spam due to
> failed DMARC.
>
> --
> Omar


RE: Device or HBA level QD throttling creates randomness in sequential workload

2016-10-31 Thread Kashyap Desai
Jens- Replied inline.


Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I am looking for scsi_mod.use_blk_mq=Y.

Also, below is a snippet of blktrace output. In the case of higher
per-device QD, I see requeue requests in the blktrace.

65,128 10 6268 2.432404509 18594  P   N [fio]
 65,128 10 6269 2.432405013 18594  U   N [fio] 1
 65,128 10 6270 2.432405143 18594  I  WS 148800 + 8 [fio]
 65,128 10 6271 2.432405740 18594  R  WS 148800 + 8 [0]
 65,128 10 6272 2.432409794 18594  Q  WS 148808 + 8 [fio]
 65,128 10 6273 2.432410234 18594  G  WS 148808 + 8 [fio]
 65,128 10 6274 2.432410424 18594  S  WS 148808 + 8 [fio]
 65,128 23 3626 2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
 65,128 22 3279 2.432973482 0  C  WS 147432 + 8 [0]
 65,128  7 6126 2.433032637 18594  P   N [fio]
 65,128  7 6127 2.433033204 18594  U   N [fio] 1
 65,128  7 6128 2.433033346 18594  I  WS 148808 + 8 [fio]
 65,128  7 6129 2.433033871 18594  D  WS 148808 + 8 [fio]
 65,128  7 6130 2.433034559 18594  R  WS 148808 + 8 [0]
 65,128  7 6131 2.433039796 18594  Q  WS 148816 + 8 [fio]
 65,128  7 6132 2.433040206 18594  G  WS 148816 + 8 [fio]
 65,128  7 6133 2.433040351 18594  S  WS 148816 + 8 [fio]
 65,128  9 6392 2.433133729 0  C  WS 147240 + 8 [0]
 65,128  9 6393 2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
 65,128  7 6134 2.433167450 18594  P   N [fio]
 65,128  7 6135 2.433167911 18594  U   N [fio] 1
 65,128  7 6136 2.433168074 18594  I  WS 148816 + 8 [fio]
 65,128  7 6137 2.433168492 18594  D  WS 148816 + 8 [fio]
 65,128  7 6138 2.433174016 18594  Q  WS 148824 + 8 [fio]
 65,128  7 6139 2.433174282 18594  G  WS 148824 + 8 [fio]
 65,128  7 6140 2.433174613 18594  S  WS 148824 + 8 [fio]
CPU0 (sdy):
 Reads Queued:   0,0KiB  Writes Queued:  79,
316KiB
 Read Dispatches:0,0KiB  Write Dispatches:   67,
18,446,744,073PiB
 Reads Requeued: 0   Writes Requeued:86
 Reads Completed:0,0KiB  Writes Completed:   98,
392KiB
 Read Merges:0,0KiB  Write Merges:0,
0KiB
 Read depth: 0   Write depth: 5
 IO unplugs:79   Timer unplugs:   0



` Kashyap

> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-
> bl...@vger.kernel.org; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but rather
> the
> fact that we don't really guarantee any sort of hard FIFO behavior between
> the
> software queues. Can you try this test patch to see if it changes the
> behavior for
> you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern for the
expected sequential run. I am intentionally running the requeue case and
seeing the issue at the time of requeue.
If there is no requeue, I see no issue at the LLD.


>
> diff --git a/block/blk-mq.c b/block/blk-mq.c index
> f3d27a6dee09..5404ca9c71b2
> 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned
> int
> queued)
>   return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>   }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head
> +*b) {
> + struct request *rqa = container_of(a, struct request, queuelist);
> + struct request *rqb = container_of(b, struct request, queuelist);
> +
> + return blk_rq_pos(rqa) < blk_rq_pos(rqb); }
> +
>   /*
>* Run this hardware queue, pulling any software queues mapped to it in.
>* Note that this function currently has various problems around
> ordering @@ -
> 812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx
> *hctx)
>   }
>
>   /*
> +  * If the device is rotational, sort the list sanely to avoid
> +  * unecessary seeks. The software queues are roughly FIFO, but
> +  * only roughly, there are no hard guarantees.
> +  */
> + if (!blk_queue_nonrot(q))
> + list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> + /*
>* Start off with dptr being NULL, so we start the first request
>* immediately, even if we have more pending.
>*/
>
> --
> Jens Axboe
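One note on the comparator above: the kernel's list_sort() follows the conventional contract where a positive return means "a sorts after b", so a three-way (-1/0/+1) result is the unambiguous way to get the ascending-by-sector order the patch intends. A minimal userspace sketch of that ordering (standalone code using qsort(), not the kernel patch itself):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of the ascending start-sector ordering the patch
 * aims for.  A three-way (-1/0/+1) result is the comparator contract
 * shared by qsort() and the kernel's list_sort(). */
static int pos_cmp(const void *a, const void *b)
{
	unsigned long long pa = *(const unsigned long long *)a;
	unsigned long long pb = *(const unsigned long long *)b;

	return (pa > pb) - (pa < pb);	/* -1, 0, or +1 */
}

/* Sort a few start sectors taken from the blktrace above and report
 * whether they came out lowest-first. */
static int demo_sorted_ascending(void)
{
	unsigned long long s[] = { 148816, 148800, 148824, 148808 };

	qsort(s, 4, sizeof(s[0]), pos_cmp);
	return s[0] == 148800 && s[1] == 148808 &&
	       s[2] == 148816 && s[3] == 148824;
}
```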


RE: [REGRESSION] 4.9-rc4+ doesn't boot on my test box

2016-11-08 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, November 09, 2016 4:45 AM
> To: Jens Axboe
> Cc: linux-scsi; Kashyap Desai; Martin K. Petersen
> Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
>
> >>>>> "Jens" == Jens Axboe  writes:
>
> Jens> I wasted half a day on this, thinking it was something in my
> Jens> 4.10 branches. But it turns out it is not, the regression is in
> Jens> mainline.

Jens -

Sorry for the trouble. I did not validate this single patch; I validated
the complete patch set.
The issue is improper usage of the MEGASAS_IS_LOGICAL macro, which gives
an incorrect check condition in the qcmd path.

Below is proposed fix.


diff --git a/drivers/scsi/megaraid/megaraid_sas.h
b/drivers/scsi/megaraid/megaraid_sas.h
index 74c7b44..0d2625b 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2236,7 +2236,7 @@ struct megasas_instance_template {
 };

 #define MEGASAS_IS_LOGICAL(scp)
\
-   (scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1
+   ((scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1)

 #define MEGASAS_DEV_INDEX(scp) \
(((scp->device->channel % 2) * MEGASAS_MAX_DEV_PER_CHANNEL) +   \


>
> Kashyap, have you tested the stable fix without the remainder of the
driver
> update in place?

Martin - I validated the whole series. Apologies for this.

Please help me to know how to fix this. Do I need to send only the fix
(as posted above - the macro definition) on top of the latest commit for
this issue?

>
> --
> Martin K. PetersenOracle Linux Engineering


RE: [REGRESSION] 4.9-rc4+ doesn't boot on my test box

2016-11-08 Thread Kashyap Desai
> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Wednesday, November 09, 2016 6:50 AM
> To: Kashyap Desai; Martin K. Petersen
> Cc: linux-scsi
> Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
>
> On 11/08/2016 05:20 PM, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> >> Sent: Wednesday, November 09, 2016 4:45 AM
> >> To: Jens Axboe
> >> Cc: linux-scsi; Kashyap Desai; Martin K. Petersen
> >> Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
> >>
> >>>>>>> "Jens" == Jens Axboe  writes:
> >>
> >> Jens> I wasted half a day on this, thinking it was something in my
> >> Jens> 4.10 branches. But it turns out it is not, the regression is in
> >> Jens> mainline.
> >
> > Jens -
> >
> > Sorry for trouble.  I did not validated this single patch. I validated
> > complete patch set.
> > Issue is improper MACRO usage MEGASAS_IS_LOGICAL, which gives
> > incorrect check condition in qcmd Path.
> >
> > Below is proposed fix.
> >
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> > b/drivers/scsi/megaraid/megaraid_sas.h
> > index 74c7b44..0d2625b 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas.h
> > +++ b/drivers/scsi/megaraid/megaraid_sas.h
> > @@ -2236,7 +2236,7 @@ struct megasas_instance_template {  };
> >
> >  #define MEGASAS_IS_LOGICAL(scp)
> > \
> > -   (scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1
>
> Ugh... So we're completing everything immediately.
>
> > Martin -  I validated whole series.  Apologies for this.
> >
> > Please help me to know how to fix this ? Do I need to send only fix on
> > top of latest commit (as posted above - MACRO definition) for this issue
> > ?
>
> Send a fix on top of current -git asap. The current tree is completely
> broken for
> any megaraid user. -rc4 is no time to send in untested patches, especially
> not
> something that claims to fix a 9 year old bug and is marked for stable as
> well.


I will run some more tests (using only the patch set marked for stable
from the last series) and submit a fix ASAP.

>
> --
> Jens Axboe


RE: Delivery Status Notification for linuxr...@lsi.com

2016-11-10 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Paul Menzel
> Sent: Thursday, November 10, 2016 7:40 PM
> To: Martin K. Petersen
> Cc: linux-scsi@vger.kernel.org; James E.J. Bottomley
> Subject: Re: Delivery Status Notification for linuxr...@lsi.com
>
> Dear Martin,
>
>
> On 11/10/16 15:07, Martin K. Petersen wrote:
> >> "Paul" == Paul Menzel  writes:
>
> > Paul> Probably you know it already, but the listed email address of
> > Paul> the 3WARE SCSI drivers maintainer linuxr...@lsi.com doesn’t work
> > Paul> (for me).
> >
> > Ownership of these products is now with Broadcom. To my knowledge the
> > 3ware product lines have been discontinued.
>
> Indeed. I forgot to actually formulate the intend of my message.
>
> What should happen to the entry in the file `MAINTAINERS`?
The 3ware product is discontinued from a support point of view. We are
discussing with the LSI/Broadcom management team to clean up this part
and provide more clarity.
Please wait for a reply, as we are waiting on internal communication.
>
>
> Kind regards,
>
> Paul


RE: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to avoid double lock

2016-11-10 Thread Kashyap Desai
> -Original Message-
> From: i...@itu.dk [mailto:i...@itu.dk]
> Sent: Monday, October 17, 2016 1:00 PM
> To: Jiri Kosina
> Cc: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley;
Martin K.
> Petersen; megaraidlinux@avagotech.com; linux-scsi@vger.kernel.org;
Iago
> Abal
> Subject: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list
lock to
> avoid double lock
>
> From: Iago Abal 
>
> The EBA code analyzer (https://github.com/models-team/eba) reported the
> following double lock:
>
> 1. In function `megaraid_reset_handler' at 2571;
> 2. take `&adapter->pend_list_lock' for the first time at 2602:
>
>// FIRST
>spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> 3. enter the `list_for_each_entry_safe' loop at 2603;
> 4. call `megaraid_mbox_mm_done' at 2616;
> 5. call `megaraid_mbox_runpendq' at 3782;
> 6. take `&adapter->pend_list_lock' for the second time at 1892:
>
>// SECOND: DOUBLE LOCK !!!
>spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> From my shallow understanding of the code (so please review carefully),
> I think that it is not necessary to hold `PENDING_LIST_LOCK(adapter)'
> while executing the body of the `list_for_each_entry_safe' loop. I
> assume this because both `megaraid_mbox_mm_done' and
> `megaraid_dealloc_scb' are called from several places where, as far as
> I can tell, this lock is not held. In fact, as reported by EBA, at some
> point `megaraid_mbox_mm_done' will acquire this lock again.
>
> Fixes: c005fb4fb2d2 ("[SCSI] megaraid_{mm,mbox}: fix a bug in reset
handler")
> Signed-off-by: Iago Abal 
> ---
>  drivers/scsi/megaraid/megaraid_mbox.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_mbox.c
> b/drivers/scsi/megaraid/megaraid_mbox.c
> index f0987f2..7f11898 100644
> --- a/drivers/scsi/megaraid/megaraid_mbox.c
> +++ b/drivers/scsi/megaraid/megaraid_mbox.c
> @@ -2603,6 +2603,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>   list_for_each_entry_safe(scb, tmp, &adapter->pend_list, list) {
>   list_del_init(&scb->list);  // from pending list
>
> + spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);
>   if (scb->sno >= MBOX_MAX_SCSI_CMDS) {
>   con_log(CL_ANN, (KERN_WARNING
>   "megaraid: IOCTL packet with %d[%d:%d] being
> reset\n", @@ -2630,6 +2631,7 @@ static
> DEF_SCSI_QCMD(megaraid_queue_command)
>
>   megaraid_dealloc_scb(adapter, scb);
>   }
> + spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>   }
>   spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);

Sorry for the delay. We had an internal discussion and confirmed that it
is safe to remove the mbox driver from mainline, as this product is
discontinued. We are planning to post a patch to remove the megaraid
mbox driver from mainline.

>
> --
> 1.9.1


RE: Reduced latency is killing performance

2016-11-11 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Jens Axboe
> Sent: Thursday, November 10, 2016 10:18 PM
> To: Hannes Reinecke; Christoph Hellwig
> Cc: SCSI Mailing List; linux-bl...@vger.kernel.org
> Subject: Re: Reduced latency is killing performance
>
> On 11/10/2016 09:04 AM, Hannes Reinecke wrote:
> > Hi all,
> >
> > this really feels like a follow-up to the discussion we've had in
> > Santa Fe, but finally I'm able to substantiate it with some numbers.
> >
> > I've made a patch to enable the megaraid_sas driver for multiqueue.
> > While this is pretty straightforward (I'll be sending the patchset
> > later on), the results are ... interesting.
> >
> > I've run the 'ssd-test.fio' script from Jens' repository, and these
> > results for MQ/SQ (- is mq, + is sq):
> >
> >  Run status group 0 (all jobs):
> > -   READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s,
> > maxb=181503KB/s, mint=60033msec, maxt=60033msec
> > +   READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s,
> > maxb=312572KB/s, mint=60181msec, maxt=60181msec
> >
> >  Run status group 1 (all jobs):
> > -   READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s,
> > mint=60443msec, maxt=60443msec
> > +   READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s,
> > mint=60182msec, maxt=60182msec
> >
> >  Run status group 2 (all jobs):
> > -  WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s,
> > maxb=383729KB/s, mint=60001msec, maxt=60001msec
> > +  WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s,
> > maxb=807581KB/s, mint=60129msec, maxt=60129msec
> >
> >  Run status group 3 (all jobs):
> > -  WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s,
> > mint=60399msec, maxt=60399msec
> > +  WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s,
> > mint=60207msec, maxt=60207msec
> >
> >  Disk stats (read/write):
> > -  sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836,
> > in_queue=135345876, util=99.20%
> > +  sda: ios=205278/2680329, merge=4552593/9580622,
> > ticks=12539912/12965228, in_queue=25512312, util=99.59%
> >
> > As you can see, we're really losing performance in the multiqueue case.
> > And the main reason for that is that we submit about _10 times_ as
> > much I/O as we do for the single-queue case.
>
> What's the setup like? I'm going to need more details.
>
> The baseline test is using the legacy path, single queue. The new test is
> multiqueue, scsi-mq. What's sda?

Hannes -

Please share the setup/config details so that I can also validate and
post my findings.

` Kashyap

>
> > So I guess having an I/O scheduler is critical, even for the scsi-mq
> > case.
>
> Each of these sections is a single job. For some reason we are not merging
> as
> well as we should, that's the reason for the performance loss. In fact,
> we're not
> merging at all. That's not IO scheduling.
>
> --
> Jens Axboe
>


RE: [GIT PULL] SCSI fixes for 4.9-rc3

2016-11-11 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Gabriel C
> Sent: Friday, November 11, 2016 9:40 AM
> To: James Bottomley; Andrew Morton; Linus Torvalds
> Cc: linux-scsi; linux-kernel; sta...@vger.kernel.org
> Subject: Re: [GIT PULL] SCSI fixes for 4.9-rc3
>
>
>
> On 11.11.2016 04:30, Gabriel C wrote:
> >
> > On 05.11.2016 14:29, James Bottomley wrote:
> >
> >
> > ...
> >
> >> Kashyap Desai (1):
> >>   scsi: megaraid_sas: Fix data integrity failure for JBOD
> >> (passthrough) devices
> >>
> >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> index 9ff57de..d8b1fbd 100644
> >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> >> @@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host
> *shost, struct scsi_cmnd *scmd)
> >>goto out_done;
> >>}
> >>
> >> -  switch (scmd->cmnd[0]) {
> >> -  case SYNCHRONIZE_CACHE:
> >> -  /*
> >> -   * FW takes care of flush cache on its own
> >> -   * No need to send it down
> >> -   */
> >> +  /*
> >> +   * FW takes care of flush cache on its own for Virtual Disk.
> >> +   * No need to send it down for VD. For JBOD send
> SYNCHRONIZE_CACHE to FW.
> >> +   */
> >> +  if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) &&
> >> +MEGASAS_IS_LOGICAL(scmd)) {
> >>scmd->result = DID_OK << 16;
> >>goto out_done;
> >> -  default:
> >> -  break;
> >>}
> >>
> >>return instance->instancet->build_and_issue_cmd(instance, scmd);
> >
> > This patch breaks my box.. I'm not able to boot it anymore.
> > It seems with this patch I have /dev/sda[a-z] to /dev/sdz[a-z] ?!?
> >
> > I'm not sure how to get a log since dracut times out and I'm
> > dropped, after a very long time of probing 'ghost devices', in an
> > emergency shell; journalctl doesn't work either...
> >
> > After reverting this one I can boot normal.
> >
> > Box is a FUJITSU PRIMERGY TX200 S5..

Please check the latest commit; the commit below has the complete fix.

http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=5e5ec1759dd663a1d5a2f10930224dd009e500e8


> >
> > This is from an working kernel..
> >
> > [5.119371] megaraid_sas :01:00.0: FW now in Ready state
> > [5.119418] megaraid_sas :01:00.0: firmware supports msix
> > : (0)
> > [5.119420] megaraid_sas :01:00.0: current msix/online cpus
> > : (1/16)
> > [5.119422] megaraid_sas :01:00.0: RDPQ mode : (disabled)
> > [5.123100] ehci-pci :00:1a.7: cache line size of 32 is not
> > supported
> > [5.123113] ehci-pci :00:1a.7: irq 18, io mem 0xb002
> >
> > ...
> >
> > [5.208063] megaraid_sas :01:00.0: controller type   :
> > MR(256MB)
> > [5.208065] megaraid_sas :01:00.0: Online Controller Reset(OCR)
> > :
> Enabled
> > [5.208067] megaraid_sas :01:00.0: Secure JBOD support   : No
> > [5.208070] megaraid_sas :01:00.0: megasas_init_mfi:
> fw_support_ieee=0
> > [5.208073] megaraid_sas :01:00.0: INIT adapter done
> > [5.208075] megaraid_sas :01:00.0: Jbod map is not supported
> megasas_setup_jbod_map 4967
> > [5.230163] megaraid_sas :01:00.0: MR_DCMD_PD_LIST_QUERY
> failed/not supported by firmware
> > [5.252080] megaraid_sas :01:00.0: DCMD not supported by
> > firmware -
> megasas_ld_list_query 4369
> > [5.274086] megaraid_sas :01:00.0: pci id:
> (0x1000)/(0x0060)/(0x1734)/(0x10f9)
> > [5.274089] megaraid_sas :01:00.0: unevenspan support: no
> > [5.274090] megaraid_sas :01:00.0: firmware crash dump   : no
> > [5.274092] megaraid_sas :01:00.0: jbod sync map : no
> > [5.274094] scsi host0: Avago SAS based MegaRAID driver
> > [5.280022] scsi 0:0:6:0: Direct-Access ATA  WDC WD5002ABYS-5
> > 3B06
> PQ: 0 ANSI: 5
> > [5.282153] scsi 0:0:7:0: Direct-Access ATA  WDC WD5002ABYS-5
> > 3B06
> PQ: 0 ANSI: 5
> > [5.285180] scsi 0:0:10:0: Direct-Access ATA  ST500NM0011
> > FTM6 PQ:
> 0 ANSI: 5
> > [5.369885] scsi 0:2:0:0: Direct-Access LSI  MegaRAID SAS RMB
> > 1.40 PQ:
> 0 ANSI: 5
> >
> > ..
> >
> > Please let me know if you need more infos and/or want me to test
> > patches.
> >
> >
>
> I managed to get some parts of the broken dmesg. There it is :
>
> http://ftp.frugalware.org/pub/other/people/crazy/kernel/broken-dmesg
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of
> a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] hpsa: scsi-mq support

2016-11-14 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Sunday, November 13, 2016 5:29 PM
> To: Hannes Reinecke
> Cc: Christoph Hellwig; Hannes Reinecke; Christoph Hellwig; Martin K.
Petersen;
> James Bottomley; Don Brace; linux-scsi@vger.kernel.org; Jens Axboe
> Subject: Re: [PATCH] hpsa: scsi-mq support
>
> On Sun, Nov 13, 2016 at 10:44:47AM +0100, Hannes Reinecke wrote:
> > One day to mark with bright red in the calendar.
> >
> > Christoph Hellwig is telling me _NOT_ to use scsi-mq.
>
> That's not what I'm doing.
>
> > This patch was done so see what would needed to be done to convert a
> > legacy driver.
> > As I was under the impression that scsi-mq is the way forward, seeing
> > that it should be enabled per default.
> > But I must have been mistaken. Apparently.
>
> What I am doing is to tell you you should not expose multiple queues
> unless the hardware actually has multiple submission queues.  The
> blk-mq and scsi-mq code works just fine with a single submission
> queue, and the hpsa driver in particular works really well with
> scsi-mq and a single submission queue.  E.g. the RAID HBA on slide 19
> of this presentation is an hpsa one:

I have similar results for MegaRAID, where the MR driver shows a
significant improvement with a single submission queue and multiple
completion queues. Having said that, scsi-mq enabled with a single queue
is more than enough to maximize the scsi-mq improvement.

The major advantage was seen when the IO load crossed the boundary of a
physical CPU socket.

From this discussion I understood that similar logical changes were
proposed for megaraid_sas, and we are not really going to gain from fake
multiple submission queues exposed to the SML.

Kashyap

>
> http://events.linuxfoundation.org/sites/events/files/slides/scsi.pdf


RE: [PATCH 4/5] megaraid_sas: scsi-mq support

2016-11-14 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Friday, November 11, 2016 3:15 PM
> To: Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> s...@vger.kernel.org; Hannes Reinecke; Hannes Reinecke
> Subject: [PATCH 4/5] megaraid_sas: scsi-mq support
>
> This patch enables full scsi multiqueue support. But as I have been
> seeing performance regressions I've also added a module parameter
> 'use_blk_mq', allowing multiqueue to be switched off if required.

Hannes - As discussed on the hpsa driver thread about similar support,
I don't think these code changes help MR either.

>
> Signed-off-by: Hannes Reinecke 
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 22 +
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 38
+
> 
>  2 files changed, 50 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index d580406..1af47e6 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -48,6 +48,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -104,6 +105,9 @@
>  module_param(scmd_timeout, int, S_IRUGO);
> MODULE_PARM_DESC(scmd_timeout, "scsi command timeout (10-90s), default
> 90s. See megasas_reset_timer.");
>
> +bool use_blk_mq = true;
> +module_param_named(use_blk_mq, use_blk_mq, bool, 0);
> +
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(MEGASAS_VERSION);
>  MODULE_AUTHOR("megaraidlinux@avagotech.com");
> @@ -1882,6 +1886,17 @@ static void megasas_slave_destroy(struct
> scsi_device *sdev)
>   sdev->hostdata = NULL;
>  }
>
> +static int megasas_map_queues(struct Scsi_Host *shost) {
> + struct megasas_instance *instance = (struct megasas_instance *)
> + shost->hostdata;
> +
> + if (!instance->ctrl_context)
> + return 0;
> +
> + return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev, 0);
}
> +
>  /*
>  * megasas_complete_outstanding_ioctls - Complete outstanding ioctls
after a
>  *   kill adapter
> @@ -2986,6 +3001,7 @@ struct device_attribute *megaraid_host_attrs[] = {
>   .slave_configure = megasas_slave_configure,
>   .slave_alloc = megasas_slave_alloc,
>   .slave_destroy = megasas_slave_destroy,
> + .map_queues = megasas_map_queues,
>   .queuecommand = megasas_queue_command,
>   .eh_target_reset_handler = megasas_reset_target,
>   .eh_abort_handler = megasas_task_abort, @@ -5610,6 +5626,12 @@
> static int megasas_io_attach(struct megasas_instance *instance)
>   host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
>   host->max_lun = MEGASAS_MAX_LUN;
>   host->max_cmd_len = 16;
> + host->nr_hw_queues = instance->msix_vectors ?
> + instance->msix_vectors : 1;
> + if (use_blk_mq) {
> + host->can_queue = instance->max_scsi_cmds / host-
> >nr_hw_queues;
> + host->use_blk_mq = 1;
> + }
>
>   /*
>* Notify the mid-layer about the new controller diff --git
> a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index eb3cb0f..aba53c0 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -1703,6 +1703,7 @@ static int megasas_create_sg_sense_fusion(struct
> megasas_instance *instance)
>   struct IO_REQUEST_INFO io_info;
>   struct fusion_context *fusion;
>   struct MR_DRV_RAID_MAP_ALL *local_map_ptr;
> + u16 msix_index;
>   u8 *raidLUN;
>
>   device_id = MEGASAS_DEV_INDEX(scp);
> @@ -1792,11 +1793,13 @@ static int megasas_create_sg_sense_fusion(struct
> megasas_instance *instance)
>   fp_possible = io_info.fpOkForIo;
>   }
>
> - /* Use raw_smp_processor_id() for now until cmd->request->cpu is
CPU
> -id by default, not CPU group id, otherwise all MSI-X queues
won't
> -be utilized */
> - cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ?
> - raw_smp_processor_id() % instance->msix_vectors : 0;
> + if (shost_use_blk_mq(instance->host)) {
> + u32 blk_tag = blk_mq_unique_tag(scp->request);
> + msix_index = blk_mq_unique_tag_to_hwq(blk_tag);
> + } else
> + msix_index = instance->msix_vectors ?
> + raw_smp_processor_id() % instance->msix_vectors :
0;
> + cmd->request_desc->SCSIIO.MSIxIndex = msix_index;
>
>   if (fp_possible) {
>   megasas_set_pd_lba(io_request, scp->cmd_len, &io_info,
scp,
> @@ -1971,6 +1974,7 @@ static void megasas_build_ld_nonrw_fusion(struct
> megasas_instance *instance,
>   struct RAID_CONTEXT *pRAID_Context;
>   struct MR_PD_CFG_SEQ_NUM_SYNC *pd_sync;
>   struct fusion_context 

RE: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes

2016-11-18 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Tomas Henzl
> Sent: Friday, November 18, 2016 9:23 PM
> To: Hannes Reinecke; Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> s...@vger.kernel.org; Hannes Reinecke
> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after register
writes
>
> On 11.11.2016 10:44, Hannes Reinecke wrote:
> > The megaraid_sas HBA only has a single register for I/O submission,
> > which will be hit pretty hard with scsi-mq. To ensure that the PCI
> > writes have made it across we need to add a mmio barrier after each
> > write; otherwise I've been seeing spurious command completions and I/O
> > stalls.
>
> Why is it needed that the PCI write reaches the hw exactly at this
> point? Is it possible that this is a hw deficiency, like the hw can't
> handle communication without tiny pauses, and so possible to remove in
> the next generation?
> Thanks,
> Tomas

I think it is good to have mmiowb() here, as we are already doing it for
the writel() version of megasas_return_cmd_fusion().
Maybe not on x86, but on some other CPU architectures it is useful. I
think the issue became more evident once Hannes' patch exposed scsi-mq
support with more than one submission queue.

Probably this patch is good. An intermediate PCI device (a PCI bridge?)
may cache PCI packets, and an out-of-order PCI packet eventually
reaching the MegaRAID HBA can cause this type of spurious completion.

>
> >
> > Signed-off-by: Hannes Reinecke 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > index aba53c0..729a654 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > @@ -196,6 +196,7 @@ inline void megasas_return_cmd_fusion(struct
> megasas_instance *instance,
> > le32_to_cpu(req_desc->u.low));
> >
> > writeq(req_data, &instance->reg_set->inbound_low_queue_port);
> > +   mmiowb();
> >  #else
> > unsigned long flags;
> >
>
>


RE: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"

2016-11-29 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, November 30, 2016 12:50 AM
> To: Colin King; Kashyap Desai; Sumit Saxena; Shivasharan S; James E . J .
> Bottomley; Martin K . Petersen; megaraidlinux@broadcom.com; linux-
> s...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Subject: Re: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of
> "outstanding"
>
> On 11/29/2016 11:13 AM, Colin King wrote:
> > Trivial fix to spelling mistake "oustanding" to "outstanding" in
> > dev_info and scmd_printk messages. Also join wrapped literal string in
> > the scmd_printk.
>
> Reviewed-by: Bart Van Assche 

Please hold this commit, as we have a patch set in progress for the MegaRaid
driver that will conflict with this patch.
We will re-submit this patch with the appropriate Signed-off and
Submitted-by tags.

Thanks, Kashyap


RE: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes

2016-11-29 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Monday, November 21, 2016 9:27 PM
> To: Kashyap Desai; Hannes Reinecke; Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> s...@vger.kernel.org; Hannes Reinecke
> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after register
> writes
>
> On 18.11.2016 17:48, Kashyap Desai wrote:
> >> -Original Message-
> >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> >> ow...@vger.kernel.org] On Behalf Of Tomas Henzl
> >> Sent: Friday, November 18, 2016 9:23 PM
> >> To: Hannes Reinecke; Martin K. Petersen
> >> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> >> s...@vger.kernel.org; Hannes Reinecke
> >> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after
> >> register
> > writes
> >> On 11.11.2016 10:44, Hannes Reinecke wrote:
> >>> The megaraid_sas HBA only has a single register for I/O submission,
> >>> which will be hit pretty hard with scsi-mq. To ensure that the PCI
> >>> writes have made it across we need to add a mmio barrier after each
> >>> write; otherwise I've been seeing spurious command completions and
> >>> I/O stalls.
> >> Why is it needed that the PCI write reaches the hw exactly at this
> > point?
> >> Is it possible that this is a hw deficiency like that the hw can't
> > handle
> >> communication without tiny pauses, and so possible to remove in next
> >> generation?
> >> Thanks,
> >> Tomas
> > I think this is good to have mmiowb as we are already doing  for
> > writel() version of megasas_return_cmd_fusion.
> > May be not for x86, but for some other CPU arch it is useful. I think
> > it become more evident while scs-mq support for more than one
> > submission queue patch of Hannes expose this issue.
> >
> > Probably this patch is good. Intermediate PCI device (PCI bridge ?)
> > may cache PCI packet and eventually out of order PCI packet to
> > MegaRAID HBA can cause this type of spurious completion.
>
> Usually drivers do not add a write barrier after every pci write, unless
> like here in
> megasas_fire_cmd_fusion in the 32bit part where are two paired writes and
> it
> must be ensured that the pair arrives without any other write in between.
>
> Why is it wrong when a pci write is overtaken by another write or when
> happens
> a bit later and if it is wrong - don't we need an additional locking too ?
> The execution of  megasas_fire_cmd_fusion might be interrupted and a delay
> can happen at any time.

Since Hannes mentioned that in his experiment with the mq megaraid_sas patch,
creating more submission queues to SML caused invalid/spurious completions in
his code, I am trying to understand whether "mmiowb" after the writeq() call
is safe. My understanding is that "writeq" is atomic, and its two 32-bit PCI
WRITEs will reach the PCI end device in the same sequence, assuming there is
no PCI-level caching on the Intel x86 platform. E.g. if two CPUs execute
writeq(), the PCI writes will always reach the end device in the same
sequence. Assume Tag-1, Tag-2, Tag-3, Tag-4 is the expected sequence. If, due
to some system-level optimization, the device sees Tag-1, Tag-3, Tag-2, Tag-4
arrive, then we may see the issue Hannes experienced. We have seen very rare
instances of dual/spurious SMID completion on the released megaraid_sas
driver, and it is not easy to reproduce... so along those lines, is it easier
to reproduce such an issue by opening more submission queues to SML (just to
trigger the spurious completion effectively)? We will apply all the patches
posted by Hannes *just* to understand this particular spurious completion
issue and under what conditions it has impact.

We will post whether this mmiowb() call after writeq() is good to have.

~ Kashyap

>
> tomash
>
> >
> >>> Signed-off-by: Hannes Reinecke 
> >>> ---
> >>>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> index aba53c0..729a654 100644
> >>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> @@ -196,6 +196,7 @@ inline void megasas_return_cmd_fusion(struct
> >> megasas_instance *instance,
> >>>   le32_to_cpu(req_desc->u.low));
> >>>
> >>>   writeq(req_data, &instance->re

RE: [PATCH] Update 3ware driver email addresses

2016-12-07 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, December 08, 2016 4:43 AM
> To: adam radford
> Cc: linux-scsi; Kashyap Desai
> Subject: Re: [PATCH] Update 3ware driver email addresses
>
> >>>>> "Adam" == adam radford  writes:
>
> Adam,
>
> Adam> This maintainers/email update patch didn't get picked up.  Do I
> Adam> need to fix it or re-send ?
>
> I still have it in the queue. Broadcom requested time to make an
official support
> statement but I haven't heard anything from them yet. Kashyap?


Martin -
Official support statement from Broadcom -
LSI/Broadcom has stopped supporting 3Ware controllers.  If Adam volunteers to
keep it alive, we would like to remove the www.lsi.com references and make
it purely his driver.

Adam - can you resend the patch with "www.lsi.com" removed from the 3Ware
drivers? Me/Sumit will ack, and Martin can pick it up for the next release.

>
> --
> Martin K. Petersen      Oracle Linux Engineering


RE: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried

2016-12-14 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Wednesday, December 14, 2016 9:07 PM
> To: Sumit Saxena; linux-scsi
> Subject: Re: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI
> commands to be retried
>
> On 12/13/2016 02:19 PM, Sumit Saxena wrote:
> > Hi all,
> >
> > I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
> > returned by LLD to SCSI mid layer.
> >
> > Let me give some background here.
> > I am using megaraid_sas controller. megaraid_sas driver returns all
> > outstanding SCSI commands back to SCSI layer with DID_RESET host_byte
> > before resetting controller.
> > The intent is- all these commands returned with DID_RESET before
> > controller reset should be retried by SCSI.
> >
> > In few distros, I have observed that if SYNCHRONIZE_CACHE
> > command(should be applicable for all non Read write commands) is
> > outstanding before resetting controller  and driver returns those
> > commands back with DID_RESET then SYNCHRONIZE_CACHE command not
> > retried but failed to upper layer but other READ/WRITE commands were
> > not failed but retried. I was running filesystem IOs and
> > SYNHRONIZE_CACHE returned with error cause filesystem to go in READ
> > only mode.
> >
> > Later I checked and realized below commit was missing in that distro
> > kernel and seems to cause the problem-
> >
> > a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
> > commands
> >
> > But distro kernel has below commit-
> >
> > 89fb4cd scsi: handle flush errors properly
> >
> > Issue does not hit on the kernels which don't have both commits but
> > hits when commit- "89fb4cd scsi: handle flush errors properly " is
> > there but commit-  "a621bac scsi_lib: correctly retry failed zero
> > length REQ_TYPE_FS commands" is missing.
> >
> > This issue is observed with mpt3sas driver as well and should be
> > applicable to all LLDs returning non read write commands with DID_RESET.
> >
> > Returning DID_REQUEUE instead of DID_RESET from driver solves the
> > problem irrespective of whether these above mentioned commits are
> > there or not in kernel.
> > So I am thinking to use DID_REQUEUE instead of DID_RESET in
> > megaraid_sas driver for all SCSI commands(not only limited to
> > SYNCHRONIZE_CACHE or non read write commands) before resetting
> > controller. As I mentioned intent is those outstanding commands
> > returned by driver before doing controller reset should be retried and
> > as soon as reset is complete, driver will be processing those
> > commands. There is no sense key associated with SCSI commands
> > returned.
> >
> > I browsed SCSI code and get to know DID_REQUEUE causes command to be
> > retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
> > This seems to be good enough for our requirement of commands to be
> > re-tried by SCSI layer.
> >
> > Please provide feedback if anyone forsee any issue with using
> > DID_REQUEUE for this use case.
> > I will be doing some testing with DID_REQUEUE in place of DID_RESET in
> > megaraid_sas driver.
> >
> > I have attached here a proposed patch for megaraid_sas driver.
> > If there are no concerns, I will send this patch to SCSI mailing list.
> >
> Hmm.
>
> DID_RESET is meant to be an indicator to the SCSI midlayer that the host /
> device was reset, and the command _should_ be retried.
> DID_REQUEUE OTOH is an indicator to the SCSI midlayer to retry the
> command.
>
> The problem with DID_RESET is that it's slightly underspecified; there is
> no
> indication if the command has been processed (and if so, up to which
> parts?)
> DID_REQUEUE, OTOH, will always cause a retry.
>
> So yeah, I guess DID_REQUEUE is a better choice here.

Thanks Hannes. We agree that the DID_REQUEUE functionality is the better
choice, and we are planning to move to it with the proposed patch.

Thanks for your feedback.  We will send the final patch upstream.
For now, we are queuing this change for internal testing.


>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)


[PATCH] preview - block layer help to detect sequential IO

2017-01-11 Thread Kashyap Desai
The objective of this patch is -

to move the code used in the bcache module to find IO streams into the block
layer.
Reference code @drivers/md/bcache/request.c check_should_bypass().
This is a high-level patch, posted for review and to understand whether it is
worth following.

As of now the bcache module uses this logic, but it would be good to have it
in the block layer and expose a function for external use.

In this patch, I move the sequential IO search logic into the block layer and
expose the function blk_queue_rq_seq_cutoff.
A low level driver just needs to call it if it wants stream detection per
request queue.
For my testing I just added a call to
blk_queue_rq_seq_cutoff(sdev->request_queue, 4) in the megaraid_sas driver.

In general, the code of the bcache module was referred to, and it does almost
the same as what we want to do in the megaraid_sas driver in the patch
below -

http://marc.info/?l=linux-scsi&m=148245616108288&w=2

The bcache implementation uses a search algorithm (hashed on the bio start
sector) and detects 128 streams.  The intent of that implementation is to
skip placing sequential IO on the SSD and move it directly to the HDD.

Will it be a good design to keep this algorithm open at the block layer (as
proposed in this patch)?

Signed-off-by: Kashyap Desai 
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..2e93d14 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 {
struct request_queue *q;
int err;
+   struct seq_io_tracker *io;
 
q = kmem_cache_alloc_node(blk_requestq_cachep,
gfp_mask | __GFP_ZERO, node_id);
@@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
if (blkcg_init_queue(q))
goto fail_ref;
+   
+   q->sequential_cutoff = 0;
+   spin_lock_init(&q->io_lock);
+   INIT_LIST_HEAD(&q->io_lru);
+
+   for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
+   list_add(&io->lru, &q->io_lru);
+   hlist_add_head(&io->hash, q->io_hash + BLK_RECENT_IO);
+   }
 
return q;
 
@@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
return 0;
 }
 
+static void add_sequential(struct task_struct *t)
+{
+#define blk_ewma_add(ewma, val, weight, factor) \
+({  \
+(ewma) *= (weight) - 1; \
+(ewma) += (val) << factor;  \
+(ewma) /= (weight); \
+(ewma) >> factor;   \
+})
+
+   blk_ewma_add(t->sequential_io_avg,
+t->sequential_io, 8, 0);
+
+   t->sequential_io = 0;
+}
+static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
+{
+   return &q->io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -1884,6 +1914,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned 
int nr_sectors)
int err = -EIO;
char b[BDEVNAME_SIZE];
struct hd_struct *part;
+   struct task_struct *task = current;
 
might_sleep();
 
@@ -1957,6 +1988,42 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
if (!blkcg_bio_issue_check(q, bio))
return false;
 
+   if (q->sequential_cutoff) {
+   struct seq_io_tracker *i;
+   unsigned sectors;
+
+   spin_lock(&q->io_lock);
+
+   hlist_for_each_entry(i, blk_iohash(q, bio->bi_iter.bi_sector), 
hash)
+   if (i->last == bio->bi_iter.bi_sector &&
+   time_before(jiffies, i->jiffies))
+   goto found;
+
+   i = list_first_entry(&q->io_lru, struct seq_io_tracker, lru);
+
+   add_sequential(task);
+   i->sequential = 0;
+found:
+   if (i->sequential + bio->bi_iter.bi_size > i->sequential)
+   i->sequential   += bio->bi_iter.bi_size;
+
+   i->last  = bio_end_sector(bio);
+   i->jiffies   = jiffies + msecs_to_jiffies(5000);
+   task->sequential_io  = i->sequential;
+
+   hlist_del(&i->hash);
+   hlist_add_head(&i->hash, blk_iohash(q, i->last));
+   list_move_tail(&i->lru, &q->io_lru);
+
+   spin_unlock(&q->io_lock);
+
+   sectors = max(task->sequential_io,
+ task->sequential_io_avg) >> 9;
+   

RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Kashyap Desai
> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Thursday, January 12, 2017 1:18 AM
> To: Kashyap Desai
> Cc: kbuild-...@01.org; linux-scsi@vger.kernel.org;
linux-bl...@vger.kernel.org;
> ax...@kernel.dk; martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> sumit.sax...@broadcom.com; Kashyap desai
> Subject: Re: [PATCH] preview - block layer help to detect sequential IO
>
> Hi Kashyap,
>
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to block/for-next linus/master linux/master next-20170111]
[if
> your patch is applied to the wrong git tree, please drop us a note to
help
> improve the system]
>
> url:
https://github.com/0day-ci/linux/commits/Kashyap-Desai/preview-block-
> layer-help-to-detect-sequential-IO/20170112-024228
> config: i386-randconfig-a0-201702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>block/blk-core.c: In function 'add_sequential':
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,


This error is fixable. For now, I just wanted to get a high-level review of
the idea.
The defines below are required to use sequential_io and sequential_io_avg. I
have enabled BCACHE in my .config for testing.

#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
	unsigned int	sequential_io;
	unsigned int	sequential_io_avg;
#endif

Looking for high-level review comments.

~ Kashyap


>^
>block/blk-core.c:1893:10: note: in definition of macro 'blk_ewma_add'
> (ewma) *= (weight) - 1;
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1894:10: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>  ^~~~
> >> block/blk-core.c:1900:5: error: 'struct task_struct' has no member
named
> 'sequential_io'
>t->sequential_io, 8, 0);
> ^
>block/blk-core.c:1894:20: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>^~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1895:10: note: in definition of macro 'blk_ewma_add'
> (ewma) /= (weight);
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1896:10: note: in definition of macro 'blk_ewma_add'
> (ewma) >> factor;
\
>  ^~~~
>block/blk-core.c:1902:3: error: 'struct task_struct' has no member
named
> 'sequential_io'
>  t->sequential_io = 0;
>   ^~
>block/blk-core.c: In function 'generic_make_request_checks':
>block/blk-core.c:2012:7: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   task->sequential_io  = i->sequential;
>   ^~
>In file included from block/blk-core.c:14:0:
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:2: note: in definition of macro '__max'
>  t1 max1 = (x); \
>  ^~
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:13: note: in definition of macro '__max'
>  t1 max1 = (x); \
> ^
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2021:14: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
> 

RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-16 Thread Kashyap Desai
> Hi, Kashyap,
>
> I'm CC-ing Kent, seeing how this is his code.

Hi Jeff and Kent, See my reply inline.

>
> Kashyap Desai  writes:
>
> > Objective of this patch is -
> >
> > To move code used in bcache module in block layer which is used to
> > find IO stream.  Reference code @drivers/md/bcache/request.c
> > check_should_bypass().  This is a high level patch for review and
> > understand if it is worth to follow ?
> >
> > As of now bcache module use this logic, but good to have it in block
> > layer and expose function for external use.
> >
> > In this patch, I move logic of sequential IO search in block layer and
> > exposed function blk_queue_rq_seq_cutoff.  Low level driver just need
> > to call if they want stream detection per request queue.  For my
> > testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> > 4) megaraid_sas driver.
> >
> > In general, code of bcache module was referred and they are doing
> > almost same as what we want to do in megaraid_sas driver below patch -
> >
> > http://marc.info/?l=linux-scsi&m=148245616108288&w=2
> >
> > bcache implementation use search algorithm (hashed based on bio start
> > sector) and detects 128 streams.  wanted those implementation
> > to skip sequential IO to be placed on SSD and move it direct to the
> > HDD.
> >
> > Will it be good design to keep this algorithm open at block layer (as
> > proposed in patch.) ?
>
> It's almost always a good idea to avoid code duplication, but this patch
> definitely needs some work.

Jeff, I was not aware of the actual block layer module, so I created just a
working patch to explain my point.
Check the new patch. This patch contains driver changes only, in the 
driver.

1. The MR driver patch below does similar things, but the code is an
array-based linear lookup.
 http://marc.info/?l=linux-scsi&m=148245616108288&w=2

2. I thought to improve this using the appended patch. It is similar to what
 is doing. This patch has duplicate code, as  is doing the same.

>
> I haven't looked terribly closely at the bcache implementaiton, so do
let me
> know if I've misinterpreted something.
>
> We should track streams per io_context/queue pair.  We already have a
data
> structure for that, the io_cq.  Right now that structure is tailored for
use by the
> I/O schedulers, but I'm sure we could rework that.  That would also get
rid of the
> tremedous amount of bloat this patch adds to the request_queue.  It will
also
> allow us to remove the bcache-specific fields that were added to
task_struct.
> Overall, it should be a good simplification, unless I've completely
missed the
> point (which happens).

Your understanding of the requirement is correct. What we need is a tracker
of  in the block layer, consulted for every request to know whether it
is a random or sequential IO.  As you explained, there is similar logic in
 ... I searched the kernel code and found the code section below @
block/elevator.c

/*
 * See if our hash lookup can find a potential backmerge.
 */
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);


I am looking for logic similar to elv_rqhash_find(), run for all IOs, which
provides information in the request if this particular request is a potential
back-merge candidate (having a new req_flags_t, e.g. RQF_SEQ). It is OK even
if the request was not merged due to other checks in the IO path.

On the safer side (to avoid any performance issues), we can opt for an API to
be called by the low level driver on a particular request queue/sdev, if
someone is interested in such help for that request queue.

I need help (some level of patch to work on) or pointers, if this path is
good. I can drive this, but need to understand the direction.

>
> I don't like that you put sequential I/O detection into bio_check_eod.
> Split it out into its own function.

Sorry for this. I thought sending a patch would give a better understanding.
My first patch was very high level and not compliant with many design and
coding conventions.
For my learning - BTW, for such a post (if I only have a high-level patch),
what shall I do?

> You've added a member to struct bio that isn't referenced.  It would
have been
> nice of you to put enough work into this RFC so that we could at least
see how
> the common code was used by bcache and your driver.

See my second patch appended here. I can work on generic block layer changes
if we have another area (as mentioned, elevator/cfq) doing the stuff I am
looking for.

>
> EWMA (exponentially weighted moving average) is not an acronym I keep
handy
> in my head.  It would be nice to add documentation on the algorithm and
design
> choices.  More comments in the code would also be appreciated.  CFQ does
> some 

RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and confirm that the issue reported
in this thread is resolved.

Now I see that both MQ and SQ mode result in a sequential IO pattern while IO
is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it good to
pause IO for a few usec in the LLD?
I mean, I want to avoid the driver asking SML/the block layer to re-queue the
IO (if it is sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver.  This driver exposes can_queue,
but it internally consumes commands for RAID 1 fast path.
In the worst case, can_queue/2 will consume all firmware resources and the
driver will re-queue further IOs to SML as below -

   if (atomic_inc_return(&instance->fw_outstanding) >
   instance->host->can_queue) {
   atomic_dec(&instance->fw_outstanding);
   return SCSI_MLQUEUE_HOST_BUSY;
   }

I want to avoid above SCSI_MLQUEUE_HOST_BUSY.

Need your suggestion for below changes -

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
u32 index;
-   struct fusion_context *fusion;
+   boolis_nonrot;
+   u32 safe_can_queue;
+   u32 num_cpus;
+   struct fusion_context *fusion;
+
+   fusion = instance->ctrl_context;
+
+   num_cpus = num_online_cpus();
+   safe_can_queue = instance->cur_can_queue - num_cpus;

fusion = instance->ctrl_context;

@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
return SCSI_MLQUEUE_DEVICE_BUSY;
}

-   if (atomic_inc_return(&instance->fw_outstanding) >
-   instance->host->can_queue) {
-   atomic_dec(&instance->fw_outstanding);
-   return SCSI_MLQUEUE_HOST_BUSY;
-   }
+   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+   /* For rotational device wait for sometime to get fusion command
from pool.
+* This is just to reduce proactive re-queue at mid layer which is
not
+* sending sorted IO in SCSI.MQ mode.
+*/
+   if (!is_nonrot)
+   udelay(100);
+   }

cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

~ Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-ker...@vger.kernel.org'; 'linux-
> bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens- Replied inline.
>
>
> Omar -  I tested your WIP repo and figure out System hangs only if I pass
> "
> scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine, but I
> am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is snippet of blktrace. In case of higher per device QD, I see
> Requeue request in blktrace.
>
> 65,128 10 6268 2.432404509 18594  P   N [fio]
>  65,128 10 6269 2.432405013 18594  U   N [fio] 1
>  65,128 10 6270 2.432405143 18594  I  WS 148800 + 8 [fio]
>  65,128 10 6271 2.432405740 18594  R  WS 148800 + 8 [0]
>  65,128 10 6272 2.432409794 18594  Q  WS 148808 + 8 [fio]
>  65,128 10 6273 2.432410234 18594  G  WS 148808 + 8 [fio]
>  65,128 10 6274 2.432410424 18594  S  WS 148808 + 8 [fio]
>  65,128 23 3626 2.432432595 16232  D  WS 148800 + 8
> [kworker/23:1H]
>  65,128 22 3279 2.432973482 0  C  WS 147432 + 8 [0]
>  65,128  7 6126 2.433032637 18594  P   N [fio]
>  65,128  7 6127 2.433033204 18594  U   N [fio] 1
>  65,128  7 6128 2.433033346 18594  I  WS 148808 + 8 [fio]
>  65,128  7 6129 2.433033871 18594  D  WS 148808 + 8 [fio]
>  65,128  7 6130 2.433034559 18594  R  WS 148808 + 8 [0]
>  65,128  7 6131 2.433039796 18594  Q  WS 148816 + 8 [fio]
>  65,128  7 6132 2.433040206 18594  G  WS 148816 + 8 [fio]
>  65,128  7 6133 2.433040351 18594  S  WS 148816 + 8 [fio]
>  65,128  9 6392 2.433133729 0  C  WS 147240 + 8 [0]
>  65,128  9 6393

RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -   if (atomic_inc_return(&instance->fw_outstanding) >
> >> -   instance->host->can_queue) {
> >> -   atomic_dec(&instance->fw_outstanding);
> >> -   return SCSI_MLQUEUE_HOST_BUSY;
> >> -   }
> >> +   if (atomic_inc_return(&instance->fw_outstanding) >
safe_can_queue) {
> >> +   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +   /* For rotational device wait for sometime to get fusion
> >> + command
> >> from pool.
> >> +* This is just to reduce proactive re-queue at mid layer
> >> + which is
> >> not
> >> +* sending sorted IO in SCSI.MQ mode.
> >> +*/
> >> +   if (!is_nonrot)
> >> +   udelay(100);
> >> +   }
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a
> great idea. We want to fix the reordering due to requeues, not introduce
> random busy delays to work around it.

Thanks for the feedback. I do realize that udelay() is going to be very odd
in the queuecommand callback.  I will keep this in mind. The preferred
solution is the blk-mq scheduler patches.
>
> --
> Jens Axboe


RE: [PATCH 00/10] mpt3sas: full mq support

2017-01-31 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Tuesday, January 31, 2017 4:47 PM
> To: Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
Sathya
> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
> > On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
> >> Hi all,
> >>
> >> this is a patchset to enable full multiqueue support for the mpt3sas
> driver.
> >> While the HBA only has a single mailbox register for submitting
> >> commands, it does have individual receive queues per MSI-X interrupt
> >> and as such does benefit from converting it to full multiqueue
support.
> >
> > Explanation and numbers on why this would be beneficial, please.
> > We should not need multiple submissions queues for a single register
> > to benefit from multiple completion queues.
> >
> Well, the actual throughput very strongly depends on the blk-mq-sched
> patches from Jens.
> As this is barely finished I didn't post any numbers yet.
>
> However:
> With multiqueue support:
> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt= 60021msec
> With scsi-mq on 1 queue:
> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec So
> yes, there _is_ a benefit.
>
> (Which is actually quite cool, as these tests were done on a SAS3 HBA,
so
> we're getting close to the theoretical maximum of 1.2GB/s).
> (Unlike the single-queue case :-)

Hannes -

Can you share details about the setup? How many drives do you have, and
how are they connected (enclosure -> drives)?
To me it looks like the current mpt3sas driver may be taking a bigger
hit on spinlock operations (the penalty on NUMA architectures is higher
compared to a single-socket server), unlike megaraid_sas, which uses the
shared blk tag.

I mean that the "[PATCH 08/10] mpt3sas: lockless command submission for
scsi-mq" patch improves performance by removing spinlock overhead and
fetching the request via blk tags.
Do you see a performance improvement if you hard-code nr_hw_queues = 1
in the code below, which is part of "[PATCH 10/10] mpt3sas: scsi-mq
interrupt steering"?

@@ -9054,6 +9071,8 @@ static void sas_device_make_active(struct
MPT3SAS_ADAPTER *ioc,
shost->max_lun = max_lun;
shost->transportt = mpt3sas_transport_template;
shost->unique_id = ioc->id;
+   if (shost->use_blk_mq)
+   shost->nr_hw_queues = num_online_cpus();


Thanks, Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB
21284
> (AG Nürnberg)


RE: [PATCH 00/10] mpt3sas: full mq support

2017-01-31 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, February 01, 2017 12:21 PM
> To: Kashyap Desai; Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> Sathya
> Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Hannes Reinecke [mailto:h...@suse.de]
> >> Sent: Tuesday, January 31, 2017 4:47 PM
> >> To: Christoph Hellwig
> >> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> > Sathya
> >> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
> >> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
> >>
> >> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
> >>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
> >>>> Hi all,
> >>>>
> >>>> this is a patchset to enable full multiqueue support for the
> >>>> mpt3sas
> >> driver.
> >>>> While the HBA only has a single mailbox register for submitting
> >>>> commands, it does have individual receive queues per MSI-X
> >>>> interrupt and as such does benefit from converting it to full
> >>>> multiqueue
> > support.
> >>>
> >>> Explanation and numbers on why this would be beneficial, please.
> >>> We should not need multiple submissions queues for a single register
> >>> to benefit from multiple completion queues.
> >>>
> >> Well, the actual throughput very strongly depends on the blk-mq-sched
> >> patches from Jens.
> >> As this is barely finished I didn't post any numbers yet.
> >>
> >> However:
> >> With multiqueue support:
> >> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=
> 60021msec
> >> With scsi-mq on 1 queue:
> >> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
> >> So yes, there _is_ a benefit.
> >>
> >> (Which is actually quite cool, as these tests were done on a SAS3
> >> HBA,
> > so
> >> we're getting close to the theoretical maximum of 1.2GB/s).
> >> (Unlike the single-queue case :-)
> >
> > Hannes -
> >
> > Can you share detail about setup ? How many drives do you have and how
> > is connection (enclosure -> drives. ??) ?
> > To me it looks like current mpt3sas driver might be taking more hit in
> > spinlock operation (penalty on NUMA arch is more compare to single
> > core
> > server) unlike we have in megaraid_sas driver use of shared blk tag.
> >
> The tests were done with a single LSI SAS3008 connected to a NetApp E-
> series (2660), using 4 LUNs under MD-RAID0.
>
> Megaraid_sas is even worse here; due to the odd nature of the 'fusion'
> implementation we're ending up having _two_ sets of tags, making it really
> hard to use scsi-mq here.

The current megaraid_sas driver, with a single submission queue exposed
to blk-mq, does not encounter a similar performance issue.
We may not see a significant performance improvement if we attempt the
same for the megaraid_sas driver.
We had a similar discussion for megaraid_sas and hpsa:
http://www.spinics.net/lists/linux-scsi/msg101838.html

This patch series looks like a similar attempt for mpt3sas. Am I missing
anything?

The megaraid_sas driver simply indexes by blk_tag and fires the IO
quickly enough, unlike mpt3sas, where lock contention at the driver
level is the bottleneck.
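To illustrate the blk_tag indexing idea mentioned above, here is a
minimal user-space sketch (not driver code; the structure names and
array size are invented for illustration). Because the block layer hands
the driver a unique tag per request, the command frame is found with
simple arithmetic and no lock, which is the contention-free path
megaraid_sas takes:

	#include <assert.h>
	#include <stddef.h>

	#define MAX_CMDS 16

	struct cmd {
		int tag;
		int in_use;
	};

	static struct cmd cmd_list[MAX_CMDS];

	/*
	 * Tag-indexed lookup: O(1), no spinlock, no shared free list.
	 * The tag uniquely identifies a preallocated command frame.
	 */
	static struct cmd *get_cmd_from_tag(unsigned int blk_tag)
	{
		struct cmd *c = &cmd_list[blk_tag];
		c->in_use = 1;
		return c;
	}

	int main(void)
	{
		for (int i = 0; i < MAX_CMDS; i++)
			cmd_list[i].tag = i;

		struct cmd *c = get_cmd_from_tag(5);
		assert(c == &cmd_list[5] && c->tag == 5 && c->in_use);
		return 0;
	}

A lock-protected free list, by contrast, serializes every submission on
one spinlock, which is the driver-level bottleneck referred to above.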

> (Not that I didn't try; but lacking a proper backend it's really hard to
> evaluate
> the benefit of those ... spinning HDDs simply don't cut it here)
>
> > I mean " [PATCH 08/10] mpt3sas: lockless command submission for scsi-
> mq"
> > patch is improving performance removing spinlock overhead and
> > attempting to get request using blk_tags.
> > Are you seeing performance improvement  if you hard code nr_hw_queues
> > = 1 in below code changes part of "[PATCH 10/10] mpt3sas: scsi-mq
> > interrupt steering"
> >
> No. The numbers posted above are generated with exactly that patch; the
> first line is running with nr_hw_queues=32 and the second line with
> nr_hw_queues=1.

Thanks, Hannes. That clarifies it. Can you share the script you used?

If my understanding is correct, you will see the theoretical maximum of
1.2GB/s only if you restrict your workload to a single NUMA node. This
is just to understand whether driver spinlocks are adding overhead. We
have seen such overhead on multi-socket server an

RE: [PATCH 03/39] megaraid_sas: raid 1 fast path code optimize

2017-02-06 Thread Kashyap Desai
> >
> >  /**
> > + * megasas_complete_r1_command -
> > + * completes R1 FP write commands which has valid peer smid
> > + * @instance:  Adapter soft state
> > + * @cmd_fusion:MPT command frame
> > + *
> > + */
> > +static inline void
> > +megasas_complete_r1_command(struct megasas_instance *instance,
> > +   struct megasas_cmd_fusion *cmd) {
> > +   u8 *sense, status, ex_status;
> > +   u32 data_length;
> > +   u16 peer_smid;
> > +   struct fusion_context *fusion;
> > +   struct megasas_cmd_fusion *r1_cmd = NULL;
> > +   struct scsi_cmnd *scmd_local = NULL;
> > +   struct RAID_CONTEXT_G35 *rctx_g35;
> > +
> > +   rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35;
> > +   fusion = instance->ctrl_context;
> > +   peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid);
> > +
> > +   r1_cmd = fusion->cmd_list[peer_smid - 1];
> > +   scmd_local = cmd->scmd;
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
> > +   data_length = cmd->io_request->DataLength;
> > +   sense = cmd->sense;
> > +
> > +   cmd->cmd_completed = true;
> > +
> > +   /* Check if peer command is completed or not*/
> > +   if (r1_cmd->cmd_completed) {
> > +   if (rctx_g35->status != MFI_STAT_OK) {
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
>
> Both status + ex_status were already set to the same value, why is it
> repeated here ?

Tomas, this needs a fix. The RAID context should be switched to the one
from r1_cmd, but that is not done here.
If the r1_cmd completed with a failure, we want to pick up the status
and extended status from r1_cmd and send that final status to the mid
layer.

We will fix this and resend the patch. It will look like this:

	if (r1_cmd->cmd_completed) {
		/* This line should be added: */
		rctx_g35 = &r1_cmd->io_request->RaidContext.raid_context_g35;
		if (rctx_g35->status != MFI_STAT_OK) {
			status = rctx_g35->status;
			ex_status = rctx_g35->ex_status;

Thanks, Kashyap

>
> Tomas
>


RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach

2017-02-06 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Monday, February 06, 2017 9:35 PM
> To: Shivasharan S; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
during
> controller shutdown/detach
>
> On 6.2.2017 11:00, Shivasharan S wrote:
> > Signed-off-by: Kashyap Desai 
> > Signed-off-by: Shivasharan S
> 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index 04ef0a0..b29cfd3 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev,
> pm_message_t state)
> > if (instance->ev != NULL) {
> > struct megasas_aen_event *ev = instance->ev;
> > cancel_delayed_work_sync(&ev->hotplug_work);
> > +   flush_scheduled_work();
> > instance->ev = NULL;
> > }
> >
> > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
> > if (instance->ev != NULL) {
> > struct megasas_aen_event *ev = instance->ev;
> > cancel_delayed_work_sync(&ev->hotplug_work);
> > +   flush_scheduled_work();
> > instance->ev = NULL;
> > }
> >
>
> Why is cancel_delayed_work_sync not good enough?

The megaraid_sas driver schedules certain work items on the global
workqueue. Below are the ones in question:

	if (instance->ctrl_context) {
		INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq);
		INIT_WORK(&instance->crash_init,
			  megasas_fusion_crash_dump_wq);
	} else
		INIT_WORK(&instance->work_init,
			  process_fw_state_change_wq);

cancel_delayed_work_sync() targets only the hotplug AEN work. By also
calling flush_scheduled_work() we want the work items listed above to be
completed as well.

>
> tomash


RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach

2017-02-07 Thread Kashyap Desai
> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Monday, February 06, 2017 10:48 PM
> To: 'Tomas Henzl'; Shivasharan Srikanteshwara;
'linux-scsi@vger.kernel.org'
> Cc: 'martin.peter...@oracle.com'; 'j...@linux.vnet.ibm.com'; Sumit
Saxena;
> 'h...@suse.com'
> Subject: RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
during
> controller shutdown/detach
>
> > -Original Message-
> > From: Tomas Henzl [mailto:the...@redhat.com]
> > Sent: Monday, February 06, 2017 9:35 PM
> > To: Shivasharan S; linux-scsi@vger.kernel.org
> > Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> > kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> > Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
> > during controller shutdown/detach
> >
> > On 6.2.2017 11:00, Shivasharan S wrote:
> > > Signed-off-by: Kashyap Desai 
> > > Signed-off-by: Shivasharan S
> > 
> > > ---
> > >  drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > index 04ef0a0..b29cfd3 100644
> > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev,
> > pm_message_t state)
> > >   if (instance->ev != NULL) {
> > >   struct megasas_aen_event *ev = instance->ev;
> > >   cancel_delayed_work_sync(&ev->hotplug_work);
> > > + flush_scheduled_work();
> > >   instance->ev = NULL;
> > >   }
> > >
> > > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev
> > *pdev)
> > >   if (instance->ev != NULL) {
> > >   struct megasas_aen_event *ev = instance->ev;
> > >   cancel_delayed_work_sync(&ev->hotplug_work);
> > > + flush_scheduled_work();
> > >   instance->ev = NULL;
> > >   }
> > >
> >
> > Why is cancel_delayed_work_sync not good enough?
>
> Megaraid_sas driver use certain work on global work queue.
>
> Below are the listed one -
>
>   if (instance->ctrl_context) {
>   INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq);
>   INIT_WORK(&instance->crash_init,
> megasas_fusion_crash_dump_wq);
>   }
>   else
>   INIT_WORK(&instance->work_init,
> process_fw_state_change_wq)
>
> Cancel_delayed_work_sync() was mainly targeted for only hotplug AEN
work.
> Calling flush_scheduled_work() we want above listed work to be completed
> as well.

Tomas - here is one more update. I agree with your assessment; we don't
need this patch.

In our local repo the code was as below, and as part of the sync-up
activity I did not realize that upstream is already using
cancel_delayed_work_sync(), which internally does the same thing:

	cancel_delayed_work(&ev->hotplug_work);
	flush_scheduled_work();

Just for info: a similar patch was posted for mpt2sas a long time ago to
replace the above combination with cancel_delayed_work_sync():

https://lkml.org/lkml/2010/12/21/127

We will accommodate the removal of this patch in the V2 submission.



>
> >
> > tomash


RE: [PATCH 13/39] megaraid_sas : set residual bytes count during IO compeltion

2017-02-07 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Tuesday, February 07, 2017 5:22 AM
> To: Shivasharan S
> Cc: linux-scsi@vger.kernel.org; martin.peter...@oracle.com;
> the...@redhat.com; j...@linux.vnet.ibm.com;
> kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> Subject: Re: [PATCH 13/39] megaraid_sas : set residual bytes count
during IO
> compeltion
>
> > "Shivasharan" == Shivasharan S
>  writes:
>
> Shivasharan> Fixing issue of not setting residual bytes correctly.
>
> @@ -1464,6 +1465,15 @@ map_cmd_status(struct fusion_context *fusion,
>  SCSI_SENSE_BUFFERSIZE);
>   scmd->result |= DRIVER_SENSE << 24;
>   }
> +
> +	/*
> +	 * If the IO request is partially completed, then MR FW will
> +	 * update "io_request->DataLength" field with actual number of
> +	 * bytes transferred. Driver will set residual bytes count in
> +	 * SCSI command structure.
> +	 */
> +	resid = (scsi_bufflen(scmd) - data_length);
> +	scsi_set_resid(scmd, resid);
>
> Is data_length guaranteed to be a multiple of the logical block size?
> Otherwise you need to tweak the residual like we just did for mpt3sas.

Martin, the data length is always guaranteed to be a multiple of the
logical block size, unless we have a firmware defect.
In the past we have seen some partial/complete DMA data lengths returned
from firmware that were not aligned with the logical block size. That
was eventually root-caused and fixed in firmware.
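For reference, the defensive tweak Martin alludes to can be sketched in
plain C as follows (a user-space illustration with an invented helper
name, not the actual mpt3sas code). The idea is to round a
firmware-reported transfer count down to a block boundary before
computing the residual:

	#include <assert.h>

	/*
	 * Round the firmware-reported transfer count down to a multiple
	 * of the logical block size before computing the residual, so a
	 * misbehaving firmware cannot produce a partial-sector
	 * "completed" length.
	 */
	static unsigned int aligned_resid(unsigned int bufflen,
					  unsigned int xfer_cnt,
					  unsigned int blk_sz)
	{
		xfer_cnt -= xfer_cnt % blk_sz;	/* round down */
		return bufflen - xfer_cnt;
	}

	int main(void)
	{
		/* Well-behaved FW: 8 of 16 sectors done, resid = 4096 */
		assert(aligned_resid(8192, 4096, 512) == 4096);
		/* Misaligned count 4100 is rounded down to 4096 first */
		assert(aligned_resid(8192, 4100, 512) == 4096);
		return 0;
	}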

>
> --
> Martin K. PetersenOracle Linux Engineering


RE: [PATCH v2 03/39] megaraid_sas: raid 1 fast path code optimize

2017-02-08 Thread Kashyap Desai
> > +static inline void
> > +megasas_complete_r1_command(struct megasas_instance *instance,
> > +   struct megasas_cmd_fusion *cmd) {
> > +   u8 *sense, status, ex_status;
> > +   u32 data_length;
> > +   u16 peer_smid;
> > +   struct fusion_context *fusion;
> > +   struct megasas_cmd_fusion *r1_cmd = NULL;
> > +   struct scsi_cmnd *scmd_local = NULL;
> > +   struct RAID_CONTEXT_G35 *rctx_g35;
> > +
> > +   rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35;
> > +   fusion = instance->ctrl_context;
> > +   peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid);
> > +
> > +   r1_cmd = fusion->cmd_list[peer_smid - 1];
> > +   scmd_local = cmd->scmd;
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
> > +   data_length = cmd->io_request->DataLength;
> > +   sense = cmd->sense;
> > +
> > +   cmd->cmd_completed = true;
>
> Please help me understand how this works
> - there are two peer commands sent to the controller
> - both are completed and the later calls scsi_done and returns both
r1_cmd
> + cmd
> - if both commands can be completed at the same time, is it possible
that
> the
>   above line is executed at the same moment for both completions ?
> How is the code  protected against a double completion when both
> completed commands see the peer cmd_completed as set ?


Tomas, cmd and r1_cmd (part of the same RAID 1 fast path) will always be
completed on the same reply queue by firmware. That is one of the key
requirements here for the RAID 1 fast path.
What you describe would be possible if FW completed cmd and r1_cmd on
different reply queues. If you look at where we clone r1_cmd, you will
see we also clone the MSI-x index from the parent command.
So eventually FW is aware of the binding of both cmd and r1_cmd with
respect to the reply queue index.

` Kashyap

>
> > +
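The invariant described above can be modeled in a few lines of
stand-alone C (invented types, user-space only). Because both
completions arrive on one reply queue, they run sequentially, so the
cmd_completed handshake needs no lock and exactly one side finishes the
IO:

	#include <assert.h>
	#include <stdbool.h>

	struct cmd {
		bool cmd_completed;
		struct cmd *peer;
		int done_calls;	/* times the scsi_done-equivalent ran */
	};

	/* Runs once per completion, always from the same reply queue. */
	static void complete_one(struct cmd *c)
	{
		c->cmd_completed = true;
		/*
		 * Only the completion that observes its peer already
		 * done finishes the IO; since both run serially on one
		 * queue, exactly one of them does.
		 */
		if (c->peer->cmd_completed)
			c->done_calls++;
	}

	int main(void)
	{
		struct cmd a = { 0 }, b = { 0 };
		a.peer = &b;
		b.peer = &a;

		complete_one(&a);	/* peer not done yet */
		complete_one(&b);	/* second one finishes the IO */

		assert(a.done_calls + b.done_calls == 1);
		return 0;
	}

If the two completions could run concurrently on different reply queues,
both could observe the peer flag set and complete the IO twice, which is
the race Tomas asked about.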


RE: [PATCH v2 19/39] megaraid_sas: MR_TargetIdToLdGet u8 to u16 and avoid invalid raid-map access

2017-02-09 Thread Kashyap Desai
> Signed-off-by: Shivasharan S 
> Signed-off-by: Kashyap Desai 

In this patch series we are done with the review, but this particular
patch is missing a Reviewed-by tag.
Kashyap


RE: [PATCH v2 21/39] megaraid_sas: big endian support changes

2017-02-09 Thread Kashyap Desai
> +static inline void set_num_sge(struct RAID_CONTEXT_G35 rctx_g35,
> +				u16 sge_count)
> +{
> +	rctx_g35.u.bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER);
> +	rctx_g35.u.bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER) &
> +				    NUM_SGE_MASK_UPPER);
> +}

This function and get_num_sge() below need a fix. We are supposed to
pass a pointer to struct RAID_CONTEXT_G35 so that the setting is
reflected in the IO frame; otherwise it is only set in stack-local
memory, which is not the intent here. We will fix this patch and resend
it. Resending only this patch works fine with the complete series (no
hunk failures observed), so we are going to push just this one patch
with the title below:

[PATCH v2 21/39 RESEND] megaraid_sas: big endian support changes

> +
> +static inline u16 get_num_sge(struct RAID_CONTEXT_G35 rctx_g35)
> +{
> +	u16 sge_count;
> +
> +	sge_count = (u16)(((rctx_g35.u.bytes[1] & NUM_SGE_MASK_UPPER)
> +			   << NUM_SGE_SHIFT_UPPER) | (rctx_g35.u.bytes[0]));
> +	return sge_count;
> +}
> +
> +#define SET_STREAM_DETECTED(rctx_g35) \
> + (rctx_g35.u.bytes[1] |= STREAM_DETECT_MASK)
> +
> +#define CLEAR_STREAM_DETECTED(rctx_g35) \
> + (rctx_g35.u.bytes[1] &= ~(STREAM_DETECT_MASK))
> +
> +static inline bool is_stream_detected(struct RAID_CONTEXT_G35
> +*rctx_g35) {
> + return ((rctx_g35->u.bytes[1] & STREAM_DETECT_MASK)); }
> +
>  union RAID_CONTEXT_UNION {
>   struct RAID_CONTEXT raid_context;
>   struct RAID_CONTEXT_G35 raid_context_g35;
> --
> 2.8.3
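The by-value bug described above can be demonstrated with a small
stand-alone C program (simplified, invented struct; only the
pointer variant's write survives the call):

	#include <assert.h>

	struct raid_context_g35 {
		unsigned char bytes[2];
	};

	/* Buggy: the struct is copied, so the write lands only in the
	 * callee's stack copy and is lost on return. */
	static void set_num_sge_byval(struct raid_context_g35 rctx,
				      unsigned char n)
	{
		rctx.bytes[0] = n;
	}

	/* Fixed: a pointer lets the write reach the caller's IO frame. */
	static void set_num_sge_byptr(struct raid_context_g35 *rctx,
				      unsigned char n)
	{
		rctx->bytes[0] = n;
	}

	int main(void)
	{
		struct raid_context_g35 rctx = { { 0, 0 } };

		set_num_sge_byval(rctx, 7);
		assert(rctx.bytes[0] == 0);	/* change was lost */

		set_num_sge_byptr(&rctx, 7);
		assert(rctx.bytes[0] == 7);	/* change is visible */
		return 0;
	}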


[PATCH] return valid data buffer length in scsi_bufflen() API using RQF_SPECIAL_PAYLOAD

2017-02-13 Thread Kashyap Desai
This is a regression due to commit
f9d03f96b988002027d4b28ea1b7a24729a4c9b5
("block: improve handling of the magic discard payload").

The HBA FW encounters a fault in a DMA operation while creating a file
system on SSDs. The CDB below causes the FW fault:
CDB: Write same(16) 93 08 00 00 00 00 00 00 00 00 00 00 80 00 00 00

The root cause is a mismatch between the SCSI buffer length and the DMA
buffer length for the WRITE SAME command.

Fix: return the valid data buffer length from the scsi_bufflen() API
using RQF_SPECIAL_PAYLOAD.

Signed-off-by: Kashyap Desai 
---
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index 9fc1aec..1f796fc 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -180,7 +180,8 @@ static inline struct scatterlist *scsi_sglist(struct
scsi_cmnd *cmd)

 static inline unsigned scsi_bufflen(struct scsi_cmnd *cmd)
 {
-   return cmd->sdb.length;
+   return (cmd->request->rq_flags & RQF_SPECIAL_PAYLOAD) ?
+   cmd->request->special_vec.bv_len : cmd->sdb.length;
 }

 static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid)
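The intent of the change can be modeled in a few lines of stand-alone C
(invented flag value and struct; the real code lives in scsi_cmnd.h).
For a command with a special payload, the length actually mapped for DMA
is reported instead of the scatter-gather data length:

	#include <assert.h>

	#define RQF_SPECIAL_PAYLOAD 0x1u

	struct fake_cmd {
		unsigned int rq_flags;
		unsigned int special_vec_len;	/* special_vec.bv_len */
		unsigned int sdb_length;
	};

	static unsigned int scsi_bufflen_model(const struct fake_cmd *cmd)
	{
		return (cmd->rq_flags & RQF_SPECIAL_PAYLOAD) ?
			cmd->special_vec_len : cmd->sdb_length;
	}

	int main(void)
	{
		/* WRITE SAME style payload: one page mapped for DMA */
		struct fake_cmd ws = { RQF_SPECIAL_PAYLOAD, 4096, 1u << 26 };
		/* Ordinary command: report the data buffer length */
		struct fake_cmd rw = { 0, 0, 8192 };

		assert(scsi_bufflen_model(&ws) == 4096);
		assert(scsi_bufflen_model(&rw) == 8192);
		return 0;
	}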


RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Kashyap Desai
>
>
> Hannes,
>
> Result I have posted last time is with merge operation enabled in block
> layer. If I disable merge operation then I don't see much improvement
> with
> multiple hw request queues. Here is the result,
>
> fio results when nr_hw_queues=1,
> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> runt=150003msec
>
> fio results when nr_hw_queues=24,
> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> runt=150001msec

Hannes -

I worked with Sreekanth and now understand the pros/cons of patch #10,
"[PATCH 10/10] mpt3sas: scsi-mq interrupt steering".

In that patch, the HBA's can_queue is divided based on the number of
logical CPUs; that is, we mimic multiqueue support on the mpt3sas HBA by
distributing resources that actually belong to a single submission H/W
queue. This approach badly impacts many performance areas.

nr_hw_queues = 1 is what I observe to be the best-performing approach,
since it never throttles IO as long as sdev->queue_depth is set to the
HBA queue depth. In the nr_hw_queues = "CPUs" case, IO is throttled at
the SCSI level, since we never allow more than the "updated can_queue"
in the LLD.

The code below brings the actual HBA can_queue very low (e.g. on a
96-logical-core CPU the new can_queue drops to 42 if the HBA queue depth
is 4K). It means we will see a lot of IO throttling in the SCSI mid
layer, because shost->can_queue reaches its limit very quickly if you
run jobs with a higher QD.

	if (ioc->shost->nr_hw_queues > 1) {
		ioc->shost->nr_hw_queues = ioc->msix_vector_count;
		ioc->shost->can_queue /= ioc->msix_vector_count;
	}
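A quick sanity check on the arithmetic quoted above (assuming a 4096
HBA queue depth split across 96 MSI-X vectors, one per logical core, to
match the example):

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		int can_queue = 4096;		/* HBA queue depth */
		int msix_vector_count = 96;	/* one per logical CPU */

		/* The integer division from the hunk above */
		can_queue /= msix_vector_count;

		printf("per-queue can_queue = %d\n", can_queue);
		assert(can_queue == 42);
		return 0;
	}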
I observe negative performance with 8 SSD drives attached to Ventura
(the latest IT controller). 16 fio jobs at QD=128 give ~1600K IOPS, and
the moment I switch to nr_hw_queues = "CPUs", it gives hardly ~850K
IOPS. This is mainly because host_busy is stuck very low, at ~169, on my
setup.

As Sreekanth mentioned, the performance improvement you observed may be
because nomerges=2 was not set, so the OS attempted soft back/front
merges.

I debugged a live machine and understood that we never see parallel
instances of scsi_dispatch_cmd as we would expect, because can_queue is
low. If we really had a *very* large HBA QD, this patch #10 exposing
multiple submission queues might be useful.

For now, we are looking for an updated version of the patch which keeps
the IT HBA in single-queue mode (as we are doing in the  driver) and
adds an interface to use blk_tag in both scsi-mq and !scsi-mq modes.
Sreekanth has already started working on it, but we may need a full
performance test run before posting the actual patch.
Maybe we can cherry-pick a few patches from this series and get blk_tag
support to improve the performance of  later, which will not allow users
to choose nr_hw_queues as a tunable.

Thanks, Kashyap


>
> Thanks,
> Sreekanth


RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, February 15, 2017 3:35 PM
> To: Kashyap Desai; Sreekanth Reddy
> Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux-
> s...@vger.kernel.org; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 02/15/2017 10:18 AM, Kashyap Desai wrote:
> >>
> >>
> >> Hannes,
> >>
> >> Result I have posted last time is with merge operation enabled in
> >> block layer. If I disable merge operation then I don't see much
> >> improvement with multiple hw request queues. Here is the result,
> >>
> >> fio results when nr_hw_queues=1,
> >> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> >> runt=150003msec
> >>
> >> fio results when nr_hw_queues=24,
> >> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> >> runt=150001msec
> >
> > Hannes -
> >
> >  I worked with Sreekanth and also understand pros/cons of Patch #10.
> > " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering"
> >
> > In above patch, can_queue of HBA is divided based on logic CPU, it
> > means we want to mimic as if mpt3sas HBA support multi queue
> > distributing actual resources which is single Submission H/W Queue.
> > This approach badly impact many performance areas.
> >
> > nr_hw_queues = 1 is what I observe as best performance approach since
> > it never throttle IO if sdev->queue_depth is set to HBA queue depth.
> > In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we
> > never allow more than "updated can_queue" in LLD.
> >
> True.
> And this was actually one of the things I wanted to demonstrate with this
> patchset :-) ATM blk-mq really works best when having a distinct tag space
> per port/device. As soon as the hardware provides a _shared_ tag space you
> end up with tag starvation issues as blk-mq only allows you to do a static
> split of the available tagspace.
> While this patchset demonstrates that the HBA itself _does_ benefit from
> using block-mq (especially on highly parallel loads), it also demonstrates
> that
> _block-mq_ has issues with singlethreaded loads on this HBA (or, rather,
> type of HBA, as I doubt this issue is affecting mpt3sas only).
>
> > Below code bring actual HBA can_queue very low ( Ea on 96 logical core
> > CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we
> > will see lots of IO throttling in scsi mid layer due to
> > shost->can_queue reach the limit very soon if you have  jobs with
> higher QD.
> >
> > if (ioc->shost->nr_hw_queues > 1) {
> > ioc->shost->nr_hw_queues = ioc->msix_vector_count;
> > ioc->shost->can_queue /= ioc->msix_vector_count;
> > }
> > I observe negative performance if I have 8 SSD drives attached to
> > Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K
> > IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly
> > ~850K IOPs. This is mainly because of host_busy stuck at very low ~169
> > on
> my setup.
> >
> Which actually might be an issue with the way scsi is hooked into blk-mq.
> The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the
> host is
> capable of accepting more commands.
> As we're limiting can_queue (to get the per-queue command depth
> correctly) we should be using the _overall_ command depth for the
> can_queue value itself to make the host_busy check work correctly.
>
> I've attached a patch for that; can you test if it makes a difference?
Hannes -
The attached patch works fine for me. FYI, we need to set the device
queue depth to can_queue, which we are currently not doing in the
mpt3sas driver.

With the attached patch I see a ~2-3% improvement when running multiple
jobs; with a single-job profile there is no difference.

So it looks like we can reach full performance with a single
nr_hw_queue.

We have some patches to send out, so we want to know how to rebase this
patch series, as a few patches are coming from Broadcom. Can we consider
the plan below?

- Patches 1-7 will be reposted. Sreekanth will also complete his review
of the existing patches 1-7.
- We need blk_tag support only for nr_hw_queues = 1.

With that said, we will have many code changes/functions without a
"shost_use_blk_mq" check, assuming the driver supports a single
nr_hw_queue.

E.g. the function below can be simplified: just take the tag from
scmd->request, with no need to check shost_use_blk_mq + nr_hw_queues
etc.

u16
mpt

RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> > - Later we can explore if nr_hw_queue more than one really add benefit.
> > From current limited testing, I don't see major performance boost if
> > we have nr_hw_queue more than one.
> >
> Well, the _actual_ code to support mq is rather trivial, and really serves
> as a
> good testbed for scsi-mq.
> I would prefer to leave it in, and disable it via a module parameter.

My thinking is that adding extra code for more than one nr_hw_queue will
add maintenance and support overhead. In particular, the IO error
handling code becomes complex in the nr_hw_queues > 1 case. If we really
want a performance boost, we should attempt it and bear the other side
effects.

For the time being, my choice is to drop nr_hw_queues > 1 support
entirely (not even behind a module parameter).

>
> But in either case, I can rebase the patches to leave any notions of
> 'nr_hw_queues' to patch 8 for implementing full mq support.

Thanks, Hannes. It was just a heads-up. We are not sure when we can
submit the upcoming patch set from Broadcom. Maybe we can sync up with
you offline in case any rebase is required.

>
> And we need to discuss how to handle MPI2_FUNCTION_SCSI_IO_REQUEST;
> the current method doesn't work with blk-mq.
> I really would like to see that go, especially as sg/bsg supports the same
> functionality ...
>


[PATCH] megaraid_sas: enable intx only if msix request fails

2017-03-02 Thread Kashyap Desai
Without this fix, the driver enables the INTx interrupt pin even though
MSI-x vectors are enabled. See the lspci output below: DisINTx is unset
(DisINTx-) for an MSI-x setup.

lspci -s 85:00.0 -vvv |grep INT |grep Control
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx-

After applying this fix, the driver enables the INTx interrupt pin only
if the legacy interrupt method is required. See the lspci output below:
DisINTx is now set (DisINTx+) for the MSI-x setup.
lspci -s 85:00.0 -vvv |grep INT |grep Control
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B- DisINTx+

Signed-off-by: Kashyap Desai 
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
b/drivers/scsi/megaraid/megaraid_sas_base.c
index 7ac9a9e..82a8ec8 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4990,6 +4990,7 @@ int megasas_set_crash_dump_params(struct megasas_instance 
*instance,
struct pci_dev *pdev;
 
pdev = instance->pdev;
+   pci_intx(pdev, 1);
instance->irq_context[0].instance = instance;
instance->irq_context[0].MSIxIndex = 0;
if (request_irq(pci_irq_vector(pdev, 0),
@@ -5277,10 +5278,6 @@ static int megasas_init_fw(struct megasas_instance 
*instance)
MPI2_REPLY_POST_HOST_INDEX_OFFSET);
}
 
-   i = pci_alloc_irq_vectors(instance->pdev, 1, 1, PCI_IRQ_LEGACY);
-   if (i < 0)
-   goto fail_setup_irqs;
-
dev_info(&instance->pdev->dev,
"firmware supports msix\t: (%d)", fw_msix_count);
dev_info(&instance->pdev->dev,
@@ -5494,7 +5491,6 @@ static int megasas_init_fw(struct megasas_instance 
*instance)
instance->instancet->disable_intr(instance);
 fail_init_adapter:
megasas_destroy_irqs(instance);
-fail_setup_irqs:
if (instance->msix_vectors)
pci_free_irq_vectors(instance->pdev);
instance->msix_vectors = 0;
-- 
1.8.3.1



RE: megasas: failed to get PD list

2014-07-18 Thread Kashyap Desai
Peter,

We will check with a Falcon MR controller and get back to you. In the
meantime, can you quickly check using a stable kernel instead of
upstream, just to see whether this is a regression?

~ Kashyap

> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Peter Andersson
> Sent: Friday, July 18, 2014 12:01 AM
> To: linux-scsi@vger.kernel.org
> Subject: megasas: failed to get PD list
>
> Hi.
> I'm having major problems with my "MegaRAID SAS 9240-81" raid
controller.
> Basically the kernel doesn't find my raid drives.
>
> Bios version: 4.38.02.0 (January 16 2014) The bios of the raid
controller shows
> a fully working (and initialized) raid 0.
>
> Kernel: 3.15.1-031501-generic (ubuntu server 14.04)
>
> This is the full dmesg regarding the card:
>
> [0.902723] megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014
> [0.902788] megasas: 0x1000:0x0073:0x1000:0x9240: bus 3:slot 0:func 0
> [0.903121] megasas: FW now in Ready state
> [0.903180] megaraid_sas :03:00.0: irq 80 for MSI/MSI-X
> [0.903188] megaraid_sas :03:00.0: [scsi0]: FW supports<0> MSIX
> vector,Online CPUs: <12>,Current MSIX <1>
> [0.924336] megasas_init_mfi: fw_support_ieee=67108864
> [0.924372] megasas: INIT adapter done
> [   72.855404] megasas: failed to get PD list
>
> lspci:
> 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008
> [Falcon] (rev 03)
>
> megaraidsas-status:
> -- Arrays informations --
> -- ID | Type | Size | Status
>
> -- Disks informations
> -- ID | Model | Status | Warnings
>
> After the drives are initialized i also get this message:
> Virtual drive handled by bios
>
> That's all i can think to be of interest to you.
> Could anyone give me some pointers on what could be wrong?
>
> Thanks!
>
> /Peter
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2 RESEND 14/23] megaraid: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-08-11 Thread Kashyap Desai
> -Original Message-
> From: Alexander Gordeev [mailto:agord...@redhat.com]
> Sent: Monday, August 11, 2014 1:40 PM
> To: Kashyap Desai; Neela Syam Kolli
> Cc: linux-scsi@vger.kernel.org; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; Christoph Hellwig
> Subject: Re: [PATCH v2 RESEND 14/23] megaraid: Use
> pci_enable_msix_range() instead of pci_enable_msix()
>
> On Wed, Jul 16, 2014 at 08:05:18PM +0200, Alexander Gordeev wrote:
> > As result of deprecation of MSI-X/MSI enablement functions
> > pci_enable_msix() and pci_enable_msi_block() all drivers using these
> > two interfaces need to be updated to use the new
> > pci_enable_msi_range()  or pci_enable_msi_exact() and
> > pci_enable_msix_range() or pci_enable_msix_exact() interfaces.
>
> Kashyap, Neela,
>
> Could you please reveiw this patch?

Reviewed this patch. Please consider it acked by me.
 Acked-by: Kashyap Desai 

>
> Thanks!
>
> > Signed-off-by: Alexander Gordeev 
> > Cc: Neela Syam Kolli 
> > Cc: linux-scsi@vger.kernel.org
> > Cc: linux-...@vger.kernel.org
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c |   20 +++-
> >  1 files changed, 7 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index ba06102..7a4e75e 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -4105,17 +4105,11 @@ static int megasas_init_fw(struct
> megasas_instance *instance)
> >  (unsigned
int)num_online_cpus());
> > for (i = 0; i < instance->msix_vectors; i++)
> > instance->msixentry[i].entry = i;
> > -   i = pci_enable_msix(instance->pdev, instance->msixentry,
> > -   instance->msix_vectors);
> > -   if (i >= 0) {
> > -   if (i) {
> > -   if (!pci_enable_msix(instance->pdev,
> > -instance->msixentry,
i))
> > -   instance->msix_vectors = i;
> > -   else
> > -   instance->msix_vectors = 0;
> > -   }
> > -   } else
> > +   i = pci_enable_msix_range(instance->pdev, instance-
> >msixentry,
> > + 1, instance->msix_vectors);
> > +   if (i)
> > +   instance->msix_vectors = i;
> > +   else
> > instance->msix_vectors = 0;
> >
> > dev_info(&instance->pdev->dev, "[scsi%d]: FW supports"
> > @@ -5135,8 +5129,8 @@ megasas_resume(struct pci_dev *pdev)
> >
> > /* Now re-enable MSI-X */
> > if (instance->msix_vectors &&
> > -   pci_enable_msix(instance->pdev, instance->msixentry,
> > -   instance->msix_vectors))
> > +   pci_enable_msix_exact(instance->pdev, instance->msixentry,
> > + instance->msix_vectors))
> > goto fail_reenable_msix;
> >
> > switch (instance->pdev->device) {
> > --
> > 1.7.7.6
> >


RE: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-18 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Friday, July 18, 2014 3:43 PM
> To: James Bottomley; linux-scsi@vger.kernel.org
> Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
Robert
> Elliott; Webb Scales; linux-ker...@vger.kernel.org
> Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
>
> This patch adds support for an alternate I/O path in the scsi midlayer
which
> uses the blk-mq infrastructure instead of the legacy request code.
>
> Use of blk-mq is fully transparent to drivers, although for now a host
> template field is provided to opt out of blk-mq usage in case any
unforseen
> incompatibilities arise.
>
> In general replacing the legacy request code with blk-mq is a simple and
> mostly mechanical transformation.  The biggest exception is the new code
> that deals with the fact the I/O submissions in blk-mq must happen from
> process context, which slightly complicates the I/O completion handler.
> The second biggest differences is that blk-mq is build around the
concept of
> preallocated requests that also include driver specific data, which in
SCSI
> context means the scsi_cmnd structure.  This completely avoids dynamic
> memory allocations for the fast path through I/O submission.
>
> Due the preallocated requests the MQ code path exclusively uses the
host-
> wide shared tag allocator instead of a per-LUN one.  This only affects
drivers
> actually using the block layer provided tag allocator instead of their
own.
> Unlike the old path blk-mq always provides a tag, although drivers don't
have
> to use it.
>
> For now the blk-mq path is disable by defauly and must be enabled using
the
> "use_blk_mq" module parameter.  Once the remaining work in the block
> layer to make blk-mq more suitable for slow devices is complete I hope
to
> make it the default and eventually even remove the old code path.
>
> Based on the earlier scsi-mq prototype by Nicholas Bellinger.
>
> Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking
and
> various sugestions and code contributions.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Webb Scales 
> Acked-by: Jens Axboe 
> Tested-by: Bart Van Assche 
> Tested-by: Robert Elliott 
> ---
>  drivers/scsi/hosts.c  |  35 +++-
>  drivers/scsi/scsi.c   |   5 +-
>  drivers/scsi/scsi_lib.c   | 464
> --
>  drivers/scsi/scsi_priv.h  |   3 +
>  drivers/scsi/scsi_scan.c  |   5 +-
>  drivers/scsi/scsi_sysfs.c |   2 +
>  include/scsi/scsi_host.h  |  18 +-
>  include/scsi/scsi_tcq.h   |  28 ++-
>  8 files changed, 488 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index
0632eee..6de80e3
> 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host
> *shost, struct device *dev,
>   goto fail;
>   }
>
> + if (shost_use_blk_mq(shost)) {
> + error = scsi_mq_setup_tags(shost);
> + if (error)
> + goto fail;
> + }
> +
> + /*
> +  * Note that we allocate the freelist even for the MQ case for
now,
> +  * as we need a command set aside for scsi_reset_provider.  Having
> +  * the full host freelist and one command available for that is a
> +  * little heavy-handed, but avoids introducing a special allocator
> +  * just for this.  Eventually the structure of scsi_reset_provider
> +  * will need a major overhaul.
> +  */
>   error = scsi_setup_command_freelist(shost);
>   if (error)
> - goto fail;
> + goto out_destroy_tags;
> +
>
>   if (!shost->shost_gendev.parent)
>   shost->shost_gendev.parent = dev ? dev : &platform_bus;
> @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost,
> struct device *dev,
>
>   error = device_add(&shost->shost_gendev);
>   if (error)
> - goto out;
> + goto out_destroy_freelist;
>
>   pm_runtime_set_active(&shost->shost_gendev);
>   pm_runtime_enable(&shost->shost_gendev);
> @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host
> *shost, struct device *dev,
>   device_del(&shost->shost_dev);
>   out_del_gendev:
>   device_del(&shost->shost_gendev);
> - out:
> + out_destroy_freelist:
>   scsi_destroy_command_freelist(shost);
> + out_destroy_tags:
> + if (shost_use_blk_mq(shost))
> + scsi_mq_destroy_tags(shost);
>   fail:
>   return error;
>  }
> @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device
> *dev)
>   }
>
>   scsi_destroy_command_freelist(shost);
> - if (shost->bqt)
> - blk_free_tags(shost->bqt);
> + if (shost_use_blk_mq(shost)) {
> + if (shost->tag_set.tags)
> + scsi_mq_destroy_tag

Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-19 Thread Kashyap Desai
On Tue, Aug 19, 2014 at 3:51 AM, Kashyap Desai
 wrote:
>
> > -Original Message-
> > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> > Sent: Friday, July 18, 2014 3:43 PM
> > To: James Bottomley; linux-scsi@vger.kernel.org
> > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
> Robert
> > Elliott; Webb Scales; linux-ker...@vger.kernel.org
> > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
> >
> > This patch adds support for an alternate I/O path in the scsi midlayer
> which
> > uses the blk-mq infrastructure instead of the legacy request code.
> >
> > Use of blk-mq is fully transparent to drivers, although for now a host
> > template field is provided to opt out of blk-mq usage in case any
> unforseen
> > incompatibilities arise.
> >
> > In general replacing the legacy request code with blk-mq is a simple and
> > mostly mechanical transformation.  The biggest exception is the new code
> > that deals with the fact the I/O submissions in blk-mq must happen from
> > process context, which slightly complicates the I/O completion handler.
> > The second biggest differences is that blk-mq is build around the
> concept of
> > preallocated requests that also include driver specific data, which in
> SCSI
> > context means the scsi_cmnd structure.  This completely avoids dynamic
> > memory allocations for the fast path through I/O submission.
> >
> > Due the preallocated requests the MQ code path exclusively uses the
> host-
> > wide shared tag allocator instead of a per-LUN one.  This only affects
> drivers
> > actually using the block layer provided tag allocator instead of their
> own.
> > Unlike the old path blk-mq always provides a tag, although drivers don't
> have
> > to use it.
> >
> > For now the blk-mq path is disable by defauly and must be enabled using
> the
> > "use_blk_mq" module parameter.  Once the remaining work in the block
> > layer to make blk-mq more suitable for slow devices is complete I hope
> to
> > make it the default and eventually even remove the old code path.
> >
> > Based on the earlier scsi-mq prototype by Nicholas Bellinger.
> >
> > Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking
> and
> > various sugestions and code contributions.
> >
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Hannes Reinecke 
> > Reviewed-by: Webb Scales 
> > Acked-by: Jens Axboe 
> > Tested-by: Bart Van Assche 
> > Tested-by: Robert Elliott 
> > ---
> >  drivers/scsi/hosts.c  |  35 +++-
> >  drivers/scsi/scsi.c   |   5 +-
> >  drivers/scsi/scsi_lib.c   | 464
> > --
> >  drivers/scsi/scsi_priv.h  |   3 +
> >  drivers/scsi/scsi_scan.c  |   5 +-
> >  drivers/scsi/scsi_sysfs.c |   2 +
> >  include/scsi/scsi_host.h  |  18 +-
> >  include/scsi/scsi_tcq.h   |  28 ++-
> >  8 files changed, 488 insertions(+), 72 deletions(-)
> >
> > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index
> 0632eee..6de80e3
> > 100644
> > --- a/drivers/scsi/hosts.c
> > +++ b/drivers/scsi/hosts.c
> > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host
> > *shost, struct device *dev,
> >   goto fail;
> >   }
> >
> > + if (shost_use_blk_mq(shost)) {
> > + error = scsi_mq_setup_tags(shost);
> > + if (error)
> > + goto fail;
> > + }
> > +
> > + /*
> > +  * Note that we allocate the freelist even for the MQ case for
> now,
> > +  * as we need a command set aside for scsi_reset_provider.  Having
> > +  * the full host freelist and one command available for that is a
> > +  * little heavy-handed, but avoids introducing a special allocator
> > +  * just for this.  Eventually the structure of scsi_reset_provider
> > +  * will need a major overhaul.
> > +  */
> >   error = scsi_setup_command_freelist(shost);
> >   if (error)
> > - goto fail;
> > + goto out_destroy_tags;
> > +
> >
> >   if (!shost->shost_gendev.parent)
> >   shost->shost_gendev.parent = dev ? dev : &platform_bus;
> > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost,
> > struct device *dev,
> >
> >   error = device_add(&shost->shost_gendev);
> >   if (erro

Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.

2014-08-19 Thread Kashyap Desai
On Tue, Aug 19, 2014 at 9:36 PM, Christoph Hellwig  wrote:
> On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote:
>> I read  this comment and find that very few drivers are using this
>> cmd_list.  I think if we remove this cmd_list, performance will scale as I
>> am seeing major contention in this lock.
>> Just thought to ping you to see if this is known limitation for now or any
>> plan to change this lock in near future ?
>
> Removing the lock entirely and pushing the list into the two drivers
> using it is on my TODO list.  Bart actually suggested keeping the code in the
> SCSI core and having a flag to enabled.  Given that I'm too busy to get my 
> full
> version done in time, it might be a good idea if someone picks up Barts
> idea.  Can you send me a patch to add a enable_cmd_list flag to the host
> template and only enable it for aacraid and dpt_i2o?
>

Sure. I will work on the relevant code change and post a patch for review.

-- 
Device Driver Developer @ Avagotech
Kashyap D. Desai
Note - my new email address
kashyap.de...@avagotech.com


RE: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to reduce lock contention

2014-08-20 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bvanass...@acm.org]
> Sent: Wednesday, August 20, 2014 4:18 PM
> To: kashyap.de...@avagotech.com; linux-scsi@vger.kernel.org
> Cc: aacr...@adaptec.com; elli...@hp.com; jbottom...@parallels.com;
> h...@infradead.org
> Subject: Re: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to
> reduce
> lock contention
>
> On 08/19/14 20:17, kashyap.de...@avagotech.com wrote:
> > +   if (shost->hostt->enable_cmd_list) {
>
> This code is in the hot path which means that caching "enable_cmd_list"
> in struct Scsi_Host (as is done for many other SCSI host parameters)
> probably
> will (slightly) improve performance further. Otherwise this patch looks
> fine to
> me.
I will send an updated patch which caches the host template field
"enable_cmd_list" in struct Scsi_Host for faster access.

~ Kashyap
>
> Bart.


RE: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to reduce lock contention

2014-08-20 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, August 20, 2014 6:11 PM
> To: Kashyap Desai
> Cc: Bart Van Assche; linux-scsi@vger.kernel.org; aacr...@adaptec.com;
> elli...@hp.com; jbottom...@parallels.com; h...@infradead.org
> Subject: Re: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to
reduce
> lock contention
>
> On Wed, Aug 20, 2014 at 06:08:37PM +0530, Kashyap Desai wrote:
> > > This code is in the hot path which means that caching
"enable_cmd_list"
> > > in struct Scsi_Host (as is done for many other SCSI host parameters)
> > > probably will (slightly) improve performance further. Otherwise this
> > > patch looks fine to me.
> > I will send updated patch which will cache host template field
> > "enable_cmd_list" for faster access in Scsi Host.
>
> Thanks.  It might be worth to only set in in the host in fact.

Fine. I will remove the host template entry and add code in the aacraid and
dpt_i2o drivers to set that value directly in struct Scsi_Host.

>
> Also please just remove the code about lock contention in
scsi_mq_prep_fn
> - the XXX really doesn't apply anymore and I think the code should be
self-
> explaining enough to not need a comment.

I will do this.

>
> Otherwise the patch looks good to me.


RE: [PATCH 02/11] megaraid_sas : Use writeq for 64bit pci write to avoid spinlock overhead

2014-09-10 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Tuesday, September 09, 2014 7:01 PM
> To: sumit.sax...@avagotech.com; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; h...@infradead.org;
> jbottom...@parallels.com; kashyap.de...@avagotech.com;
> aradf...@gmail.com
> Subject: Re: [PATCH 02/11] megaraid_sas : Use writeq for 64bit pci write
to
> avoid spinlock overhead
>
> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> > Use writeq() for 64bit PCI write instead of writel() to avoid
additional lock
> overhead.
> >
> > Signed-off-by: Sumit Saxena 
> > Signed-off-by: Kashyap Desai 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 8 
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > index 57b47fe..c69c1ac 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> > @@ -1065,6 +1065,13 @@ megasas_fire_cmd_fusion(struct
> megasas_instance *instance,
> > u32 req_desc_hi,
> > struct megasas_register_set __iomem *regs)
>
> Hi Sumit,
> the fn params are a bit confusing req_desc_lo is of type dma_addr_t and
> req_desc_hi is u32, is it possible to unite it in the future?

Agree, we should make changes here; we will do it in a separate patch.
Originally fire_cmd() was written for the MFI controllers and carried
forward for all generations. In MFI it used the second argument as a 32-bit
address and the third argument as a frame count, but the later Fusion
adapters started using them differently.

>
> >  {
> > +#if defined(writeq) && defined(CONFIG_64BIT)
>
> On a similar place mpt2sas(_base_writeq) uses only "#ifndef writeq"
> if it's incorrect fix it there too or remove the CONFIG_64 here

We would like to make the change in mpt2sas instead, as all of our code
carries the below check for writeq().
The original discussion started when we submitted this change in mpt2sas,
but we have had this delta from day one: the LSI/Avago internal source
carries the "#if defined(writeq) && defined(CONFIG_64BIT)" check in
mpt2sas.

I think writeq() is now implemented on all architectures, so we could
safely drop the #if defined(writeq) check. But we would like to keep the
check as it is, so that older distributions get the benefit directly
without us maintaining a separate patch.

>
> > +   u64 req_data = 0;
> > +
> > +   req_data = req_desc_hi;
> > +   req_data = ((req_data << 32) | (u32)req_desc_lo);
>
> This seems to be critical path (you are removing an spinlock to avoid
> overhead), so why do you have three consecutive assignments to the same
> variable?
> (~(u64 req_data = r_hi << 32 | r_lo))


Agreed. We will make this change and resubmit the patch to address it.

>
> Cheers,
> Tomas
>
> > +   writeq(le64_to_cpu(req_data), &(regs)-
> >inbound_low_queue_port);
> > +#else
> > unsigned long flags;
> >
> > spin_lock_irqsave(&instance->hba_lock, flags); @@ -1072,6 +1079,7
> @@
> > megasas_fire_cmd_fusion(struct megasas_instance *instance,
> > writel(le32_to_cpu(req_desc_lo), &(regs)-
> >inbound_low_queue_port);
> > writel(le32_to_cpu(req_desc_hi), &(regs)-
> >inbound_high_queue_port);
> > spin_unlock_irqrestore(&instance->hba_lock, flags);
> > +#endif
> >  }
> >
> >  /**


Re: [PATCH 08/11] megaraid_sas : Add module parameter to disable IRQ-CPU affinity hint

2014-09-10 Thread Kashyap Desai
On Wed, Sep 10, 2014 at 7:46 PM, Tomas Henzl  wrote:
> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
>> For certain deployment, we may need to disable irq cpu affinity hint.
>> This module parameter provides option for use to disable irq cpu affinity 
>> hint
>> and allow irqbalancer to handle the rest.
>
> Only curious , in which environments causes this which issues?

There are a few deployment cases where CPUs are managed via cgroups and the
user knows how they are going to use their system resources. For those
cases the driver's affinity hinting may not be useful, so for simplicity
customers may prefer this kind of module parameter.

Kashyap

>
> Thanks,
> Tomas
>
>>
>> Signed-off-by: Sumit Saxena 
>> Signed-off-by: Kashyap Desai 
>> ---
>>  drivers/scsi/megaraid/megaraid_sas_base.c | 60 
>> +++
>>  1 file changed, 38 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
>> b/drivers/scsi/megaraid/megaraid_sas_base.c
>> index 07255c1..086beee 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
>> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
>> @@ -89,6 +89,10 @@ module_param(resetwaittime, int, S_IRUGO);
>>  MODULE_PARM_DESC(resetwaittime, "Wait time in seconds after I/O timeout "
>>"before resetting adapter. Default: 180");
>>
>> +int smp_affinity_enable = 1;
>> +module_param(smp_affinity_enable, int, S_IRUGO);
>> +MODULE_PARM_DESC(smp_affinity_enable, "SMP affinity feature enable/disbale 
>> Default: enable(1)");
>> +
>>  MODULE_LICENSE("GPL");
>>  MODULE_VERSION(MEGASAS_VERSION);
>>  MODULE_AUTHOR("megaraidli...@lsi.com");
>> @@ -5160,8 +5164,9 @@ retry_irq_register:
>>   printk(KERN_DEBUG "megasas: Failed to "
>>  "register IRQ for vector %d.\n", i);
>>   for (j = 0; j < i; j++) {
>> - irq_set_affinity_hint(
>> - instance->msixentry[j].vector, 
>> NULL);
>> + if (smp_affinity_enable)
>> + irq_set_affinity_hint(
>> + 
>> instance->msixentry[j].vector, NULL);
>>   free_irq(
>>   instance->msixentry[j].vector,
>>   &instance->irq_context[j]);
>> @@ -5170,11 +5175,14 @@ retry_irq_register:
>>   instance->msix_vectors = 0;
>>   goto retry_irq_register;
>>   }
>> - if 
>> (irq_set_affinity_hint(instance->msixentry[i].vector,
>> - get_cpu_mask(cpu)))
>> - dev_err(&instance->pdev->dev, "Error setting"
>> - "affinity hint for cpu %d\n", cpu);
>> - cpu = cpumask_next(cpu, cpu_online_mask);
>> + if (smp_affinity_enable) {
>> + if 
>> (irq_set_affinity_hint(instance->msixentry[i].vector,
>> + get_cpu_mask(cpu)))
>> + dev_err(&instance->pdev->dev,
>> + "Error setting affinity hint "
>> + "for cpu %d\n", cpu);
>> + cpu = cpumask_next(cpu, cpu_online_mask);
>> + }
>>   }
>>   } else {
>>   instance->irq_context[0].instance = instance;
>> @@ -5233,8 +5241,9 @@ retry_irq_register:
>>   instance->instancet->disable_intr(instance);
>>   if (instance->msix_vectors)
>>   for (i = 0; i < instance->msix_vectors; i++) {
>> - irq_set_affinity_hint(
>> - instance->msixentry[i].vector, NULL);
>> + if (smp_affinity_enable)
>> + irq_set_affinity_hint(
>> + instance->msixentry[i].vector, NULL);
>>   free_irq(instance->msixentry[i].vector,
>>&instance->irq_context[i]);
>>   }
>&

Re: [PATCH 05/11] megaraid_sas : Extended VD support

2014-09-10 Thread Kashyap Desai
On Wed, Sep 10, 2014 at 7:20 PM, Tomas Henzl  wrote:
>
> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> > Current MegaRAID firmware and hence the driver only supported 64VDs.
> > E.g: If the user wants to create more than 64VD on a controller,
> > it is not possible on current firmware/driver.
> >
> > New feature and requirement to support upto 256VD, firmware/driver/apps 
> > need changes.
> > In addition to that there must be a backward compatibility of the new 
> > driver with the
> > older firmware and vice versa.
> >
> > RAID map is the interface between Driver and FW to fetch all required
> > fields(attributes) for each Virtual Drives.
> > In the earlier design driver was using the FW copy of RAID map where as
> > in the new design the Driver will keep the RAID map copy of its own; on 
> > which
> > it will operate for any raid map access in fast path.
> >
> > Local driver raid map copy will provide ease of access through out the code
> > and provide generic interface for future FW raid map changes.
> >
> > For the backward compatibility driver will notify FW that it supports 256VD
> > to the FW in driver capability field.
> > Based on the controller properly returned by the FW, the Driver will know
> > whether it supports 256VD or not and will copy the RAID map accordingly.
> >
> > At any given time, driver will always have old or new Raid map.
> > So with this changes, driver can also work in host lock less mode. Please
> > see next patch which enable host lock less mode for megaraid_sas driver.
> >
> > Signed-off-by: Sumit Saxena 
> > Signed-off-by: Kashyap Desai 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas.h|  73 +++---
> >  drivers/scsi/megaraid/megaraid_sas_base.c   | 205 
> > 
> >  drivers/scsi/megaraid/megaraid_sas_fp.c | 195 
> > ++
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 118 
> >  drivers/scsi/megaraid/megaraid_sas_fusion.h |  95 -
> >  5 files changed, 502 insertions(+), 184 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
> > b/drivers/scsi/megaraid/megaraid_sas.h
> > index e0f03e2..5dedf09 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas.h
> > +++ b/drivers/scsi/megaraid/megaraid_sas.h
> > @@ -390,7 +390,6 @@ enum MR_LD_QUERY_TYPE {
> >  #define MR_EVT_FOREIGN_CFG_IMPORTED 0x00db
> >  #define MR_EVT_LD_OFFLINE   0x00fc
> >  #define MR_EVT_CTRL_HOST_BUS_SCAN_REQUESTED 0x0152
> > -#define MAX_LOGICAL_DRIVES   64
> >
> >  enum MR_PD_STATE {
> >   MR_PD_STATE_UNCONFIGURED_GOOD   = 0x00,
> > @@ -468,14 +467,14 @@ struct MR_LD_LIST {
> >   u8  state;
> >   u8  reserved[3];
> >   u64 size;
> > - } ldList[MAX_LOGICAL_DRIVES];
> > + } ldList[MAX_LOGICAL_DRIVES_EXT];
> >  } __packed;
> >
> >  struct MR_LD_TARGETID_LIST {
> >   u32 size;
> >   u32 count;
> >   u8  pad[3];
> > - u8  targetId[MAX_LOGICAL_DRIVES];
> > + u8  targetId[MAX_LOGICAL_DRIVES_EXT];
> >  };
> >
> >
> > @@ -941,6 +940,15 @@ struct megasas_ctrl_info {
> >   * HA cluster information
> >   */
> >   struct {
> > +#if defined(__BIG_ENDIAN_BITFIELD)
> > + u32 reserved:26;
> > + u32 premiumFeatureMismatch:1;
> > + u32 ctrlPropIncompatible:1;
> > + u32 fwVersionMismatch:1;
> > + u32 hwIncompatible:1;
> > + u32 peerIsIncompatible:1;
> > + u32 peerIsPresent:1;
> > +#else
> >   u32 peerIsPresent:1;
> >   u32 peerIsIncompatible:1;
> >   u32 hwIncompatible:1;
> > @@ -948,6 +956,7 @@ struct megasas_ctrl_info {
> >   u32 ctrlPropIncompatible:1;
> >   u32 premiumFeatureMismatch:1;
> >   u32 reserved:26;
> > +#endif
> >   } cluster;
> >
> >   char clusterId[16]; /*7D4h */
> > @@ -962,9 +971,17 @@ struct megasas_ctrl_info {
> >  #if defined(__BIG_ENDIAN_BITFIELD)
> >   u32 reserved:25;
> >   u32 supportCrashDump:1;
> > - u32 reserved1:6;
> > + u32 supportMaxExtLDs:1;
> > + 

Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix

2014-09-10 Thread Kashyap Desai
On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl  wrote:
> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
>> Problem statement:
>> MFI link list in megaraid_sas driver is used from mfi-mpt pass-through 
>> commands.
>> This list can be corrupted due to many possible race conditions in driver and
>> eventually we may see kernel panic.
>>
>> One example -
>> MFI frame is freed from calling process as driver send command via polling 
>> method and interrupt
>> for that command comes after driver free mfi frame (actually even after some 
>> other context reuse
>> the mfi frame). When driver receive MPT frame in ISR, driver will be using 
>> the index of MFI and
>> access that MFI frame and finally in-used MFI frame’s list will be corrupted.
>>
>> High level description of new solution -
>> Free MFI and MPT command from same context.
>> Free both the command either from process (from where mfi-mpt pass-through 
>> was called) or from
>> ISR context. Do not split freeing of MFI and MPT, because it creates the 
>> race condition which
>> will do MFI/MPT list corruption.
>>
>> Renamed the cmd_pool_lock which is used in instance as well as fusion with 
>> below name.
>> mfi_pool_lock and mpt_pool_lock to add more code readability.
>>
>> Signed-off-by: Sumit Saxena 
>> Signed-off-by: Kashyap Desai 
>> ---
>>  drivers/scsi/megaraid/megaraid_sas.h|  25 +++-
>>  drivers/scsi/megaraid/megaraid_sas_base.c   | 196 
>> 
>>  drivers/scsi/megaraid/megaraid_sas_fusion.c |  95 ++
>>  drivers/scsi/megaraid/megaraid_sas_fusion.h |   2 +-
>>  4 files changed, 235 insertions(+), 83 deletions(-)
>>
>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
>> b/drivers/scsi/megaraid/megaraid_sas.h
>> index 156d4b9..f99db18 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas.h
>> +++ b/drivers/scsi/megaraid/megaraid_sas.h
>> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info {
>>
>>  #define VD_EXT_DEBUG 0
>>
>> +enum MR_MFI_MPT_PTHR_FLAGS {
>> + MFI_MPT_DETACHED = 0,
>> + MFI_LIST_ADDED = 1,
>> + MFI_MPT_ATTACHED = 2,
>> +};
>> +
>>  /* Frame Type */
>>  #define IO_FRAME 0
>>  #define PTHRU_FRAME  1
>> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info {
>>  #define MEGASAS_IOCTL_CMD0
>>  #define MEGASAS_DEFAULT_CMD_TIMEOUT  90
>>  #define MEGASAS_THROTTLE_QUEUE_DEPTH 16
>> -
>> +#define MEGASAS_BLOCKED_CMD_TIMEOUT  60
>>  /*
>>   * FW reports the maximum of number of commands that it can accept (maximum
>>   * commands that can be outstanding) at any time. The driver must report a
>> @@ -1652,7 +1658,7 @@ struct megasas_instance {
>>   struct megasas_cmd **cmd_list;
>>   struct list_head cmd_pool;
>>   /* used to sync fire the cmd to fw */
>> - spinlock_t cmd_pool_lock;
>> + spinlock_t mfi_pool_lock;
>>   /* used to sync fire the cmd to fw */
>>   spinlock_t hba_lock;
>>   /* used to synch producer, consumer ptrs in dpc */
>> @@ -1839,6 +1845,11 @@ struct megasas_cmd {
>>
>>   struct list_head list;
>>   struct scsi_cmnd *scmd;
>> +
>> + void *mpt_pthr_cmd_blocked;
>> + atomic_t mfi_mpt_pthr;
>> + u8 is_wait_event;
>> +
>>   struct megasas_instance *instance;
>>   union {
>>   struct {
>> @@ -1927,4 +1938,14 @@ int megasas_set_crash_dump_params(struct 
>> megasas_instance *instance,
>>  void megasas_free_host_crash_buffer(struct megasas_instance *instance);
>>  void megasas_fusion_crash_dump_wq(struct work_struct *work);
>>
>> +void megasas_return_cmd_fusion(struct megasas_instance *instance,
>> + struct megasas_cmd_fusion *cmd);
>> +int megasas_issue_blocked_cmd(struct megasas_instance *instance,
>> + struct megasas_cmd *cmd, int timeout);
>> +void __megasas_return_cmd(struct megasas_instance *instance,
>> + struct megasas_cmd *cmd);
>> +
>> +void megasas_return_mfi_mpt_pthr(struct megasas_instance *instance,
>> + struct megasas_cmd *cmd_mfi, struct megasas_cmd_fusion *cmd_fusion);
>> +
>>  #endif   /*LSI_MEGARAID_SAS_H */
>> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
>> b/drivers/scsi/megaraid/megaraid_sas_base.c
>> index 086beee..50d69eb 100644
>> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
>> +++ b/drivers/scsi/meg

RE: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support

2014-09-11 Thread Kashyap Desai
> -Original Message-
> From: Vivek Goyal [mailto:vgo...@redhat.com]
> Sent: Wednesday, September 10, 2014 9:17 PM
> To: Tomas Henzl
> Cc: Elliott, Robert (Server Storage); Sumit Saxena;
linux-scsi@vger.kernel.org;
> martin.peter...@oracle.com; h...@infradead.org;
> jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal
> Schmidt; am...@mellanox.com; Jens Axboe 
> (ax...@kernel.dk); scame...@beardog.cce.hp.com
> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature
> support
>
> On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote:
> > On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote:
> > >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> > >> ow...@vger.kernel.org] On Behalf Of Sumit Saxena
> > >>
> > >>> From: Tomas Henzl [mailto:the...@redhat.com]
> > >>>
> > >>> With several controllers in a system this may take a lot memory,
> > >>> could you also in case when a kdump kernel is running lower it, by
> > >>> not using this feature?
> > >>>
> > >> Agreed, we will disable this feature for kdump kernel by adding
> > >> "reset_devices" global varaiable.
> > >> That check is required for only one place, throughout the code,
> > >> this feature will remain disabled.  Code snippet for the same-
> > >>
> > >> instance->crash_dump_drv_support = (!reset_devices) &&
> > >> crashdump_enable &&
> > >> instance->crash_dump_fw_support &&
> > >> instance->crash_dump_buf);
> > >> if(instance->crash_dump_drv_support) {
> > >> printk(KERN_INFO "megaraid_sas: FW Crash dump is
> > >> supported\n");
> > >> megasas_set_crash_dump_params(instance,
> > >> MR_CRASH_BUF_TURN_OFF);
> > >>
> > >> } else {
> > >> ..
> > >> }
> > > Network drivers have been running into similar problems.
> > >
> > > There's a new patch from Amir coming through net-next to make
> > > is_kdump_kernel() (in crash_dump.h) accessible to modules.
> > > That may be a better signal than reset_devices that the driver
> > > should use minimal resources.
> > >
> > > http://comments.gmane.org/gmane.linux.network/324737
> > >
> > > I'm not sure about the logistics of a SCSI patch depending on a
> > > net-next patch.
> >
> > Probably better to start with reset_devices and switch to
> > is_kdump_kernel() later.
> > This is not a discussion about reset_devices versus is_kdump_kernel,
> > but while it looks good to have it distinguished - is the
> > reset_devices actually used anywhere else than in kdump kernel?
>
> I think usage of reset_devices for lowering memory footprint of driver
is
> plain wrong. It tells driver to only reset the device as BIOS might not
have
> done it right or we skipped BIOS completely.
>
> Using is_kdump_kernel() is also not perfect either but atleast better
than
> reset_devices.

We will use is_kdump_kernel() to keep this feature disabled in the kdump kernel.

The MegaRAID driver will not allocate host memory (which is used to collect
the complete FW crash image) unless the associated FW has crashed or an I/O
timeout has occurred; the driver allocates only when required. It also stops
as soon as a memory allocation fails and operates on the memory obtained so
far.

To be on the safer side, we can disable this feature in kdump mode.
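The gating being proposed can be sketched as follows. This is an illustrative userspace model, not the actual megaraid_sas code: `crash_dump_supported` and its parameters are hypothetical names standing in for the driver's `crash_dump_drv_support` computation with `is_kdump_kernel()` substituted for `reset_devices`.

```c
#include <stdbool.h>

/* Illustrative sketch only: models how the driver-side support flag
 * would be computed once is_kdump_kernel() replaces reset_devices.
 * All names are hypothetical, not actual megaraid_sas symbols. */
static bool crash_dump_supported(bool in_kdump_kernel,
                                 bool module_param_enable,
                                 bool fw_support,
                                 bool buf_allocated)
{
    /* The feature stays off in a kdump (capture) kernel to keep the
     * capture kernel's memory footprint minimal. */
    return !in_kdump_kernel && module_param_enable &&
           fw_support && buf_allocated;
}
```

The only behavioral change versus the quoted snippet is the first operand: the kdump check is keyed on being the capture kernel rather than on the `reset_devices` boot parameter.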

~ Kashyap
>
> Thanks
> Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support

2014-09-11 Thread Kashyap Desai
On Thu, Sep 11, 2014 at 4:50 PM, Tomas Henzl  wrote:
> On 09/11/2014 11:02 AM, Kashyap Desai wrote:
>>> -Original Message-
>>> From: Vivek Goyal [mailto:vgo...@redhat.com]
>>> Sent: Wednesday, September 10, 2014 9:17 PM
>>> To: Tomas Henzl
>>> Cc: Elliott, Robert (Server Storage); Sumit Saxena;
>> linux-scsi@vger.kernel.org;
>>> martin.peter...@oracle.com; h...@infradead.org;
>>> jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal
>>> Schmidt; am...@mellanox.com; Jens Axboe 
>>> (ax...@kernel.dk); scame...@beardog.cce.hp.com
>>> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature
>>> support
>>>
>>> On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote:
>>>> On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote:
>>>>>> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
>>>>>> ow...@vger.kernel.org] On Behalf Of Sumit Saxena
>>>>>>
>>>>>>> From: Tomas Henzl [mailto:the...@redhat.com]
>>>>>>>
>>>>>>> With several controllers in a system this may take a lot memory,
>>>>>>> could you also in case when a kdump kernel is running lower it, by
>>>>>>> not using this feature?
>>>>>>>
>>>>>> Agreed, we will disable this feature for kdump kernel by adding
>>>>>> "reset_devices" global varaiable.
>>>>>> That check is required for only one place, throughout the code,
>>>>>> this feature will remain disabled.  Code snippet for the same-
>>>>>>
>>>>>> instance->crash_dump_drv_support = (!reset_devices) &&
>>>>>> crashdump_enable &&
>>>>>> instance->crash_dump_fw_support &&
>>>>>> instance->crash_dump_buf);
>>>>>> if(instance->crash_dump_drv_support) {
>>>>>> printk(KERN_INFO "megaraid_sas: FW Crash dump is
>>>>>> supported\n");
>>>>>> megasas_set_crash_dump_params(instance,
>>>>>> MR_CRASH_BUF_TURN_OFF);
>>>>>>
>>>>>> } else {
>>>>>> ..
>>>>>> }
>>>>> Network drivers have been running into similar problems.
>>>>>
>>>>> There's a new patch from Amir coming through net-next to make
>>>>> is_kdump_kernel() (in crash_dump.h) accessible to modules.
>>>>> That may be a better signal than reset_devices that the driver
>>>>> should use minimal resources.
>>>>>
>>>>> http://comments.gmane.org/gmane.linux.network/324737
>>>>>
>>>>> I'm not sure about the logistics of a SCSI patch depending on a
>>>>> net-next patch.
>>>> Probably better to start with reset_devices and switch to
>>>> is_kdump_kernel() later.
>>>> This is not a discussion about reset_devices versus is_kdump_kernel,
>>>> but while it looks good to have it distinguished - is the
>>>> reset_devices actually used anywhere else than in kdump kernel?
>>> I think usage of reset_devices for lowering memory footprint of driver
>> is
>>> plain wrong. It tells driver to only reset the device as BIOS might not
>> have
>>> done it right or we skipped BIOS completely.
>>>
>>> Using is_kdump_kernel() is also not perfect either but atleast better
>> than
>>> reset_devices.
>> We will use is_kdump_kernel() not to enable this feature in Kdump kernel.
>
> OK, just keep in mind that the is_kdump_kernel() is not yet in mainline.

Sure, I read that part, but when I checked the kernel source I found that
is_kdump_kernel() is already available:

http://lxr.free-electrons.com/source/include/linux/crash_dump.h#L55

Let's do one thing - we will submit a separate patch for this to avoid
any confusion.


>
>>
>> MegaRaid Driver will not allocate Host memory (which is used to collect
>> complete FW crash image) until and unless associated FW is crashed/IO
>> timeout.
>
> This is new for the driver? A previous reply stated that the driver will
> allocate 1MB for each MR controller at the 'start of the day'.
> If I misunderstood it somehow then nothing is needed.

This is correct: a 1MB DMA buffer per controller is allocated irrespective
of whether the FW has crashed. When the FW actually crashes, the driver will
then try to allocate up to 512MB of memory. I thought your query was about
the 512MB.
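The on-demand growth described here (allocate in chunks up to a ceiling, stop at the first failure, operate on whatever was obtained) can be sketched like this. `alloc_crash_chunks`, `fake_alloc_mb`, and the 1MB/512MB constants are illustrative stand-ins, not the actual driver code:

```c
/* Illustrative model of on-demand crash-buffer growth: try to reach a
 * 512MB ceiling in 1MB chunks, stop at the first allocation failure,
 * and operate on whatever memory was obtained. */
#define CRASH_CHUNK_MB 1
#define CRASH_MAX_MB   512

/* alloc_mb() stands in for a real host-memory allocation attempt;
 * it returns nonzero on success. */
static int alloc_crash_chunks(int (*alloc_mb)(void))
{
    int got = 0;

    while (got < CRASH_MAX_MB) {
        if (!alloc_mb())   /* failure: keep what we already have */
            break;
        got += CRASH_CHUNK_MB;
    }
    return got;
}

/* Fake allocator for demonstration: succeeds while budget remains. */
static int budget_mb;
static int fake_alloc_mb(void) { return budget_mb-- > 0; }
```

The point is that a partial allocation is not an error path here; the driver simply captures as much of the FW crash image as the obtained buffers allow.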

>
>> Driver do allocation only if required. Also, it will break the function
>> immediately after memory allocation is failed and operate on available
>> memory.
>>
>> To be on safer side, we can disable this feature in Kdump mode.
>>
>> ~ Kashyap
>>> Thanks
>>> Vivek
>



-- 
Device Driver Developer @ Avagotech
Kashyap D. Desai
Note - my new email address
kashyap.de...@avagotech.com


Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix

2014-09-11 Thread Kashyap Desai
On Thu, Sep 11, 2014 at 5:53 PM, Tomas Henzl  wrote:
> On 09/11/2014 04:48 AM, Kashyap Desai wrote:
>> On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl  wrote:
>>> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
>>>> Problem statement:
>>>> MFI link list in megaraid_sas driver is used from mfi-mpt pass-through 
>>>> commands.
>>>> This list can be corrupted due to many possible race conditions in driver 
>>>> and
>>>> eventually we may see kernel panic.
>>>>
>>>> One example -
>>>> MFI frame is freed from calling process as driver send command via polling 
>>>> method and interrupt
>>>> for that command comes after driver free mfi frame (actually even after 
>>>> some other context reuse
>>>> the mfi frame). When driver receive MPT frame in ISR, driver will be using 
>>>> the index of MFI and
>>>> access that MFI frame and finally in-used MFI frame’s list will be 
>>>> corrupted.
>>>>
>>>> High level description of new solution -
>>>> Free MFI and MPT command from same context.
>>>> Free both the command either from process (from where mfi-mpt pass-through 
>>>> was called) or from
>>>> ISR context. Do not split freeing of MFI and MPT, because it creates the 
>>>> race condition which
>>>> will do MFI/MPT list corruption.
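The "free both from the same context" rule in the quoted description can be modeled as a single atomic handoff: whichever context (submitting process or ISR) wins the handoff frees both frames, and the loser backs off. A toy sketch, with hypothetical names loosely echoing the patch's `mfi_mpt_pthr` state flag:

```c
#include <stdbool.h>

/* Toy model of the fix: the MFI frame and its paired MPT command are
 * released together by whichever context wins the handoff, never
 * split across contexts.  Everything here is illustrative; in the
 * driver the handoff would be an atomic operation under the pool
 * lock. */
enum pthr_state { SIM_DETACHED, SIM_ATTACHED };

struct sim_cmd {
    enum pthr_state state;   /* loosely models atomic_t mfi_mpt_pthr */
    bool mfi_freed;
    bool mpt_freed;
};

/* Returns true if this context won the right to free both frames. */
static bool release_pair(struct sim_cmd *c)
{
    if (c->state != SIM_ATTACHED)
        return false;        /* other context already freed the pair */
    c->state = SIM_DETACHED; /* would be an atomic cmpxchg in-kernel */
    c->mfi_freed = true;     /* free both from the same context ...  */
    c->mpt_freed = true;     /* ... so the MFI pool list stays sane  */
    return true;
}
```

Because only one context can observe the attached state, the race in which the ISR reuses an MFI index that the polling path already freed cannot corrupt the list in this model.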
>>>>
>>>> Renamed the cmd_pool_lock which is used in instance as well as fusion with 
>>>> below name.
>>>> mfi_pool_lock and mpt_pool_lock to add more code readability.
>>>>
>>>> Signed-off-by: Sumit Saxena 
>>>> Signed-off-by: Kashyap Desai 
>>>> ---
>>>>  drivers/scsi/megaraid/megaraid_sas.h|  25 +++-
>>>>  drivers/scsi/megaraid/megaraid_sas_base.c   | 196 
>>>> 
>>>>  drivers/scsi/megaraid/megaraid_sas_fusion.c |  95 ++
>>>>  drivers/scsi/megaraid/megaraid_sas_fusion.h |   2 +-
>>>>  4 files changed, 235 insertions(+), 83 deletions(-)
>>>>
>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h 
>>>> b/drivers/scsi/megaraid/megaraid_sas.h
>>>> index 156d4b9..f99db18 100644
>>>> --- a/drivers/scsi/megaraid/megaraid_sas.h
>>>> +++ b/drivers/scsi/megaraid/megaraid_sas.h
>>>> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info {
>>>>
>>>>  #define VD_EXT_DEBUG 0
>>>>
>>>> +enum MR_MFI_MPT_PTHR_FLAGS {
>>>> + MFI_MPT_DETACHED = 0,
>>>> + MFI_LIST_ADDED = 1,
>>>> + MFI_MPT_ATTACHED = 2,
>>>> +};
>>>> +
>>>>  /* Frame Type */
>>>>  #define IO_FRAME 0
>>>>  #define PTHRU_FRAME  1
>>>> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info {
>>>>  #define MEGASAS_IOCTL_CMD0
>>>>  #define MEGASAS_DEFAULT_CMD_TIMEOUT  90
>>>>  #define MEGASAS_THROTTLE_QUEUE_DEPTH 16
>>>> -
>>>> +#define MEGASAS_BLOCKED_CMD_TIMEOUT  60
>>>>  /*
>>>>   * FW reports the maximum of number of commands that it can accept 
>>>> (maximum
>>>>   * commands that can be outstanding) at any time. The driver must report a
>>>> @@ -1652,7 +1658,7 @@ struct megasas_instance {
>>>>   struct megasas_cmd **cmd_list;
>>>>   struct list_head cmd_pool;
>>>>   /* used to sync fire the cmd to fw */
>>>> - spinlock_t cmd_pool_lock;
>>>> + spinlock_t mfi_pool_lock;
>>>>   /* used to sync fire the cmd to fw */
>>>>   spinlock_t hba_lock;
>>>>   /* used to synch producer, consumer ptrs in dpc */
>>>> @@ -1839,6 +1845,11 @@ struct megasas_cmd {
>>>>
>>>>   struct list_head list;
>>>>   struct scsi_cmnd *scmd;
>>>> +
>>>> + void *mpt_pthr_cmd_blocked;
>>>> + atomic_t mfi_mpt_pthr;
>>>> + u8 is_wait_event;
>>>> +
>>>>   struct megasas_instance *instance;
>>>>   union {
>>>>   struct {
>>>> @@ -1927,4 +1938,14 @@ int megasas_set_crash_dump_params(struct 
>>>> megasas_instance *instance,
>>>>  void megasas_free_host_crash_buffer(struct megasas_instance *instance);
>>>>  void megasas_fusion_crash_dump_wq(struct work

Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support

2014-09-11 Thread Kashyap Desai
On Fri, Sep 12, 2014 at 12:28 AM, Tomas Henzl  wrote:
> On 09/11/2014 06:39 PM, Kashyap Desai wrote:
>> On Thu, Sep 11, 2014 at 4:50 PM, Tomas Henzl  wrote:
>>> On 09/11/2014 11:02 AM, Kashyap Desai wrote:
>>>>> -Original Message-
>>>>> From: Vivek Goyal [mailto:vgo...@redhat.com]
>>>>> Sent: Wednesday, September 10, 2014 9:17 PM
>>>>> To: Tomas Henzl
>>>>> Cc: Elliott, Robert (Server Storage); Sumit Saxena;
>>>> linux-scsi@vger.kernel.org;
>>>>> martin.peter...@oracle.com; h...@infradead.org;
>>>>> jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal
>>>>> Schmidt; am...@mellanox.com; Jens Axboe 
>>>>> (ax...@kernel.dk); scame...@beardog.cce.hp.com
>>>>> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature
>>>>> support
>>>>>
>>>>> On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote:
>>>>>> On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote:
>>>>>>>> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
>>>>>>>> ow...@vger.kernel.org] On Behalf Of Sumit Saxena
>>>>>>>>
>>>>>>>>> From: Tomas Henzl [mailto:the...@redhat.com]
>>>>>>>>>
>>>>>>>>> With several controllers in a system this may take a lot memory,
>>>>>>>>> could you also in case when a kdump kernel is running lower it, by
>>>>>>>>> not using this feature?
>>>>>>>>>
>>>>>>>> Agreed, we will disable this feature for kdump kernel by adding
>>>>>>>> "reset_devices" global varaiable.
>>>>>>>> That check is required for only one place, throughout the code,
>>>>>>>> this feature will remain disabled.  Code snippet for the same-
>>>>>>>>
>>>>>>>> instance->crash_dump_drv_support = (!reset_devices) &&
>>>>>>>> crashdump_enable &&
>>>>>>>> instance->crash_dump_fw_support &&
>>>>>>>> instance->crash_dump_buf);
>>>>>>>> if(instance->crash_dump_drv_support) {
>>>>>>>> printk(KERN_INFO "megaraid_sas: FW Crash dump is
>>>>>>>> supported\n");
>>>>>>>> megasas_set_crash_dump_params(instance,
>>>>>>>> MR_CRASH_BUF_TURN_OFF);
>>>>>>>>
>>>>>>>> } else {
>>>>>>>> ..
>>>>>>>> }
>>>>>>> Network drivers have been running into similar problems.
>>>>>>>
>>>>>>> There's a new patch from Amir coming through net-next to make
>>>>>>> is_kdump_kernel() (in crash_dump.h) accessible to modules.
>>>>>>> That may be a better signal than reset_devices that the driver
>>>>>>> should use minimal resources.
>>>>>>>
>>>>>>> http://comments.gmane.org/gmane.linux.network/324737
>>>>>>>
>>>>>>> I'm not sure about the logistics of a SCSI patch depending on a
>>>>>>> net-next patch.
>>>>>> Probably better to start with reset_devices and switch to
>>>>>> is_kdump_kernel() later.
>>>>>> This is not a discussion about reset_devices versus is_kdump_kernel,
>>>>>> but while it looks good to have it distinguished - is the
>>>>>> reset_devices actually used anywhere else than in kdump kernel?
>>>>> I think usage of reset_devices for lowering memory footprint of driver
>>>> is
>>>>> plain wrong. It tells driver to only reset the device as BIOS might not
>>>> have
>>>>> done it right or we skipped BIOS completely.
>>>>>
>>>>> Using is_kdump_kernel() is also not perfect either but atleast better
>>>> than
>>>>> reset_devices.
>>>> We will use is_kdump_kernel() not to enable this feature in Kdump kernel.
>>> OK, just keep in mind that the is_kdump_kernel() is not yet in mainline.
>> Sure I read that part, but when I check kernel source I found
>> is_kdump_kernel() is available.
>&

RE: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix

2014-09-11 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Friday, September 12, 2014 12:50 AM
> To: Kashyap Desai
> Cc: Sumit Saxena; linux-scsi@vger.kernel.org; Martin K. Petersen;
> Christoph
> Hellwig; jbottom...@parallels.com; aradford
> Subject: Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption
> fix
>
> On 09/11/2014 08:41 PM, Kashyap Desai wrote:
> > On Thu, Sep 11, 2014 at 5:53 PM, Tomas Henzl 
> wrote:
> >> On 09/11/2014 04:48 AM, Kashyap Desai wrote:
> >>> On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl 
> wrote:
> >>>> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote:
> >>>>> Problem statement:
> >>>>> MFI link list in megaraid_sas driver is used from mfi-mpt
> >>>>> pass-through
> commands.
> >>>>> This list can be corrupted due to many possible race conditions in
> >>>>> driver and eventually we may see kernel panic.
> >>>>>
> >>>>> One example -
> >>>>> MFI frame is freed from calling process as driver send command via
> >>>>> polling method and interrupt for that command comes after driver
> >>>>> free mfi frame (actually even after some other context reuse the
> >>>>> mfi frame). When driver receive MPT frame in ISR, driver will be
> >>>>> using
> the index of MFI and access that MFI frame and finally in-used MFI frame’s
> list will be corrupted.
> >>>>>
> >>>>> High level description of new solution - Free MFI and MPT command
> >>>>> from same context.
> >>>>> Free both the command either from process (from where mfi-mpt
> >>>>> pass-through was called) or from ISR context. Do not split freeing
> >>>>> of MFI and MPT, because it creates the race condition which will do
> MFI/MPT list corruption.
> >>>>>
> >>>>> Renamed the cmd_pool_lock which is used in instance as well as
> fusion with below name.
> >>>>> mfi_pool_lock and mpt_pool_lock to add more code readability.
> >>>>>
> >>>>> Signed-off-by: Sumit Saxena 
> >>>>> Signed-off-by: Kashyap Desai 
> >>>>> ---
> >>>>>  drivers/scsi/megaraid/megaraid_sas.h|  25 +++-
> >>>>>  drivers/scsi/megaraid/megaraid_sas_base.c   | 196
> 
> >>>>>  drivers/scsi/megaraid/megaraid_sas_fusion.c |  95 ++
> >>>>>  drivers/scsi/megaraid/megaraid_sas_fusion.h |   2 +-
> >>>>>  4 files changed, 235 insertions(+), 83 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> >>>>> b/drivers/scsi/megaraid/megaraid_sas.h
> >>>>> index 156d4b9..f99db18 100644
> >>>>> --- a/drivers/scsi/megaraid/megaraid_sas.h
> >>>>> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> >>>>> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info {
> >>>>>
> >>>>>  #define VD_EXT_DEBUG 0
> >>>>>
> >>>>> +enum MR_MFI_MPT_PTHR_FLAGS {
> >>>>> + MFI_MPT_DETACHED = 0,
> >>>>> + MFI_LIST_ADDED = 1,
> >>>>> + MFI_MPT_ATTACHED = 2,
> >>>>> +};
> >>>>> +
> >>>>>  /* Frame Type */
> >>>>>  #define IO_FRAME 0
> >>>>>  #define PTHRU_FRAME  1
> >>>>> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info {
> >>>>>  #define MEGASAS_IOCTL_CMD0
> >>>>>  #define MEGASAS_DEFAULT_CMD_TIMEOUT  90
> >>>>>  #define MEGASAS_THROTTLE_QUEUE_DEPTH 16
> >>>>> -
> >>>>> +#define MEGASAS_BLOCKED_CMD_TIMEOUT  60
> >>>>>  /*
> >>>>>   * FW reports the maximum of number of commands that it can
> accept (maximum
> >>>>>   * commands that can be outstanding) at any time. The driver must
> >>>>> report a @@ -1652,7 +1658,7 @@ struct megasas_instance {
> >>>>>   struct megasas_cmd **cmd_list;
> >>>>>   struct list_head cmd_pool;
> >>>>>   /* used to sync fire the cmd to fw */
> >>>>> - spinlock_t cmd_pool_lock;
> >>>>> + spinlock_t mfi_pool_lock;
> >>>&

RE: [PATCH] scsi: megaraid_sas: Prevent future %p disaster

2015-02-10 Thread Kashyap Desai
Acked-by: Kashyap Desai 

> -Original Message-
> From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk]
> Sent: Friday, February 06, 2015 8:04 PM
> To: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley
> Cc: Rasmus Villemoes; megaraidlinux@avagotech.com; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] scsi: megaraid_sas: Prevent future %p disaster
>
> There is currently no %po format extension, so currently the letters
"on" are
> simply skipped and the pointer is printed as expected (while missing the
word
> on). However, it is probably only a matter of time before someone comes
up
> with a %po extension, at which point this is likely to fail
spectacularly.
>
> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index ff283d23788a..e7c6b9c946d6 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -3105,7 +3105,7 @@ megasas_internal_reset_defer_cmds(struct
> megasas_instance *instance)
>   for (i = 0; i < max_cmd; i++) {
>   cmd = instance->cmd_list[i];
>   if (cmd->sync_cmd == 1 || cmd->scmd) {
> - printk(KERN_NOTICE "megasas: moving
> cmd[%d]:%p:%d:%p"
> + printk(KERN_NOTICE "megasas: moving
> cmd[%d]:%p:%d:%p "
>   "on the defer queue as
internal\n",
>   defer_index, cmd, cmd->sync_cmd, cmd-
> >scmd);
>
> --
> 2.1.3
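The hazard the patch guards against comes from plain C string-literal concatenation: the two adjacent literals join with no space, so the characters after the final %p become "%pon", which a future %po printk extension would try to interpret as a conversion. A userspace demonstration of the before/after formats (standard C only; no kernel code involved):

```c
#include <string.h>

/* Adjacent string literals concatenate at compile time, so without
 * the trailing space the format effectively contains "%pon". */
static const char *fmt_before =
    "megasas: moving cmd[%d]:%p:%d:%p"
    "on the defer queue as internal\n";

/* With the added trailing space the pointer conversion is cleanly
 * terminated before the word "on". */
static const char *fmt_after =
    "megasas: moving cmd[%d]:%p:%d:%p "
    "on the defer queue as internal\n";
```

Today's printf/printk prints the pointer and then the literal letters, but any extension that consumes characters after %p would change the meaning of the first form silently.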


Scsi Error handling query

2015-03-26 Thread Kashyap Desai
Hi Hannes,

I was going through one of the slide posted at below link.

http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.pdf

Slide #59 has the data below. I was trying to correlate it with the latest
upstream code, but I do not understand a few things. Does Linux block I/O to
the device and target before it actually starts legacy EH recovery? Also,
how does the Linux SCSI stack achieve a task set abort?

Proposed SCSI EH strategy
• Send command aborts after timeout
• EH Recovery starts:
‒ Block I/O to the device
   ‒ Issue 'Task Set Abort'
‒ Block I/O to the target
   ‒ Issue I_T Nexus Reset
   ‒ Complete outstanding command on success
‒ Engage current EH strategy
   ‒ LUN Reset, Target Reset etc


Thanks, Kashyap


RE: Scsi Error handling query

2015-03-26 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Thursday, March 26, 2015 9:28 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org
> Subject: Re: Scsi Error handling query
>
> On 03/26/2015 02:38 PM, Kashyap Desai wrote:
> > Hi Hannes,
> >
> > I was going through one of the slide posted at below link.
> >
> > http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.pd
> > f
> >
> > Slide #59 has below data. I was trying to correlate with latest
> > upstream code, but do not understand few things. Does Linux handle
> > blocking I/O to the device and target before it actually start legacy EH
> recovery ?
>
> Yes. This is handled by 'scsi_eh_scmd_add()', which adds the command to
> the
> internal 'eh_entry' list and starts recovery once all remaining
> outstanding
> commands are completed.

Thanks Hannes! scsi_eh_scmd_add() moves the shost state to recovery, so it
means blocking further I/O to the whole host, not just to the device/target
whose command timed out. Right?
My understanding is that the new SCSI error handling improvements will allow
I/O to the other devices attached to the host, blocking only the I/O that
belongs to the specific target.

Also, one more thing to clarify: the presentation uses the term "task set
aborts". Does this mean a task set abort is handled by traversing the
complete list of timed-out commands and sending an individual TASK ABORT for
each?
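As a mental model of the host-level gating being asked about here, the behavior can be sketched as below. This is a simplified illustration, not the real midlayer code; `sim_host` and the helper names are made up:

```c
#include <stdbool.h>

/* Simplified model: once a timed-out command is handed to the error
 * handler, the whole host enters recovery, so new I/O to any device
 * on that host is held off -- not just I/O to the failing target.
 * The EH thread itself only runs once every remaining outstanding
 * command has completed or been marked failed. */
enum host_state { HOST_RUNNING, HOST_RECOVERY };

struct sim_host {
    enum host_state state;
    int busy;     /* commands outstanding on the host */
    int failed;   /* commands queued for error handling */
};

static void sim_eh_scmd_add(struct sim_host *h)
{
    h->state = HOST_RECOVERY;   /* blocks further queuecommand() */
    h->failed++;
}

static bool sim_may_queue(const struct sim_host *h)
{
    return h->state == HOST_RUNNING;
}

static bool sim_eh_may_run(const struct sim_host *h)
{
    return h->state == HOST_RECOVERY && h->busy == h->failed;
}
```

This matches the answer above: recovery is host-wide in the legacy scheme, and the proposed improvement is precisely to narrow that blocking to the affected device/target.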

>
> > Also, how does linux scsi stack achieve task set abort ?
> >
> Currently we don't :-)
> The presentation was a roadmap about future EH updates.
>
> > Proposed SCSI EH strategy
> > • Send command aborts after timeout
> > • EH Recovery starts:
> > ‒ Block I/O to the device
> >‒ Issue 'Task Set Abort'
> > ‒ Block I/O to the target
> >‒ Issue I_T Nexus Reset
> >‒ Complete outstanding command on success ‒ Engage current EH
> > strategy
> >‒ LUN Reset, Target Reset etc
> >
> The current plans for EH updates are:
>
> - Convert eh_host_reset_handler() to take Scsi_Host as argument
>   - Convert EH host reset to do a host rescan after try_host_reset()
> succeeded
>   - Terminate failed scmds prior to calling try_host_reset()
>   => with that we should be able to instantiate a quick failover
>  when running under multipathing, as then I/Os will be returned
>  prior to the host reset (which is know to take quite a long
>  time)
>
> - Convert the remaining eh_XXX_reset_handler() to take the
>   appropriate structure as argument.
>   This will require some work, as some EH handler implementation
>   re-use the command tag (or even the actual command) for sending
>   TMFs.
>
> - Implementing a 'transport reset' EH function; to be called
>   after the current EH LUN Reset
>
> - Investigating the possibilty for an asynchronous 'task set abort',
>   and make the 'transport reset' EH function asynchronous, too.
>
> I've got a patchset for the first step, but the others still require some
> work ...
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  zSeries & Storage
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)


RE: Scsi Error handling query

2015-03-30 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Friday, March 27, 2015 9:32 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org
> Subject: Re: Scsi Error handling query
>
> On 03/26/2015 07:43 PM, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Hannes Reinecke [mailto:h...@suse.de]
> >> Sent: Thursday, March 26, 2015 9:28 PM
> >> To: Kashyap Desai; linux-scsi@vger.kernel.org
> >> Subject: Re: Scsi Error handling query
> >>
> >> On 03/26/2015 02:38 PM, Kashyap Desai wrote:
> >>> Hi Hannes,
> >>>
> >>> I was going through one of the slide posted at below link.
> >>>
> >>> http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.
> >>> pd
> >>> f
> >>>
> >>> Slide #59 has below data. I was trying to correlate with latest
> >>> upstream code, but do not understand few things. Does Linux handle
> >>> blocking I/O to the device and target before it actually start
> >>> legacy EH
> >> recovery ?
> >>
> >> Yes. This is handled by 'scsi_eh_scmd_add()', which adds the command
> >> to the internal 'eh_entry' list and starts recovery once all
> >> remaining outstanding commands are completed.
> >
> > Thanks Hannes..! Scsi_eh_scmd_add() move shost state to recovery, so
> > it means  blocking further IO to the Host and not really a limited to
> > Device/Target for which command was timed out. Right ?
> > I understood that, new improvement of scsi error handling will allow
> > IOs to the other Devices attached to the host except the IO belongs to
> > specific target.
> >
> > Also, one more thing to clarify... In presentation, term "task set
> > aborts"
> > was used. Does this mean task set abort is handled as traversing
> > complete list of timed out command and sending individual TASK ABORT ?
> >
> No. The idea was to send 'task set aborts' as a single TMF.

Thanks Hannes! OK, so the idea was to have a single TMF for "task set
abort". I am not sure how to frame my next question, but what if the Linux
SCSI layer traverses each I/O of one particular target and issues an
individual task abort for each? Don't we call that "task set aborts"? How
should an LLD interface with "task set abort" as a single TMF? My
understanding is that a "task set abort" will be converted internally into
individual task aborts, either by the SCSI layer or by the HBA FW.

> However, I'm not sure if I'll be going ahead with that one; once we've
> issued a
> 'transport reset the commands will be cone anyway.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  zSeries & Storage
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)


RE: Scsi Error handling query

2015-03-31 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Monday, March 30, 2015 8:43 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org
> Subject: Re: Scsi Error handling query
>
> On 03/30/2015 01:45 PM, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Hannes Reinecke [mailto:h...@suse.de]
> >> Sent: Friday, March 27, 2015 9:32 PM
> >> To: Kashyap Desai; linux-scsi@vger.kernel.org
> >> Subject: Re: Scsi Error handling query
> >>
> >> On 03/26/2015 07:43 PM, Kashyap Desai wrote:
> >>>> -Original Message-
> >>>> From: Hannes Reinecke [mailto:h...@suse.de]
> >>>> Sent: Thursday, March 26, 2015 9:28 PM
> >>>> To: Kashyap Desai; linux-scsi@vger.kernel.org
> >>>> Subject: Re: Scsi Error handling query
> >>>>
> >>>> On 03/26/2015 02:38 PM, Kashyap Desai wrote:
> >>>>> Hi Hannes,
> >>>>>
> >>>>> I was going through one of the slide posted at below link.
> >>>>>
> >>>>> http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.
> >>>>> pd
> >>>>> f
> >>>>>
> >>>>> Slide #59 has below data. I was trying to correlate with latest
> >>>>> upstream code, but do not understand few things. Does Linux handle
> >>>>> blocking I/O to the device and target before it actually start
> >>>>> legacy EH
> >>>> recovery ?
> >>>>
> >>>> Yes. This is handled by 'scsi_eh_scmd_add()', which adds the
> >>>> command to the internal 'eh_entry' list and starts recovery once
> >>>> all remaining outstanding commands are completed.
> >>>
> >>> Thanks Hannes..! Scsi_eh_scmd_add() move shost state to recovery, so
> >>> it means  blocking further IO to the Host and not really a limited
> >>> to Device/Target for which command was timed out. Right ?
> >>> I understood that, new improvement of scsi error handling will allow
> >>> IOs to the other Devices attached to the host except the IO belongs
> >>> to specific target.
> >>>
> >>> Also, one more thing to clarify... In presentation, term "task set
> >>> aborts"
> >>> was used. Does this mean task set abort is handled as traversing
> >>> complete list of timed out command and sending individual TASK ABORT ?
> >>>
> >> No. The idea was to send 'task set aborts' as a single TMF.
> >
> > Thanks Hannes.!  OK so idea was to have single TMF for "Task set abort."
> > I
> > am not sure how to frame my  next question.. But what if Linux SCSI
> > layer traverse each IO of one particular target and issue individual
> > Task abort?
> > Don’t we call that as "task set aborts" ?  How LLD Driver should
> > interface for "task set aborts" as single TMF ? My understanding is
> > "Task set
> abort"
> > will be internally converted to single Task abort either by SCSI layer
> > or HBA FW.
> >
> Why? There _is_ a 'task set abort' TMF defined in SAM.
> If the firmware doesn't implement it I'd thought the respective command to
> be
> failed?
Understood this part. It is the single "task set abort" TMF that is
addressed here.

Is there any harm in doing the following in the low-level driver for
LUN/target reset?

Once the LLD enters the eh_abort_handler callback, it issues a "Task Set
Abort" as a single TMF (most likely converting the single "task set abort"
into a FW-specific format), waits for completion, and checks whether things
are really resolved. If not, it issues a target reset (from the FW), which
will be a SAS link hard reset.

This way the driver can abort all commands related to the associated I_T
nexus. I initially thought the Linux SCSI error handling actually does this,
because it traverses all timed-out commands from the error handling thread
(which is almost equivalent to a single "task set abort" TMF).
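The escalation proposed in this message (one task-set-abort TMF for the nexus, then a target reset if that does not resolve things) can be sketched as a small ladder. The callbacks stand in for FW-specific TMF implementations and are hypothetical, not megaraid_sas functions:

```c
#include <stdbool.h>

/* Illustrative escalation ladder: first try a single TASK SET ABORT
 * TMF for the I_T nexus; if that does not resolve the situation,
 * escalate to a target reset (SAS link hard reset).  Both callbacks
 * return true on success. */
enum eh_result { EH_RESOLVED, EH_ESCALATED_TARGET_RESET, EH_FAILED };

static enum eh_result recover_nexus(bool (*task_set_abort)(void),
                                    bool (*target_reset)(void))
{
    if (task_set_abort())       /* one TMF covers the whole task set */
        return EH_RESOLVED;
    if (target_reset())         /* escalate: SAS link hard reset */
        return EH_ESCALATED_TARGET_RESET;
    return EH_FAILED;
}

/* Trivial TMF stubs for demonstration. */
static bool tmf_ok(void)   { return true;  }
static bool tmf_fail(void) { return false; }
```

Whether an LLD may legitimately do this inside eh_abort_handler (which the midlayer calls per command) rather than in the device/target reset callbacks is exactly the open question in the thread.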

>
> However, at this point I'm not sure if 'task set abort' is actually
> required; it
> _should_ be superseded by the new 'transport reset' EH handler.
> On the FC side this will translate into a relogin, which will
> automatically abort all
> outstanding tasks.
> SAS even has a dedicated TMF IT NEXUS LOSS, which looks like it could be
> used
> here.

Not sure if my above comment/query is along the same lines.

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  zSeries & Storage
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] megaraid:Make the function megasas_alloc_cmd_fusion static

2015-12-17 Thread Kashyap Desai
> -int
> +static int
>  megasas_alloc_cmds_fusion(struct megasas_instance *instance)  {
>   int i, j, count;

Good catch. Let's not include this patch, to avoid further confusion with a
series of acked patches that make no significant functional difference.
NACK, as it makes no functional difference. We can make many more such
changes in one shot; let's address those later, as we have a few patches
ready to send in a few days.


> --
> 2.5.0
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] megaraid:Make various functions static in megaraid_sas_fusion.c

2015-12-17 Thread Kashyap Desai
> -Original Message-
> From: Nicholas Krause [mailto:xerofo...@gmail.com]
> Sent: Monday, July 06, 2015 10:06 PM
> To: kashyap.de...@avagotech.com
> Cc: sumit.sax...@avagotech.com; uday.ling...@avagotech.com;
> jbottom...@odin.com; megaraidlinux@avagotech.com; linux-
> s...@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: [PATCH] megaraid:Make various functions static in
> megaraid_sas_fusion.c
>
> This makes various functions that have no external callers outside of
their
> definition/declaration in the file megaraid_sas_fusion.c to be properly
> declared as static functions now.
>
> Signed-off-by: Nicholas Krause 

Nicholas -

You posted a patch with the subject line "megaraid:Make the function
megasas_alloc_cmd_fusion static" for similar code changes.
Let's have a consolidated patch for this, and Avago can post it with the
next change set (to keep the git commits clean, without any failures).

Just to be clear: you can send us the patch and we can add it with your
name in the Submitted-by signature. For now, NACK.

~ Kashyap
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: megaraid_sas: add an i/o barrier

2016-01-31 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Friday, January 29, 2016 11:05 PM
> To: 'linux-scsi@vger.kernel.org'
> Cc: sumit.sax...@avagotech.com; Desai, Kashyap; Uday Lingala
> Subject: megaraid_sas: add an i/o barrier
>
> A barrier should be added to ensure proper ordering of memory mapped
> writes.
>
> Signed-off-by: Tomas Henzl 
> ---
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index d9d0029fb1..98a848bdfd 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -204,6 +204,7 @@ megasas_fire_cmd_fusion(struct
> megasas_instance *instance,
>   &instance->reg_set->inbound_low_queue_port);
>   writel(le32_to_cpu(req_desc->u.high),
>   &instance->reg_set->inbound_high_queue_port);
> + mmiowb();
>   spin_unlock_irqrestore(&instance->hba_lock, flags);  #endif  }

Tomas-

We may need similar changes around the functions below as well, because
there is no associated readX() or mmiowb() call:
megasas_fire_cmd_xscale()
megasas_fire_cmd_ppc()
megasas_fire_cmd_skinny()
megasas_fire_cmd_gen2()

Also, the writeq() path in the same function, megasas_fire_cmd_fusion(),
needs an I/O barrier.
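The pattern under discussion, with mmiowb() placed between the MMIO writes
and the unlock, has roughly this shape (a userspace model with
writel()/mmiowb() stubbed out so it can run anywhere; the real code uses
the kernel primitives under hba_lock):

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the pattern discussed above -- NOT kernel code.
 * On some architectures, MMIO writes issued by different CPUs may reach
 * the device out of order unless mmiowb() is issued before releasing the
 * spinlock that protects the MMIO sequence (see memory-barriers.txt). */

static uint32_t mmio_low, mmio_high;   /* stand-ins for the queue ports */
static int barrier_count;

/* Stub: in the kernel this is a posted MMIO store. */
static void writel(uint32_t val, uint32_t *reg)
{
	*reg = val;
}

/* Stub: in the kernel, mmiowb() orders MMIO writes before the unlock. */
static void mmiowb(void)
{
	barrier_count++;
}

/* Mirrors the shape of megasas_fire_cmd_fusion() after the patch:
 * two 32-bit port writes followed by mmiowb() before dropping the lock. */
static void fire_cmd(uint32_t low, uint32_t high)
{
	/* spin_lock_irqsave(&instance->hba_lock, flags); */
	writel(low, &mmio_low);
	writel(high, &mmio_high);
	mmiowb();  /* make both writes visible before the unlock */
	/* spin_unlock_irqrestore(&instance->hba_lock, flags); */
}
```

The key point is that the barrier sits inside the critical section, so a
second CPU taking the lock cannot interleave its queue-port writes with
the first CPU's pair.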

> --
> 2.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: megaraid_sas: add an i/o barrier

2016-02-01 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Monday, February 01, 2016 6:23 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org
> Cc: Sumit Saxena; Uday Lingala
> Subject: Re: megaraid_sas: add an i/o barrier
>
> On 1.2.2016 05:45, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Tomas Henzl [mailto:the...@redhat.com]
> >> Sent: Friday, January 29, 2016 11:05 PM
> >> To: 'linux-scsi@vger.kernel.org'
> >> Cc: sumit.sax...@avagotech.com; Desai, Kashyap; Uday Lingala
> >> Subject: megaraid_sas: add an i/o barrier
> >>
> >> A barrier should be added to ensure proper ordering of memory mapped
> >> writes.
> >>
> >> Signed-off-by: Tomas Henzl 
> >> ---
> >>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >> index d9d0029fb1..98a848bdfd 100644
> >> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >> @@ -204,6 +204,7 @@ megasas_fire_cmd_fusion(struct
> megasas_instance
> >> *instance,
> >>&instance->reg_set->inbound_low_queue_port);
> >>writel(le32_to_cpu(req_desc->u.high),
> >>&instance->reg_set->inbound_high_queue_port);
> >> +  mmiowb();
> >>spin_unlock_irqrestore(&instance->hba_lock, flags);  #endif  }
> > Tomas-
> >
> > We may need similar changes around below Functions as well, because
> > there is no associated readX or mmiowb() call.
> > megasas_fire_cmd_xscale()
> > megasas_fire_cmd_ppc()
> > megasas_fire_cmd_skinny()
> > megasas_fire_cmd_gen2()
> >
> > Also,  wrireq() routine in same function megasas_fire_cmd_fusion()
> > need i/o barrier.
>
> I don't think so (with the exception of megasas_fire_cmd_skinny - I missed
> this one).
> When two threads try to use a fire_cmd there is no protection of certain
> ordering, that had to be done in a caller of fire_cmd (for example in
> megasas_build_and_issue_cmd_fusion)
> and it seems to me that there is nothing like that. Likely is, that this -
> a
> strict ordering of commands - is not needed.
> The protection which I'm adding is needed when a command consist of a
> sequence of more than one write, see memory-barriers.txt.
>
> (It looks to me that
> megasas_fire_cmd_ -xscale -ppc -skiny -gen2 do not need the hba_lock
> unless there is another i/o sequence protected with the same lock (note
> also that there is no such lock in megasas_fire_cmd_fusion).)
Agreed, we don't need the hba_lock in the older fire_cmd wrappers. We
updated the fusion code because fusion adapters were in active shipment
with an active product life cycle, so those code changes were tested by
Avago. We did not clean up the code for the MFI-based controllers since
they are only in a critical-support cycle.

>
>
> I'll add the mmiowb to megasas_fire_cmd_skinny and send a new patch -
> agreed?

Understood; you are right. Yes, the additional change in
megasas_fire_cmd_skinny along with the current patch will be good to go.

>
> --tms
>
> >
> >> --
> >> 2.4.3
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v2] megaraid_sas: add an i/o barrier

2016-02-10 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, February 10, 2016 10:55 PM
> To: Tomas Henzl
> Cc: 'linux-scsi@vger.kernel.org'; sumit.sax...@avagotech.com; Desai,
> Kashyap; Uday Lingala; sta...@vger.kernel.org
> Subject: Re: [PATCH v2] megaraid_sas: add an i/o barrier
>
> >>>>> "Tomas" == Tomas Henzl  writes:
>
> Tomas> A barrier should be added to ensure proper ordering of memory
> Tomas> mapped writes.
>
> Tomas> V2: - added the barrier also to megasas_fire_cmd_skinny, as
> Tomas> suggested by Kashyap Desai

Reviewed-by: Kashyap Desai 
Acked-by: Kashyap Desai 

>
> Sumit, Kashyap: Please review this patch.
>
> --
> Martin K. PetersenOracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/3] megaraid_sas: Convert dev_printk to dev_

2015-10-27 Thread Kashyap Desai
> > +   dev_dbg(&instance->pdev->dev, "Error copying out
> > cmd_status\n");
> > error = -EFAULT;
> > }
> >
>
> Reviewed-by: Johannes Thumshirn 

We will consider all three patches for a future submission. As of now we
have two patch sets pending to be committed.
We are working on a few more megaraid_sas patches that will clean up the
driver module and add proper error handling for DCMD command timeouts. That
will cover the patch posted with the subject below:

[PATCH 3/3] megaraid_sas: return -ENOMEM when create DMA pool for cmd frames
failed

James - we will be resending these patch sets on top of the latest
outstanding megaraid_sas driver patches, so that we can avoid any conflicts
in the commits.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 02/12] megaraid_sas : Code optimization- remove PCI Ids based checks

2015-10-28 Thread Kashyap Desai
>
> > -Original Message-
> > From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> > Sent: Wednesday, October 28, 2015 8:03 AM
> > To: sumit.sax...@avagotech.com
> > Cc: linux-scsi@vger.kernel.org; the...@redhat.com;
> > martin.peter...@oracle.com; h...@infradead.org;
> > jbottom...@parallels.com; kashyap.de...@avagotech.com;
> > kiran-kumar.kast...@avagotech.com;
> > uday.ling...@avagotech.com
> > Subject: Re: [PATCH 02/12] megaraid_sas : Code optimization- remove
> > PCI
> Ids
> > based checks
> >
> > > "Sumit" == sumit saxena  writes:
> >
> > Sumit> Code optimization: remove PCI id based checks and instead of
> > Sumit> those use instance->ctrl_context to make call whether
> > Sumit> controller is MFI based OR fusion adapter.  fusion adapters
> > Sumit> further are also divided in two categories- 1)THUNDERBOLT
> > Sumit> SERIES and 2)INVADER SERIES.
> >
> > Does not apply against current upstream. Please fix and repost.
>
> Martin, This patch set is not based on current upstream but rebased on top
> of few patches sent by me prior to this patch series. Patches sent prior
> to
> this patch series are not yet accepted.

Martin -

Here is the last outstanding patch series:
http://marc.info/?l=linux-scsi&m=144102204225400&w=2

The last patch series was missed because of a small rebase issue. James
suggested not ACKing anything that is not critical, to avoid such issues,
and probably piggybacking those patches.
Both the last series and the latest series have Reviewed-by tags. Can we
have the outstanding patches committed in the next window?

>
> Thanks,
> Sumit
> >
> > --
> > Martin K. Petersen  Oracle Linux Engineering
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [patch 2/2] megaraid_sas: missing curly braces in megasas_detach_one()

2015-11-10 Thread Kashyap Desai
> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Saturday, November 07, 2015 9:47 PM
> To: Kashyap Desai; Sumit Saxena
> Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com;
> linux-scsi@vger.kernel.org; kernel-janit...@vger.kernel.org
> Subject: [patch 2/2] megaraid_sas: missing curly braces in
> megasas_detach_one()
>
> The indenting indicates that there are supposed to be some curly braces
> here.  Presumably it means we free something unintentionally leading to
a
> use after free.

NACK. The indentation is wrong; we don't need to add braces. We will fix
the indentation and resend the patch. Since this is not critical, let's
ignore it for now.
>
> Fixes: 3761cb4cf65e ('megaraid_sas: JBOD sequence number support')
> Signed-off-by: Dan Carpenter 
> ---
> Not tested.
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 829e9e9..91e200d 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5932,7 +5932,7 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
> fusion->max_map_sz,
> fusion->ld_map[i],
> fusion->ld_map_phys[i]);
> - if (fusion->ld_drv_map[i])
> + if (fusion->ld_drv_map[i]) {
>   free_pages((ulong)fusion->ld_drv_map[i],
>   fusion->drv_map_pages);
>   if (fusion->pd_seq_sync)
> @@ -5940,6 +5940,7 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
>   pd_seq_map_sz,
>   fusion->pd_seq_sync[i],
>   fusion->pd_seq_phys[i]);
> + }
>   }
>   free_pages((ulong)instance->ctrl_context,
>   instance->ctrl_context_pages);
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [patch 1/2] megaraid_sas: remove a stray tab

2015-11-10 Thread Kashyap Desai
> -Original Message-
> From: Dan Carpenter [mailto:dan.carpen...@oracle.com]
> Sent: Saturday, November 07, 2015 9:46 PM
> To: Kashyap Desai; Sumit Saxena
> Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com;
> linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; kernel-
> janit...@vger.kernel.org
> Subject: [patch 1/2] megaraid_sas: remove a stray tab
>
> One line was indented more than the others.
>
> Signed-off-by: Dan Carpenter 

Let us combine your next patch (because that is also an indentation issue)
with this one; we will resubmit the patch. For now, please ignore it, to
avoid any issues for the pending patches to be committed.

>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index cc95372..829e9e9 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5923,9 +5923,9 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
>
>   if (instance->ctrl_context) {
>   megasas_release_fusion(instance);
> - pd_seq_map_sz = sizeof(struct
> MR_PD_CFG_SEQ_NUM_SYNC) +
> + pd_seq_map_sz = sizeof(struct
> MR_PD_CFG_SEQ_NUM_SYNC) +
>   (sizeof(struct MR_PD_CFG_SEQ) *
> - (MAX_PHYSICAL_DEVICES - 1));
> + (MAX_PHYSICAL_DEVICES - 1));
>   for (i = 0; i < 2 ; i++) {
>   if (fusion->ld_map[i])
>   dma_free_coherent(&instance->pdev->dev,
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN Raid Action request on SAS2 HBA's

2015-11-11 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, November 11, 2015 6:27 PM
> To: Sreekanth Reddy; j...@kernel.org
> Cc: martin.peter...@oracle.com; linux-scsi@vger.kernel.org;
> jbottom...@parallels.com; sathya.prak...@avagotech.com;
> kashyap.de...@avagotech.com; linux-ker...@vger.kernel.org;
> h...@infradead.org; chaitra.basa...@avagotech.com; suganath-
> prabu.subram...@avagotech.com
> Subject: Re: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN
> Raid Action request on SAS2 HBA's
>
> On 11/11/2015 01:00 PM, Sreekanth Reddy wrote:
> > From: Sreekanth Reddy 
> >
> > Don't send PHYDISK_HIDDEN Raid Action request for SAS2 HBA's.
> > Since these HBA's doesn't support this Raid Action.
> >
> > Also enable fast_path only for SAS3 HBA's.
> >
> > Signed-off-by: Sreekanth Reddy 
> > ---
> >  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > index a638920..80469d0 100644
> > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> > @@ -1165,8 +1165,10 @@ scsih_target_alloc(struct scsi_target *starget)
> > if (test_bit(sas_device->handle, ioc->pd_handles))
> > sas_target_priv_data->flags |=
> > MPT_TARGET_FLAGS_RAID_COMPONENT;
> > +#ifndef SCSI_MPT2SAS
> > if (sas_device->fast_path)
> > sas_target_priv_data->flags |=
> MPT_TARGET_FASTPATH_IO;
> > +#endif
> > }
> > spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> >
> > @@ -3719,11 +3721,13 @@ scsih_qcmd(struct Scsi_Host *shost, struct
> scsi_cmnd *scmd)
> > ioc->build_zero_len_sge(ioc, &mpi_request->SGL);
> >
> > if (likely(mpi_request->Function ==
> MPI2_FUNCTION_SCSI_IO_REQUEST))
> > {
> > +#ifndef SCSI_MPT2SAS
> > if (sas_target_priv_data->flags &
> MPT_TARGET_FASTPATH_IO) {
> > mpi_request->IoFlags = cpu_to_le16(scmd-
> >cmd_len |
> > MPI25_SCSIIO_IOFLAGS_FAST_PATH);
> > mpt3sas_base_put_smid_fast_path(ioc, smid,
> handle);
> > } else
> > +#endif
> > mpt3sas_base_put_smid_scsi_io(ioc, smid, handle);
> > } else
> > mpt3sas_base_put_smid_default(ioc, smid); @@ -5031,8
> +5035,10 @@
> > _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8
> phy_num,
> > sas_device->device_info = device_info;
> > sas_device->sas_address = sas_address;
> > sas_device->phy = sas_device_pg0.PhyNum;
> > +#ifndef SCSI_MPT2SAS
> > sas_device->fast_path = (le16_to_cpu(sas_device_pg0.Flags) &
> > MPI25_SAS_DEVICE0_FLAGS_FAST_PATH_CAPABLE) ? 1 : 0;
> > +#endif
> >
> > if (sas_device_pg0.Flags &
> MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) {
> > sas_device->enclosure_level =
> > @@ -5731,6 +5737,7 @@ _scsih_sas_discovery_event(struct
> MPT3SAS_ADAPTER *ioc,
> > }
> >  }
> >
> > +#ifndef SCSI_MPT2SAS
> >  /**
> >   * _scsih_ir_fastpath - turn on fastpath for IR physdisk
> >   * @ioc: per adapter object
> > @@ -5750,7 +5757,6 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER
> *ioc, u16 handle, u8 phys_disk_num)
> > u16 ioc_status;
> > u32 log_info;
> >
> > -
> > mutex_lock(&ioc->scsih_cmds.mutex);
> >
> > if (ioc->scsih_cmds.status != MPT3_CMD_NOT_USED) { @@ -
> 5825,6
> > +5831,8 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER *ioc, u16
> handle, u8 phys_disk_num)
> > FORCE_BIG_HAMMER);
> > return rc;
> >  }
> > +/* End of not defined SCSI_MPT2SAS */ #endif
> >
> >  /**
> >   * _scsih_reprobe_lun - reprobing lun @@ -6017,8 +6025,10 @@
> > _scsih_sas_pd_hide(struct MPT3SAS_ADAPTER *ioc,
> > if (!sas_device)
> > return;
> >
> > +#ifndef SCSI_MPT2SAS
> > /* hiding raid component */
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > if (starget)
> > starget_for_each_device(starget, (void *)1,
> _scsih_reprobe_lun);  }
> > @@ -6067,7 +6077,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER
> *ioc,
> > sas_device = _scsih_sas_device_find_by_handle(ioc, handle);
> > spin_unlock_irqrestore(&ioc->sas_device_lock, flags);
> > if (sas_device) {
> > +#ifndef SCSI_MPT2SAS
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > return;
> > }
> >
> > @@ -6091,7 +6103,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER
> *ioc,
> > mpt3sas_transport_update_links(ioc, sas_address, handle,
> > sas_device_pg0.PhyNum,
> MPI2_SAS_NEG_LINK_RATE_1_5);
> >
> > +#ifndef SCSI_MPT2SAS
> > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum);
> > +#endif
> > _scsih_add_device(ioc, handle, 0, 1);  }
> >
> > @@ -6202,13 +6216,14 @@ _scsih_sas_ir_config_change_event(struct
> > MPT3SAS_ADAPTER *ioc,
> >
> >

RE: How to get more sequential IO merged at elevator

2014-05-07 Thread Kashyap Desai
> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Tuesday, May 06, 2014 7:47 PM
> To: Desai, Kashyap
> Cc: linux-scsi@vger.kernel.org
> Subject: Re: How to get more sequential IO merged at elevator
>
> On 05/06/2014 04:06 AM, Desai, Kashyap wrote:
> > I got some clue on what was going on while doing 4K sequential read
using
> fio.
> >
> > If I use ioengine in fio script as "libaio", I see " do_io_submit"
call
> plug/unplug the queue before submitting the IO.
> > It means after every IO, we expect to send IO immediately to the next
> layer. If at all there are any pending IO do merge.. but not due to
plugging.
> >
> > This is what happens on my test. Every time IO comes from application
> > with libaio engine, it send down to the elevator/io-scheduler because
> queue was unplugged in do_io_submit(). Moment I reduce the queue depth
> of the block device, merge start because of congestion at scsi mid
layer.
> >
> > If I use, mmap engine, I see merged IO coming to the device driver
> > because of plugging. I really don't know how it works, but gave a try
> > and found merge happen because of plugging. ( I confirm using
> > blktrace)
> >
> > Is there any ioengine in  (or any other parameter setting), which
can
> use plugging mechanism of block layer to merge more IO other than mmap ?
>
> O_DIRECT IO is sync by nature, which is why it is sent off immediately
instead
> of held for potentially merging. mmap is not. You should be able to
provoke
> merging by submitting more than 1 IO at the time. See the iodepth_batch
> settings for fio.

Thanks Jens. I got your point about O_DIRECT IO. When I read the  page of
 it mentioned that the default iodepth_batch and iodepth_low will be the
same as the iodepth value, but in my case I had to provide those parameters
explicitly. It looks like the default value for iodepth_batch/low is 1.

Once I provided the fio parameters "iodepth_batch=32" and "iodepth_low=32",
I see the IOs batched before plug and after unplug from the application.
I am now able to get maximum merged IO. Thanks for helping me with this.
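For reference, a minimal fio job along these lines (a sketch only:
/dev/sdX is a placeholder device, and the parameter names follow fio's
documentation):

```ini
; 4K sequential read with batched submission, so the block layer
; plug/unplug path sees more than one IO at a time and can merge.
[seq-read-batched]
filename=/dev/sdX        ; placeholder -- point at a real block device
rw=read
bs=4k
direct=1
ioengine=libaio
iodepth=32
iodepth_batch=32         ; submit IOs in batches of 32
iodepth_low=32           ; start refilling as soon as depth drops below 32
```

With iodepth_batch left at its default, each IO is submitted (and the
queue unplugged) individually, which is why no merging was seen.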

~ Kashyap

>
> --
> Jens Axboe
>
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-17 Thread Kashyap Desai
V1 -> V2
Added fix in __blk_mq_finish_request around blk_mq_put_tag() for
non-internal tags

Problem statement:
Whenever a driver tries to get an outstanding request via
scsi_host_find_tag(), the block layer may return stale entries instead of
the actual outstanding request. The kernel panics if the stale entry is
inaccessible or its memory has been reused.
Fix:
Undo the request mapping in blk_mq_put_driver_tag() once the request is
returned.

More detail:
Whenever an SDEV entry is created, the block layer allocates separate tags
and static requests. Those requests are not valid after the SDEV is
deleted from the system. On the fly, the block layer maps static rqs to
rqs as below, from blk_mq_get_driver_tag():

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping tracks active in-use requests, and it is the same
mapping that is referenced in scsi_host_find_tag(). After running some
IOs, "data.hctx->tags->rqs[rq->tag]" will have some entries which are
never reset in the block layer.

There will be a kernel panic if a request pointed to by
"data.hctx->tags->rqs[rq->tag]" belongs to an sdev which has been removed,
since as part of that removal all the request memory associated with that
sdev might be reused or become inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops:  [#1] SMP
Call Trace:
 [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
 [] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: 
Signed-off-by: Kashyap Desai 
Signed-off-by: Sreekanth Reddy 

---
 block/blk-mq.c | 4 +++-
 block/blk-mq.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..88d1e92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
 const int sched_tag = rq->internal_tag;

 blk_pm_mark_last_busy(rq);
-if (rq->tag != -1)
+if (rq->tag != -1) {
+hctx->tags->rqs[rq->tag] = NULL;
 blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+}
 if (sched_tag != -1)
 blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 blk_mq_sched_restart(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool
blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
struct request *rq)
 {
+hctx->tags->rqs[rq->tag] = NULL;
 blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
 rq->tag = -1;

-- 
1.8.3.1


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> >
> > Other block drivers (e.g. ib_srp, skd) do not need this to work
> > reliably.
> > It has been explained to you that the bug that you reported can be
> > fixed by modifying the mpt3sas driver. So why to fix this by modifying
> > the block layer? Additionally, what prevents that a race condition
> > occurs between the block layer clearing hctx->tags->rqs[rq->tag] and
> > scsi_host_find_tag() reading that same array element? I'm afraid that
> > this is an attempt to paper over a real problem instead of fixing the
> > root
> cause.
>
> I have to agree with Bart here, I just don't see how the mpt3sas use case
> is
> special. The change will paper over the issue in any case.

Hi Jens, Bart

One of the key requirements for iterating the whole tagset using
scsi_host_find_tag() is to block the SCSI host. Once we have done that, we
should be good; no race condition is possible if that part is taken care
of. Without this patch, the driver can still receive a SCSI command from
hctx->tags->rqs which is not really outstanding. I am finding this is a
common issue for many SCSI low-level drivers.

Just for example  - fnic_is_abts_pending() function has below code -

for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
sc = scsi_host_find_tag(fnic->lport->host, tag);
/*
 * ignore this lun reset cmd or cmds that do not belong to
 * this lun
 */
if (!sc || (lr_sc && (sc->device != lun_dev || sc ==
lr_sc)))
continue;

The above code has a similar exposure to kernel panic as the  driver
while accessing sc->device.

The panic is more likely if SCSI devices are added/removed before looping
through scsi_host_find_tag().

Avoiding block layer changes was also attempted in , but our problem
is making that code common for non-mq and mq.
Temporarily, to unblock this issue, we have fixed  using the
driver-internal scsiio_tracker() instead of piggybacking on scsi_command.

Kashyap

>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
>
> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>
> >>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>> reliably.
> >>> It has been explained to you that the bug that you reported can be
> >>> fixed by modifying the mpt3sas driver. So why to fix this by
> >>> modifying the block layer? Additionally, what prevents that a race
> >>> condition occurs between the block layer clearing
> >>> hctx->tags->rqs[rq->tag] and
> >>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>> that this is an attempt to paper over a real problem instead of
> >>> fixing the root
> >> cause.
> >>
> >> I have to agree with Bart here, I just don't see how the mpt3sas use
> >> case is special. The change will paper over the issue in any case.
> >
> > Hi Jens, Bart
> >
> > One of the key requirement for iterating whole tagset  using
> > scsi_host_find_tag is to block scsi host. Once we are done that, we
> > should be good. No race condition is possible if that part is taken
> > care.
> > Without this patch, if driver still receive scsi command from the
> > hctx->tags->rqs which is really not outstanding.  I am finding this is
> > common issue for many scsi low level drivers.
> >
> > Just for example  - fnic_is_abts_pending() function has below
> > code -
> >
> > for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> > sc = scsi_host_find_tag(fnic->lport->host, tag);
> > /*
> >  * ignore this lun reset cmd or cmds that do not belong
> > to
> >  * this lun
> >  */
> > if (!sc || (lr_sc && (sc->device != lun_dev || sc ==
> > lr_sc)))
> > continue;
> >
> > Above code also has similar exposure of kernel panic like 
> > driver while accessing sc->device.
> >
> > Panic is more obvious if we have add/removal of scsi device before
> > looping through scsi_host_find_tag().
> >
> > Avoiding block layer changes is also attempted in  but our
> > problem is to convert that code common for non-mq and mq.
> > Temporary to unblock this issue, We have fixed  using driver
> > internals scsiio_tracker() instead of piggy back in scsi_command.
>
> For mq, the requests never go out of scope, they are always valid. So the
> key
> question here is WHY they have been freed. If the queue gets killed, then
> one
> potential solution would be to clear pointers in the tag map belonging to
> that
> queue. That also takes it out of the hot path.

At driver load whenever driver does scsi_add_host_with_dma(), it follows
below code path in block layer.

scsi_mq_setup_tags
  ->blk_mq_alloc_tag_set
  -> blk_mq_alloc_rq_maps
 -> __blk_mq_alloc_rq_maps

SML create two set of request pool. One is per HBA and other is per SDEV. I
was confused why SML creates request pool per HBA.

Example: if the HBA queue depth is 1K and there are 8 devices behind that
HBA, the total request pool created is 1K + 8 * the scsi_device queue
depth. The 1K pool is always static, but the other request pools are
managed whenever a scsi device is added/removed.

I never observed the requests allocated per HBA being used in the IO path;
it is always the requests allocated per scsi device that are active.
Also, what I observed is that whenever a scsi_device is deleted, the
associated requests are also deleted. What is missing is: "Deleted
requests are still available in hctx->tags->rqs[rq->tag]."

If there is an assurance that all requests remain valid as long as the
hctx is available, this patch is not correct. I posted the patch based on
the assumption that requests per hctx can be removed whenever a scsi
device is removed.

Kashyap

>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> On 12/18/18 10:48 AM, Kashyap Desai wrote:
> >>
> >> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>>>
> >>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>>>> reliably.
> >>>>> It has been explained to you that the bug that you reported can be
> >>>>> fixed by modifying the mpt3sas driver. So why to fix this by
> >>>>> modifying the block layer? Additionally, what prevents that a race
> >>>>> condition occurs between the block layer clearing
> >>>>> hctx->tags->rqs[rq->tag] and
> >>>>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>>>> that this is an attempt to paper over a real problem instead of
> >>>>> fixing the root
> >>>> cause.
> >>>>
> >>>> I have to agree with Bart here, I just don't see how the mpt3sas
> >>>> use case is special. The change will paper over the issue in any
> >>>> case.
> >>>
> >>> Hi Jens, Bart
> >>>
> >>> One of the key requirements for iterating the whole tagset using
> >>> scsi_host_find_tag is to block the scsi host. Once we have done that,
> >>> we should be good; no race condition is possible if that part is
> >>> taken care of.
> >>> Without this patch, the driver may still receive a scsi command from
> >>> hctx->tags->rqs which is really not outstanding. I am finding this is
> >>> a common issue for many scsi low-level drivers.
> >>>
> >>> Just for example  - fnic_is_abts_pending() function has below
> >>> code -
> >>>
> >>> for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> >>> sc = scsi_host_find_tag(fnic->lport->host, tag);
> >>> /*
> >>>  * ignore this lun reset cmd or cmds that do not belong to
> >>>  * this lun
> >>>  */
> >>> if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
> >>> continue;
> >>>
> >>> The above code has a similar exposure to kernel panic as the  driver
> >>> while accessing sc->device.
> >>>
> >>> The panic is more obvious if scsi devices are added/removed before
> >>> looping through scsi_host_find_tag().
> >>>
> >>> Avoiding block layer changes was also attempted in , but our
> >>> problem is making that code common for non-mq and mq.
> >>> Temporarily, to unblock this issue, we have fixed  using the
> >>> driver-internal scsiio_tracker() instead of piggybacking on the
> >>> scsi_command.
> >>
> >> For mq, the requests never go out of scope, they are always valid. So
> >> the key question here is WHY they have been freed. If the queue gets
> >> killed, then one potential solution would be to clear pointers in the
> >> tag map belonging to that queue. That also takes it out of the hot
> >> path.
> >
> > At driver load, whenever the driver calls scsi_add_host_with_dma(),
> > it follows the code path below in the block layer.
> >
> > scsi_mq_setup_tags
> >   ->blk_mq_alloc_tag_set
> >   -> blk_mq_alloc_rq_maps
> >  -> __blk_mq_alloc_rq_maps
> >
> > SML creates two sets of request pools: one per HBA and the other per
> > SDEV. I was confused about why SML creates a request pool per HBA.
> >
> > Example: if the HBA queue depth is 1K and there are 8 devices behind
> > that HBA, the total request pool created is 1K + 8 * the scsi_device
> > queue depth. The 1K pool is always static, but the other request
> > pools are managed whenever a scsi device is added/removed.
> >
> > I never observed the requests allocated per HBA being used in the IO
> > path; it is always the requests allocated per scsi device that are
> > active.
> > Also, what I observed is that whenever a scsi_device is deleted, the
> > associated requests are also deleted. What is missing is: "Deleted
> > requests are still available in hctx->tags->rqs[rq->tag]."
>
> So that sounds like the issue. If the device is deleted and its requests
> go away, those pointers should be cleared. That's what your patch should
> do, not do it for each IO.

At the time of device removal, this requires reverse traversal: find out
whether each request associated with the sdev is part of
hctx->tags->rqs[] and clear that entry.
I am not sure about atomic traversal if more than one device removal is
happening in parallel; it may be more error-prone.

Just wondering: either way, we will be removing the invalid request from
the array. Are you suspecting a performance issue if we do it per IO?

Kashyap

>
>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> >
> > At the time of device removal, this requires reverse traversal: find
> > out whether each request associated with the sdev is part of
> > hctx->tags->rqs[] and clear that entry.
> > I am not sure about atomic traversal if more than one device removal
> > is happening in parallel; it may be more error-prone.
> >
> > Just wondering: either way, we will be removing the invalid request
> > from the array.
> > Are you suspecting a performance issue if we do it per IO?
>
> It's an extra store, and it's a store to an area that's now shared
> between issue and completion. Those are never a good idea. Besides, it's
> the kind of issue you solve in the SLOW path, not in the fast path.
> Since that's doable, it would be silly to do it for every IO.
>
> This might not matter on mpt3sas, but on more efficient hw it definitely
> will.

Understood. Your primary concern is to avoid doing it per IO, and to do
so only if there is no better way.

> I'm still trying to convince myself that this issue even exists. I can
> see having stale entries, but those should never be busy. Why are you
> finding them with the tag iteration? It must be because the tag is
> reused, and you are finding it before it's re-assigned?


Stale entries remain forever once we remove scsi devices; it is not a
timing issue. If the memory associated with a request (freed due to
device removal) is reused, a kernel panic occurs.
We have 24 drives behind an expander and perform an expander reset, which
removes all 24 drives and adds them back. Addition and removal of all the
drives happens quickly.
As part of the expander reset, the  driver processes a broadcast
primitive event, and that requires finding all outstanding scsi commands.

In some cases we need a firmware restart, and that path also requires tag
iteration.


>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
>
> I actually took a look at scsi_host_find_tag() - what I think needs
> fixing here is that it should not return a tag that isn't allocated.
> You're just looking up random stuff, that is a recipe for disaster.
> But even with that, there's no guarantee that the tag isn't going away.

Got your point. Let us fix it in the  driver.

>
> The mpt3sas use case is crap. It's iterating every tag, just in case it
> needs to do something to it.

Many drivers in the scsi layer have similar trouble; maybe they are just
less exposed. That was the main reason I thought to provide a common fix
in the block layer.

>
> My suggestion would be to scrap that bad implementation and have
> something available for iterating busy tags instead. That'd be more
> appropriate and a lot more efficient than a random loop from 0..depth.
> If you are flushing running commands, looking up tags that aren't even
> active is silly and counterproductive.

We will address this issue through  driver changes in two steps:
1. I can use the  driver's internal memory and not rely on the
request/scsi command. The tag 0..depth loop is not in the main IO path,
so what we need is contention-free access to the list. Having the
driver's internal memory and array will provide that control.
2. As you suggested, the best way is to use busy tag iteration (only for
the blk-mq stack).


Thanks for your feedback.

Kashyap

>
> --
> Jens Axboe


RE: [PATCH] megaraid_sas: enable blk-mq for fusion

2019-01-11 Thread Kashyap Desai
> Fusion adapters can steer completions to individual queues, so we can
> enable blk-mq for those adapters.
> And in doing so we can rely on the interrupt affinity from the block
> layer and drop the hand-crafted 'reply_map' construct.

Hannes, I understand the intent of this patch, as we have discussed the
same topic a couple of times in the past. We really don't need such an
interface on MR/IT HBAs because we have a single submission queue (h/w
registers). Mimicking multiple h/w submission queues really does not help
performance. Relevant discussion below:

https://marc.info/?l=linux-scsi&m=153078454611933&w=2

We can hit the max h/w and storage performance limits using nr_hw_queues
= 1. We could not find any reason to really set up more than one hw
queue.

Kashyap


Re: [PATCH v1] mpt3sas: Use driver scsi lookup to track outstanding IOs

2019-02-26 Thread Kashyap Desai
On Tue, Feb 26, 2019 at 8:23 PM Hannes Reinecke  wrote:
>
> On 2/26/19 3:33 PM, Christoph Hellwig wrote:
> > On Tue, Feb 26, 2019 at 02:49:30PM +0100, Hannes Reinecke wrote:
> >> Attached is a patch to demonstrate my approach.
> >> I am aware that it'll only be useful for latest upstream where the legacy
> >> I/O path has been dropped completely, so we wouldn't need to worry about 
> >> it.
> >> For older releases indeed you would need to with something like your
> >> proposed patch, but for upstream I really would like to switch to
> >> blk_mq_tagset_busy() iter.
> >
> > > While this is better than the driver-private tracking, we really
> > > should not have to iterate all outstanding commands, because if we
> > > have any there is a bug we need to fix in the higher layers instead
> > > of working around it in the drivers.

Hi Chris, looking at other drivers' code, I think a similar issue
impacts many scsi HBAs (like fnic, qla4xxx, snic, etc.). They also
need similar logic to traverse outstanding scsi commands.
One example: at the time of HBA reset the driver would like to
release scsi commands back to SML for retry, and for that the driver
needs to loop over all possible smids to figure out the scsi commands
outstanding at the firmware/SML level.

> >
> Ah-ha.
>
> But what else should we be doing here?
> The driver needs to abort all outstanding commands; I somewhat fail to
> see how we could be doing it otherwise ...

Hannes, the primary drawback of using blk_mq_tagset_busy_iter() is
that the API is blk-mq only and is not available in every kernel with
blk-mq support. We have seen multiple failures from customers whose
kernels do not support blk_mq_tagset_busy_iter(). In fact, the blk-mq
and non-mq stacks are both alive in many Linux distributions, and
customers are using them. If we scope the fix to current 5.x kernels
(which do not have non-mq support), we can opt for
blk_mq_tagset_busy_iter(). Earlier I requested that upstream accept
driver changes without blk_mq_tagset_busy_iter() because this is
hitting many customers, so we are looking for a generic fix which can
serve both blk-mq and non-mq.

We will certainly enhance and optimize this area (the driver finding
outstanding scsi commands) with the scsi mailing list as an
incremental approach.

Kashyap

>
> Cheers,
>
> Hannes
>


[scsi] write same with NDOB (no data-out buffer) support

2019-02-26 Thread Kashyap Desai
Hi Martin/Chris,

I was going through the discussion below, as well as the Linux scsi
code, to find out whether the Linux scsi stack supports NDOB. It looks
like the current scsi stack does not support NDOB. From the thread
below, it seems NDOB was not a good thing to support, considering that
device-level support was not stable. That is a fairly old discussion,
so I want to understand whether the current situation is good enough
to introduce NDOB support (assuming the underlying device works as
expected with NDOB)?

https://lore.kernel.org/patchwork/patch/673046/

One more question: what happens if a WRITE SAME w/ UNMAP command is
passed to the device without a zeroed data-out buffer in the current
scsi stack? Will it permanently disable WRITE SAME on that device?

Thanks, Kashyap

