RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
Bjorn/Robin,

Apologies for the delay. Here is one quick suggestion, as we have seen a similar issue (not exactly the same, but very likely the same root cause) when the controller is configured as pass-through to a VM and the VM reboots abruptly. In that case the driver interacts with FW in a state that may require a chip reset to bring the controller back to operation.

The relevant patch was submitted only for older controllers, as the issue was only seen on a few MegaRAID controllers. The patch below already attempts a chip reset, but only for a limited set of controllers. I have attached one more patch which performs the chip reset at driver load time for Thunderbolt/Invader/Fury etc. (In your case you have a Thunderbolt controller, so the attached patch is required.)

http://www.spinics.net/lists/linux-scsi/msg67288.html

Please post the result with the attached patch.

Thanks, Kashyap

> -Original Message-
> From: Bjorn Helgaas [mailto:bhelg...@google.com]
> Sent: Thursday, May 28, 2015 5:55 PM
> To: Robin H. Johnson
> Cc: Adam Radford; Neela Syam Kolli; linux-scsi@vger.kernel.org;
> arkadiusz.bub...@open-e.com; Matthew Garrett; Kashyap Desai; Sumit Saxena;
> Uday Lingala; megaraidlinux@avagotech.com; linux-...@vger.kernel.org;
> linux-ker...@vger.kernel.org; Jean Delvare; Myron Stowe
> Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> [+cc Jean, Myron]
>
> Hello megaraid maintainers,
>
> Have you been able to take a look at this at all? People have been reporting this
> issue since 2012 on upstream, Debian, and Ubuntu, and now we're getting
> reports on SLES.
>
> My theory is that the Linux driver relies on some MegaRAID initialization done by
> the option ROM, and the bug happens when the BIOS doesn't execute the option
> ROM.
>
> If that's correct, you should be able to reproduce it on any system by booting
> Linux (v3.3 or later) without running the MegaRAID SAS 2208 option ROM (either
> by enabling a BIOS "fast boot" switch, or modifying the BIOS to skip it).
If the > Linux driver doesn't rely on the option ROM, you might even be able to > reproduce it by physically removing the option ROM from the MegaRAID. > > Bjorn > > On Wed, Apr 29, 2015 at 12:28:32PM -0500, Bjorn Helgaas wrote: > > [+cc linux-pci, linux-kernel, Kashyap, Sumit, Uday, megaraidlinux.pdl] > > > > On Sun, Jul 13, 2014 at 01:35:51AM +, Robin H. Johnson wrote: > > > On Sat, Jul 12, 2014 at 11:29:20AM -0600, Bjorn Helgaas wrote: > > > > Thanks for the report, Robin. > > > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=63661 bisected the > > > > problem to 3c076351c402 ("PCI: Rework ASPM disable code"), which > > > > appeared in v3.3. For starters, can you verify that, e.g., by > > > > building > > > > 69166fbf02c7 (the parent of 3c076351c402) to make sure that it > > > > works, and building 3c076351c402 itself to make sure it fails? > > > > > > > > Assuming that's the case, please attach the complete dmesg and > > > > "lspci -vvxxx" output for both kernels to the bugzilla. ASPM is a > > > > feature that is configured on both ends of a PCIe link, so I want > > > > to see the lspci info for the whole system, not just the SAS adapters. > > > > > > > > It's not practical to revert 3c076351c402 now, so I'd also like to > > > > see the same information for the newest possible kernel (if this > > > > is possible; I'm not clear on whether you can boot your system or > > > > not) so we can figure out what needs to be changed. > > > TL;DR: FastBoot is leaving the MegaRaidSAS in a weird state, and it > > > fails to start; Commit 3c076351c402 did make it worse, but I think > > > we're right that the bug lies in the SAS code. > > > > > > Ok, I have done more testing on it (40+ boots), and I think we can > > > show the problem is somewhere in how the BIOS/EFI/ROM brings up the > > > card in FastBoot more, or how it leaves the card. > > > > I attached your dmesg and lspci logs to > > https://bugzilla.kernel.org/show_bug.cgi?id=63661, thank you! 
You did > > a huge amount of excellent testing and analysis, and I'm sorry that we > > haven't made progress using the results. > > > > I still think this is a megaraid_sas driver bug, but I don't have > > enough evidence to really point fingers. > > > > Based on your testing, before 3c076351c402 ("PCI: Rework ASPM disable > > code"), megaraid_sas worked reliably. After 3c076351c402, > > megaraid_sas does not work
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
Jean,

The patch is available in the repo below:
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git, branch for-next
Commit id: 6431f5d7c6025f8b007af06ea090de308f7e6881

If you share the megaraid_sas driver code of your tree, I can provide a patch for you.

` Kashyap

> -Original Message-
> From: Jean Delvare [mailto:jdelv...@suse.de]
> Sent: Monday, June 29, 2015 6:55 PM
> To: Kashyap Desai
> Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli;
> linux-s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett; Sumit
> Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org;
> linux-ker...@vger.kernel.org; Myron Stowe
> Subject: RE: megaraid_sas: "FW in FAULT state!!", how to get more debug
> output? [BKO63661]
>
> Hi Kashyap,
>
> Thanks for the patch. May I ask what tree it was based on? Linus'
> latest? I am trying to apply it to the SLES 11 SP3 and SLES 12 kernel
> trees (based on kernel v3.0 + a bunch of backports and v3.12
> respectively) but your patch fails to apply in both cases. I'll try harder
> but I don't know anything about the megaraid_sas code so I really don't
> know where I'm going.
>
> Does your patch depend on any other that may not be present in the SLES
> 11 SP3 and SLES 12 kernels?
>
> Thanks,
> Jean
>
> Le Thursday 28 May 2015 à 19:05 +0530, Kashyap Desai a écrit :
> > Bjorn/Robin,
> >
> > Apologies for delay. Here is one quick suggestion as we have seen
> > similar issue (not exactly similar, but high probably to have same
> > issue) while controller is configured on VM as pass-through and VM
> > reboot abruptly.
> > In that particular issue, driver interact with FW which may require
> > chip reset to bring controller to operation state.
> >
> > Relevant patch was submitted for only Older controller as it was only
> > seen for few MegaRaid controller.
Below patch already try to do chip > > reset, but only for limited controllers...I have attached one more > > patch which does chip reset from driver load time for > > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt > > controller, so attached patch is required.) > > > > http://www.spinics.net/lists/linux-scsi/msg67288.html > > > > Please post the result with attached patch. > > > > Thanks, Kashyap > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
> -Original Message- > From: Jean Delvare [mailto:jdelv...@suse.de] > Sent: Tuesday, July 07, 2015 2:14 PM > To: Kashyap Desai > Cc: Bjorn Helgaas; Robin H. Johnson; Adam Radford; Neela Syam Kolli; linux- > s...@vger.kernel.org; arkadiusz.bub...@open-e.com; Matthew Garrett; Sumit > Saxena; Uday Lingala; PDL,MEGARAIDLINUX; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; Myron Stowe > Subject: Re: megaraid_sas: "FW in FAULT state!!", how to get more debug > output? [BKO63661] > > Hi Kashyap, > > On Thu, 28 May 2015 19:05:35 +0530, Kashyap Desai wrote: > > Bjorn/Robin, > > > > Apologies for delay. Here is one quick suggestion as we have seen > > similar issue (not exactly similar, but high probably to have same > > issue) while controller is configured on VM as pass-through and VM reboot > abruptly. > > In that particular issue, driver interact with FW which may require > > chip reset to bring controller to operation state. > > > > Relevant patch was submitted for only Older controller as it was only > > seen for few MegaRaid controller. Below patch already try to do chip > > reset, but only for limited controllers...I have attached one more > > patch which does chip reset from driver load time for > > Thunderbolt/Invader/Fury etc. (In your case you have Thunderbolt > > controller, so attached patch is required.) > > > > http://www.spinics.net/lists/linux-scsi/msg67288.html > > > > Please post the result with attached patch. > > Good news! Customer tested your patch and said it fixed the problem :-) > > I am now in the process of backporting the patch to the SLES 11 SP3 kernel for > further testing. I'll let you know how it goes. Thank you very much for your > assistance. Thanks for confirmation. Whatever patch I submitted to you, we have added recently (as part of common interface approach to do chip reset at load time). We will be submitting that patch to mainline soon. 
~ Kashyap

> --
> Jean Delvare
> SUSE L3 Support
RE: megaraid_sas: "FW in FAULT state!!", how to get more debug output? [BKO63661]
> > I am about to commit the patch that was successfully tested by the customer on
> > SLES 12, but I'm a bit confused. The upstream patch you referred to is:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?h=for-next&id=6431f5d7c6025f8b007af06ea090de308f7e6881
> > [SCSI] megaraid_sas: megaraid_sas driver init fails in kdump kernel
> >
> > But the patch I used is the one you sent by e-mail on May 28th. It is
> > completely different!
> >
> > So what am I supposed to do? Use the patch you sent (and that was tested
> > by the customer) for SLES 11 SP3 and SLES 12? Or was it just for testing and
> > the proper way of fixing the problem would be to backport the upstream commit?

You can use that patch as a valid candidate for upstream submission. One of the MR maintainers (Sumit Saxena) will send that patch; we are just organizing the other patch series. Since SLES has already ported the patch without a commit id, we are fine. For reference: the patch sent via e-mail will be submitted upstream very soon along with the other patch set.

Thanks, Kashyap

> > Please advise,
> --
> Jean Delvare
> SUSE L3 Support
RE: [PATCH 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware
> -Original Message-
> From: James Bottomley [mailto:j...@linux.vnet.ibm.com]
> Sent: Monday, October 17, 2016 10:50 PM
> To: Ric Wheeler; Hannes Reinecke; Sumit Saxena; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; the...@redhat.com;
> kashyap.de...@broadcom.com; Christoph Hellwig; Martin K. Petersen; Jeff
> Moyer; Gris Ge; Ewan Milne; Jens Axboe
> Subject: Re: [PATCH 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command
> to firmware
>
> On Mon, 2016-10-17 at 12:27 -0400, Ric Wheeler wrote:
> > On 10/17/2016 12:20 PM, James Bottomley wrote:
> > > On Mon, 2016-10-17 at 09:01 -0400, Ric Wheeler wrote:
> > > > On 10/17/2016 07:34 AM, Hannes Reinecke wrote:
> > > > > On 10/17/2016 12:24 PM, Sumit Saxena wrote:
> > > > > > The megaraid_sas driver returns the SYNCHRONIZE_CACHE command
> > > > > > back to the SCSI layer without sending it to firmware, as firmware
> > > > > > takes care of flushing cache. This patch changes the driver
> > > > > > behavior wrt SYNCHRONIZE_CACHE handling. If the underlying
> > > > > > firmware has support to handle SYNCHRONIZE_CACHE, the driver
> > > > > > will send it to firmware; otherwise it completes it back to the SCSI
> > > > > > layer with SUCCESS immediately. If firmware handles
> > > > > > SYNCHRONIZE_CACHE for both VD and JBOD, the "canHandleSyncCache"
> > > > > > bit in the scratch pad register (offset 0x00B4) will be set.
> > > > > >
> > > > > > This behavior can be controlled via a module parameter, and the user
> > > > > > can fall back to the old behavior of returning SYNCHRONIZE_CACHE
> > > > > > from the driver only, without sending it to firmware.
> > > > > >
> > > > > > Signed-off-by: Sumit Saxena
> > > > > > Signed-off-by: Kashyap Desai
> > > > > > ---
> > > > > >  drivers/scsi/megaraid/megaraid_sas.h        | 3 +++
> > > > > >  drivers/scsi/megaraid/megaraid_sas_base.c   | 14 ++---
> > > > > > -
> > > > > >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 3 +++
> > > > > >  drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 ++
> > > > > >  4 files changed, 14 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > b/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > index ca86c88..43fd14f 100644
> > > > > > --- a/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > +++ b/drivers/scsi/megaraid/megaraid_sas.h
> > > > > > @@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
> > > > > >  #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT	14
> > > > > >  #define MR_MAX_MSIX_REG_ARRAY			16
> > > > > >  #define MR_RDPQ_MODE_OFFSET			0X00800000
> > > > > > +#define MR_CAN_HANDLE_SYNC_CACHE_OFFSET	0X01000000
> > > > > > +
> > > > > >  /*
> > > > > >   * register set for both 1068 and 1078 controllers
> > > > > >   * structure extended for 1078 registers
> > > > > > @@ -2140,6 +2142,7 @@ struct megasas_instance {
> > > > > >  	u8 is_imr;
> > > > > >  	u8 is_rdpq;
> > > > > >  	bool dev_handle;
> > > > > > +	bool fw_sync_cache_support;
> > > > > >  };
> > > > > >  struct MR_LD_VF_MAP {
> > > > > >  	u32 size;
> > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > index 092387f..a4a8e2f 100644
> > > > > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > > > > @@ -104,6 +104,10 @@ unsigned int scmd_timeout = MEGASAS_DEFAULT_CMD_TIMEOUT;
> > > > > >  module_param(scmd_timeout, int, S_IRUGO);
> > > > > >  MODULE_PARM_DESC(scmd_timeout, "scsi command timeout (10-90s), default 90s.
See megasas_reset_timer.");
> > > > > >
> > > > > > +bool block_sync_cache;
> >
Device or HBA QD throttling creates holes in Sequential work load
Hi,

I am doing some performance tuning in the MR driver to understand how sdev queue depth and HBA queue depth play a role in IO submission from the layer above.

I have 24 JBODs connected to an MR 12Gb/s controller, and I see performance for a 4K sequential workload as below. The HBA QD for the MR controller is 4065, and the per-device QD is set to 32:

queue depth 256 reports 300K IOPS
queue depth 128 reports 330K IOPS
queue depth 64 reports 360K IOPS
queue depth 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO comes to the driver as random IO whenever the queue depth is more than 32. I have debugged using the scsi logging level and blktrace as well. Below is a snippet of the logs using the scsi logging level. In summary, if the SML does flow control of IO due to the device QD or HBA QD, IO coming to the LLD has a more random pattern.

I see that IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00 <- After 3c it jumps to 5b. Sequences are overlapped, due to sdev QD throttling.
[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00

If scsi_request_fn() breaks due to unavailability of the device queue (because of the check below), will there be any side effect such as the one I observe?

if (!scsi_dev_queue_ready(q, sdev))
        break;

If I reduce the HBA QD and make sure IO from the layer above is throttled due to the HBA QD, there is the same impact. The MR driver uses a host-wide shared tag map.

Can someone help me determine whether this can be made tunable in the LLD via additional settings, or whether it is expected behavior? The problem I am facing is that I am not able to figure out the optimal device queue depth for different configurations and workloads.

` Kashyap
Re: Device or HBA QD throttling creates holes in Sequential work load
Reply to see if email reached to linux-scsi@vger.kernel.org. On Thu, Oct 20, 2016 at 12:07 PM, Kashyap Desai wrote: > Hi, > > I am doing some performance tuning in MR driver to understand how sdev queue > depth and hba queue depth play role in IO submission from above layer. > > I have 24 JBOD connected to MR 12GB controller and I can see performance > for 4K Sequential work load as below. > > HBA QD for MR controller is 4065 and Per device QD is set to 32 > > > > queue depth from 256 reports 300K IOPS > > queue depth from 128 reports 330K IOPS > > queue depth from 64 reports 360K IOPS > > queue depth from 32 reports 510K IOPS > > > > In MR driver I added debug print and confirm that more IO come to driver as > random IO whenever I have queue depth more than 32. > > I have debug using scsi logging level and blktrace as well. Below is snippet > of logs using scsi logging level. In summary, if SML do flow control of IO > due to Device QD or HBA QD, IO coming to LLD is more random pattern. > > > > I see IO coming to driver is not sequential. > > > > [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b > 00 00 01 00 [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 > 00 03 c0 3c 00 00 01 00 [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: > Write(10) 2a 00 00 03 c0 5b 00 00 01 00 <- After 3c it jumps to 5b. Sequence > are overlapped. Due to sdev QD throttling. 
> > [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c > 00 00 01 00 [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 > 00 03 c0 3d 00 00 01 00 [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: > Write(10) 2a 00 00 03 c0 5d 00 00 01 00 [79546.912259] sd 18:2:21:0: [sdy] > tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00 [79546.912268] sd > 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00 > > > > If scsi_request_fn() breaks due to unavailability of device queue (due to > below check), will there be any side defect as I observe ? > > if (!scsi_dev_queue_ready(q, sdev)) > > break; > > If I reduce HBA QD and make sure IO from above layer is throttled due to > HBA QD, there is a same impact. MR driver use host wide shared tag map. > > Can someone help me if this can be tunable in LLD providing additional > settings or it is expected behavior ? Problem I am facing is, I am not able > to figure out optimal device queue depth for different configuration and > work load. > > ` Kashyap > > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to avoid double lock
> -Original Message-
> From: i...@itu.dk [mailto:i...@itu.dk]
> Sent: Monday, October 17, 2016 1:00 PM
> To: Jiri Kosina
> Cc: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley; Martin K.
> Petersen; megaraidlinux@avagotech.com; linux-scsi@vger.kernel.org; Iago
> Abal
> Subject: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to
> avoid double lock
>
> From: Iago Abal
>
> The EBA code analyzer (https://github.com/models-team/eba) reported the
> following double lock:
>
> 1. In function `megaraid_reset_handler' at 2571;
> 2. take `&adapter->pend_list_lock' for the first time at 2602:
>
>        // FIRST
>        spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> 3. enter the `list_for_each_entry_safe' loop at 2603;
> 4. call `megaraid_mbox_mm_done' at 2616;
> 5. call `megaraid_mbox_runpendq' at 3782;
> 6. take `&adapter->pend_list_lock' for the second time at 1892:
>
>        // SECOND: DOUBLE LOCK !!!
>        spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>
> From my shallow understanding of the code (so please review carefully), I think
> that it is not necessary to hold `PENDING_LIST_LOCK(adapter)' while executing
> the body of the `list_for_each_entry_safe' loop. I assume this because both
> `megaraid_mbox_mm_done' and `megaraid_dealloc_scb' are called from
> several places where, as far as I can tell, this lock is not held. In fact, as reported
> by EBA, at some point `megaraid_mbox_mm_done' will acquire this lock again.
> >
> Fixes: c005fb4fb2d2 ("[SCSI] megaraid_{mm,mbox}: fix a bug in reset handler")
> Signed-off-by: Iago Abal
> ---
>  drivers/scsi/megaraid/megaraid_mbox.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_mbox.c
> b/drivers/scsi/megaraid/megaraid_mbox.c
> index f0987f2..7f11898 100644
> --- a/drivers/scsi/megaraid/megaraid_mbox.c
> +++ b/drivers/scsi/megaraid/megaraid_mbox.c
> @@ -2603,6 +2603,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>  	list_for_each_entry_safe(scb, tmp, &adapter->pend_list, list) {
>  		list_del_init(&scb->list);	// from pending list
>
> +		spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);
>  		if (scb->sno >= MBOX_MAX_SCSI_CMDS) {
>  			con_log(CL_ANN, (KERN_WARNING
>  				"megaraid: IOCTL packet with %d[%d:%d] being reset\n",
> @@ -2630,6 +2631,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>
>  		megaraid_dealloc_scb(adapter, scb);
>  	}
> +	spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>  }
>  spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);

This looks correct, but please note that MEGARAID_MAILBOX and MEGARAID_MM are not supported by LSI/Broadcom. We will get back to you shortly on whether we can safely remove those two modules.

> --
> 1.9.1
Device or HBA level QD throttling creates randomness in sequetial workload
[ Apologies if you find more than one instance of my email. The web-based email client has some issues, so I am now trying git send-email. ]

Hi,

I am doing some performance tuning in the MR driver to understand how sdev queue depth and HBA queue depth play a role in IO submission from the layer above.

I have 24 JBODs connected to an MR 12Gb/s controller, and I see performance for a 4K sequential workload as below. The HBA QD for the MR controller is 4065 and the per-device QD is set to 32:

queue depth 256 reports 300K IOPS
queue depth 128 reports 330K IOPS
queue depth 64 reports 360K IOPS
queue depth 32 reports 510K IOPS

In the MR driver I added a debug print and confirmed that more IO comes to the driver as random IO whenever the queue depth is more than 32. I have debugged using the scsi logging level and blktrace as well. Below is a snippet of the logs using the scsi logging level. In summary, if the SML does flow control of IO due to the device QD or HBA QD, IO coming to the LLD has a more random pattern.

I see that IO coming to the driver is not sequential.

[79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
[79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
[79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00

After LBA "00 03 c0 3c" the next command is with LBA "00 03 c0 5b". The two sequences are overlapped due to sdev QD throttling.

[79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
[79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
[79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
[79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
[79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00

If scsi_request_fn() breaks due to unavailability of the device queue (because of the check below), will there be any side effect such as the one I observe?

if (!scsi_dev_queue_ready(q, sdev))
        break;

If I reduce the HBA QD and make sure IO from the layer above is throttled due to the HBA QD, there is the same impact. The MR driver uses a host-wide shared tag map.

Can someone help me determine whether this can be made tunable in the LLD via additional settings, or whether it is expected behavior? The problem I am facing is that I am not able to figure out the optimal device queue depth for different configurations and workloads.

Thanks, Kashyap
RE: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware
> -Original Message- > From: James Bottomley [mailto:j...@linux.vnet.ibm.com] > Sent: Friday, October 21, 2016 4:32 PM > To: Sumit Saxena; linux-scsi@vger.kernel.org > Cc: martin.peter...@oracle.com; the...@redhat.com; > kashyap.de...@broadcom.com; sta...@vger.kernel.org > Subject: Re: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE > command to firmware > > On Thu, 2016-10-20 at 02:05 -0700, Sumit Saxena wrote: > > From previous patch we have below changes in v2 - 1. Updated change > > log. Provided more detail in change log. > > 2. Agreed to remove module parameter. If we remove module parameter, > > we > > can ask customer to disable WCE on drive to get similar impact. > > 3. Always Send SYNCHRONIZE_CACHE for JBOD (non Raid) Device to > > Firmware. > > > > Current megaraid_sas driver returns SYNCHRONIZE_CACHE(related to Drive > > Cache) command back to SCSI layer without sending it down to > > firmware as firmware supposed to take care of flushing disk cache for > > all drives belongs to Virtual Disk at the time of system > > reboot/shutdown. > > > > We evaluate and understood that for Raid Volume, why driver should not > > send SYNC_CACHE command to the Firmware. > > There may be a some reason in past, but now it looks to be not > > required and we have fixed this issue as part of this patch. > > > > Commit- " 02b01e0 [SCSI] megaraid_sas: return sync cache call with > > success" > > added the code in driver to return SYNCHRONIZE_CACHE without sending > > it to firmware back in 2007. Earlier MR was mainly for Virtual Disk, > > so same code continue for JBOD as well whenever JBOD support was added > > and it introduced bug that SYNCHRONIZE_CACHE is not passed to FW for > > JBOD (non Raid disk). > > > > But our recent analysis indicates that, From Day-1 MR Firmware always > > expect Driver to forward SYNCHRONIZE_CACHE for JBOD (non Raid disk) to > > the Firmware. > > We have fixed this as part of this patch. 
> > > > - Additional background - > > There are some instance of MegaRaid FW (E.a Gen2 and Gen2.5 FW) set > > WCE bit for Virtual Disk but firmware does not reply correct status > > for SYNCH_CACHE. > > It is very difficult to find out which Device ID and firmware has > > capability to manage SYNC_CACHE, so we managed to figure out which are > > the new firmware (canHandleSyncCache is set in scratch pad register at > > 0xB4) and use that interface for correct behavior as explained above. > > > > E.g Liberator/Thunderbolt MegaRaid FW returns SYNC_CACHE as > > Unsupported command (this is a firmware bug) and eventually command > > will be failed to SML. > > This will cause File system to go Read-only. > > > > - New behavior described - > > > > IF application requests SYNCH_CACHE > >IF 'JBOD' > > Driver sends SYNCH_CACHE command to the FW > >FW sends SYNCH_CACHE to drive > >FW obtains status from drive and returns same status > > back to driver > >ELSEIF 'VirtualDisk' > >IF any FW which supports new API bit called > > canHandleSyncCache > > Driver sends SYNCH_CACHE command to the > > FW > > FW does not send SYNCH_CACHE to drives > > FW returns SUCCESS > >ELSE > > Driver does not send SYNCH_CACHE command > > to the FW. > > Driver return SUCCESS for that command. > >ENDIF > > ENDIF > > ENDIF > > > > CC: sta...@vger.kernel.org > > Can you please split this into two, so we can backport the bug fix without > any of > the feature addition junk? James - I am sending new patch making two logical fix as you requested. JBOD fix will be marked for CC:stable and for Virtual disk I will post only for head (not for stable). Let me know if resending complete series is good or resending this patch is better. 
> > > Signed-off-by: Kashyap Desai > > Signed-off-by: Sumit Saxena > > --- > > drivers/scsi/megaraid/megaraid_sas.h| 3 +++ > > drivers/scsi/megaraid/megaraid_sas_base.c | 10 ++ > > drivers/scsi/megaraid/megaraid_sas_fusion.c | 5 + > > 3 files changed, 10 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h > > b/drivers/scsi/megaraid/megaraid_sas.h > > index ca86c88..43fd14f 100644 > > --- a/drivers/scsi/me
RE: Device or HBA level QD throttling creates randomness in sequential workload
Hi -

I found the below conversation, and it is along the same lines as the input
I wanted from the mailing list.

http://marc.info/?l=linux-kernel&m=147569860526197&w=2

I can do testing on any WIP item as Omar mentioned in the above discussion.
https://github.com/osandov/linux/tree/blk-mq-iosched

Is there any workaround/alternative in the latest upstream kernel, if a user
wants to see a limited penalty for a sequential workload on HDD?

` Kashyap

> -----Original Message-----
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, October 20, 2016 3:39 PM
> To: linux-scsi@vger.kernel.org
> Subject: Device or HBA level QD throttling creates randomness in
> sequential workload
>
> [ Apologize, if you find more than one instance of my email.
> Web based email client has some issue, so now trying git send mail.]
>
> Hi,
>
> I am doing some performance tuning in the MR driver to understand how
> sdev queue depth and hba queue depth play a role in IO submission from
> the above layer. I have 24 JBODs connected to a MR 12GB controller, and
> I see the following performance for a 4K sequential workload:
>
> HBA QD for the MR controller is 4065 and per-device QD is set to 32.
>
> queue depth 256 reports 300K IOPS
> queue depth 128 reports 330K IOPS
> queue depth  64 reports 360K IOPS
> queue depth  32 reports 510K IOPS
>
> In the MR driver I added a debug print and confirmed that more IO comes
> to the driver as random IO whenever I have a queue depth of more than 32.
>
> I have debugged using the scsi logging level and blktrace as well. Below
> is a snippet of logs using the scsi logging level. In summary, if the SML
> does flow control of IO due to device QD or HBA QD, the IO coming to the
> LLD has a more random pattern.
>
> I see IO coming to the driver is not sequential.
> [79546.912041] sd 18:2:21:0: [sdy] tag#854 CDB: Write(10) 2a 00 00 03 c0 3b 00 00 01 00
> [79546.912049] sd 18:2:21:0: [sdy] tag#855 CDB: Write(10) 2a 00 00 03 c0 3c 00 00 01 00
> [79546.912053] sd 18:2:21:0: [sdy] tag#886 CDB: Write(10) 2a 00 00 03 c0 5b 00 00 01 00
>
> After LBA "00 03 c0 3c" the next command is with LBA "00 03 c0 5b".
> Two sequences are overlapped due to sdev QD throttling.
>
> [79546.912056] sd 18:2:21:0: [sdy] tag#887 CDB: Write(10) 2a 00 00 03 c0 5c 00 00 01 00
> [79546.912250] sd 18:2:21:0: [sdy] tag#856 CDB: Write(10) 2a 00 00 03 c0 3d 00 00 01 00
> [79546.912257] sd 18:2:21:0: [sdy] tag#888 CDB: Write(10) 2a 00 00 03 c0 5d 00 00 01 00
> [79546.912259] sd 18:2:21:0: [sdy] tag#857 CDB: Write(10) 2a 00 00 03 c0 3e 00 00 01 00
> [79546.912268] sd 18:2:21:0: [sdy] tag#858 CDB: Write(10) 2a 00 00 03 c0 3f 00 00 01 00
>
> If scsi_request_fn() breaks due to unavailability of the device queue
> (due to the below check), will there be any side effect as I observe?
>
>         if (!scsi_dev_queue_ready(q, sdev))
>                 break;
>
> If I reduce the HBA QD and make sure IO from the above layer is throttled
> due to HBA QD, there is the same impact.
> The MR driver uses a host-wide shared tag map.
>
> Can someone help me decide whether this can be made tunable in the LLD by
> providing additional settings, or whether it is expected behavior? The
> problem I am facing is that I am not able to figure out the optimal
> device queue depth for different configurations and workloads.
>
> Thanks, Kashyap
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/7] megaraid_sas: Updates for scsi-next
Changes in v3:
1. Two logical patches created for "Send SYNCHRONIZE_CACHE command to firmware":
   - Send SYNCHRONIZE_CACHE for JBOD to firmware
   - Send SYNCHRONIZE_CACHE for VD to firmware

Changes in v2:
1. Removed unconditional msleep and moved calls to atomic_read into
   megasas_wait_for_adapter_operational
2. Updated change log for patch #4. Provided more detail in change log.
3. Agreed to remove module parameter. If we remove module parameter, we
   can ask customer to disable WCE on drive to get similar impact.
4. Always send SYNCHRONIZE_CACHE for JBOD (non Raid) device to firmware.
5. Add log message printing the state of FW sync cache support
6. Moved version update patch to end of series

Sumit Saxena (7):
  megaraid_sas: For SRIOV enabled firmware, ensure VF driver waits for
    30secs before reset
  megaraid_sas: Send correct PhysArm to FW for R1 VD downgrade
  megaraid_sas: Do not fire DCMDs during PCI shutdown/detach
  megaraid_sas: Send SYNCHRONIZE_CACHE for JBOD to firmware
  megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware
  MAINTAINERS: Update megaraid maintainers list
  megaraid_sas: Do not set MPI2_TYPE_CUDA for JBOD FP path for FW which
    does not support JBOD sequence map
  megaraid_sas: driver version upgrade

 MAINTAINERS                                 | 10 +++---
 drivers/scsi/megaraid/megaraid_sas.h        |  7 +++--
 drivers/scsi/megaraid/megaraid_sas_base.c   | 49 -
 drivers/scsi/megaraid/megaraid_sas_fp.c     |  6 ++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 23 +-
 5 files changed, 71 insertions(+), 24 deletions(-)

--
1.8.3.1
[PATCH v3 2/8] megaraid_sas: Send correct PhysArm to FW for R1 VD downgrade
This patch fixes the issue of the wrong PhysArm being sent to firmware for
an R1 VD downgrade.

Signed-off-by: Kiran Kumar Kasturi
Signed-off-by: Sumit Saxena
Reviewed-by: Hannes Reinecke
Reviewed-by: Tomas Henzl
---
 drivers/scsi/megaraid/megaraid_sas_fp.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fp.c b/drivers/scsi/megaraid/megaraid_sas_fp.c
index e413113..f237d00 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fp.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fp.c
@@ -782,7 +782,8 @@ static u8 mr_spanset_get_phy_params(struct megasas_instance *instance, u32 ld,
 		(raid->regTypeReqOnRead != REGION_TYPE_UNUSED
 		pRAID_Context->regLockFlags = REGION_TYPE_EXCLUSIVE;
 	else if (raid->level == 1) {
-		pd = MR_ArPdGet(arRef, physArm + 1, map);
+		physArm = physArm + 1;
+		pd = MR_ArPdGet(arRef, physArm, map);
 		if (pd != MR_PD_INVALID)
 			*pDevHandle = MR_PdDevHandleGet(pd, map);
 	}
@@ -879,7 +880,8 @@ u8 MR_GetPhyParams(struct megasas_instance *instance, u32 ld, u64 stripRow,
 		pRAID_Context->regLockFlags = REGION_TYPE_EXCLUSIVE;
 	else if (raid->level == 1) {
 		/* Get alternate Pd. */
-		pd = MR_ArPdGet(arRef, physArm + 1, map);
+		physArm = physArm + 1;
+		pd = MR_ArPdGet(arRef, physArm, map);
 		if (pd != MR_PD_INVALID)
 			/* Get dev handle from Pd */
 			*pDevHandle = MR_PdDevHandleGet(pd, map);
--
2.4.11
[PATCH v3 3/8] megaraid_sas: Do not fire DCMDs during PCI shutdown/detach
This patch addresses the issue of the driver firing DCMDs in the PCI
shutdown/detach path irrespective of firmware state. The driver will now
check whether firmware is in an operational state before firing DCMDs. If
firmware is in an unrecoverable state, or does not become operational
within the specified time, the driver will skip firing DCMDs.

Signed-off-by: Sumit Saxena
Signed-off-by: Shivasharan Srikanteshwara
---
 drivers/scsi/megaraid/megaraid_sas_base.c   | 39 +
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 9 ---
 2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index c3efcc7..ba57be6 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -6248,6 +6248,34 @@ fail_reenable_msix:
 #define megasas_resume NULL
 #endif

+static inline int
+megasas_wait_for_adapter_operational(struct megasas_instance *instance)
+{
+	int wait_time = MEGASAS_RESET_WAIT_TIME * 2;
+	int i;
+
+	if (atomic_read(&instance->adprecovery) == MEGASAS_HW_CRITICAL_ERROR)
+		return 1;
+
+	for (i = 0; i < wait_time; i++) {
+		if (atomic_read(&instance->adprecovery) == MEGASAS_HBA_OPERATIONAL)
+			break;
+
+		if (!(i % MEGASAS_RESET_NOTICE_INTERVAL))
+			dev_notice(&instance->pdev->dev, "waiting for controller reset to finish\n");
+
+		msleep(1000);
+	}
+
+	if (atomic_read(&instance->adprecovery) != MEGASAS_HBA_OPERATIONAL) {
+		dev_info(&instance->pdev->dev, "%s timed out while waiting for HBA to recover.\n",
+			 __func__);
+		return 1;
+	}
+
+	return 0;
+}
+
 /**
  * megasas_detach_one -	PCI hot"un"plug entry point
  * @pdev:		PCI device structure
@@ -6272,9 +6300,14 @@ static void megasas_detach_one(struct pci_dev *pdev)
 	if (instance->fw_crash_state != UNAVAILABLE)
 		megasas_free_host_crash_buffer(instance);
 	scsi_remove_host(instance->host);
+
+	if (megasas_wait_for_adapter_operational(instance))
+		goto skip_firing_dcmds;
+
 	megasas_flush_cache(instance);
 	megasas_shutdown_controller(instance, MR_DCMD_CTRL_SHUTDOWN);
+skip_firing_dcmds:
 	/* cancel the delayed work if this work still in queue*/
 	if (instance->ev != NULL) {
 		struct megasas_aen_event *ev = instance->ev;
@@ -6388,8 +6421,14 @@ static void megasas_shutdown(struct pci_dev *pdev)
 	struct megasas_instance *instance = pci_get_drvdata(pdev);

 	instance->unload = 1;
+
+	if (megasas_wait_for_adapter_operational(instance))
+		goto skip_firing_dcmds;
+
 	megasas_flush_cache(instance);
 	megasas_shutdown_controller(instance, MR_DCMD_CTRL_SHUTDOWN);
+
+skip_firing_dcmds:
 	instance->instancet->disable_intr(instance);
 	megasas_destroy_irqs(instance);
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0de7de3..ec626fc 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2463,12 +2463,15 @@ irqreturn_t megasas_isr_fusion(int irq, void *devp)
 			/* Start collecting crash, if DMA bit is done */
 			if ((fw_state == MFI_STATE_FAULT) && dma_state)
 				schedule_work(&instance->crash_init);
-			else if (fw_state == MFI_STATE_FAULT)
-				schedule_work(&instance->work_init);
+			else if (fw_state == MFI_STATE_FAULT) {
+				if (instance->unload == 0)
+					schedule_work(&instance->work_init);
+			}
 		} else if (fw_state == MFI_STATE_FAULT) {
 			dev_warn(&instance->pdev->dev, "Iop2SysDoorbellInt"
 			       "for scsi%d\n", instance->host->host_no);
-			schedule_work(&instance->work_init);
+			if (instance->unload == 0)
+				schedule_work(&instance->work_init);
 		}
 	}
--
2.4.11
[PATCH v3 1/8] megaraid_sas: For SRIOV enabled firmware, ensure VF driver waits for 30secs before reset
For SRIOV enabled firmware, if there is an OCR (online controller reset)
possibility, the driver sets the convert flag to 1. This was not happening
when commands remained outstanding even after 180 seconds. As the driver
did not set the convert flag to 1 but still proceeded with the OCR, the VF
(virtual function) driver wrote directly to the register instead of
waiting for 30 seconds. Setting the convert flag to 1 makes the VF driver
wait for 30 secs before going for reset.

CC: sta...@vger.kernel.org
Signed-off-by: Kiran Kumar Kasturi
Signed-off-by: Sumit Saxena
Reviewed-by: Hannes Reinecke
Reviewed-by: Tomas Henzl
---
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index ec83754..0de7de3 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2823,6 +2823,7 @@ int megasas_wait_for_outstanding_fusion(struct megasas_instance *instance,
 		dev_err(&instance->pdev->dev, "pending commands remain after waiting, "
 		       "will reset adapter scsi%d.\n",
 		       instance->host->host_no);
+		*convert = 1;
 		retval = 1;
 	}
 out:
--
2.4.11
[PATCH v3 4/8] megaraid_sas: Send SYNCHRONIZE_CACHE for non-raid to firmware
Commit 02b01e0 ("[SCSI] megaraid_sas: return sync cache call with
success") added code to the driver back in 2007 to return
SYNCHRONIZE_CACHE without sending it to firmware. Earlier MegaRAID was
mainly for virtual disks, so the same code continued to apply to JBOD when
JBOD support was added, and it introduced a bug: SYNCHRONIZE_CACHE is not
passed to FW for JBOD (non-RAID) disks.

Our recent analysis indicates that, from day one, MR firmware has always
expected the driver to forward SYNCHRONIZE_CACHE for JBOD (non-RAID)
disks to the firmware. We have fixed this as part of this patch.

CC: sta...@vger.kernel.org
Signed-off-by: Kashyap Desai
Signed-off-by: Sumit Saxena
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index ba57be6..c98d4f9 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 		goto out_done;
 	}

-	switch (scmd->cmnd[0]) {
-	case SYNCHRONIZE_CACHE:
-		/*
-		 * FW takes care of flush cache on its own
-		 * No need to send it down
-		 */
+	/*
+	 * FW takes care of flush cache on its own for Virtual Disk.
+	 * No need to send it down for VD. For JBOD send SYNCHRONIZE_CACHE to FW.
+	 */
+	if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd)) {
 		scmd->result = DID_OK << 16;
 		goto out_done;
-	default:
-		break;
 	}

 	return instance->instancet->build_and_issue_cmd(instance, scmd);
--
2.4.11
[PATCH v3 5/8] megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware
From the previous patch we have the below changes in v2 (only for virtual
disk):
1. Updated change log. Provided more detail in change log.
2. Agreed to remove module parameter. If we remove the module parameter,
   we can ask the customer to disable WCE on the drive to get a similar
   impact.
3. Always send SYNCHRONIZE_CACHE for JBOD (non-RAID) devices to firmware.

The current megaraid_sas driver returns the SYNCHRONIZE_CACHE command
(related to drive cache) back to the SCSI layer without sending it down to
firmware, as firmware is supposed to take care of flushing disk cache for
all drives belonging to a virtual disk at the time of system
reboot/shutdown.

We evaluated why the driver should not send the SYNC_CACHE command to
firmware for a RAID volume. There may have been some reason in the past,
but it no longer looks to be required, and we have fixed this issue as
part of this patch.

- Additional background -
There are some instances of MegaRAID FW (e.g. Gen2 and Gen2.5 FW) that set
the WCE bit for a virtual disk but do not reply with correct status for
SYNC_CACHE. It is very difficult to find out which device IDs and firmware
have the capability to manage SYNC_CACHE, so we managed to figure out
which are the new firmware (canHandleSyncCache is set in the scratch pad
register at 0xB4) and use that interface for correct behavior as explained
above.

E.g. Liberator/Thunderbolt MegaRAID FW returns SYNC_CACHE as an
unsupported command (this is a firmware bug), and eventually the command
will be failed to the SML. This will cause the file system to go
read-only.

- New behavior described -

IF application requests SYNCH_CACHE
	IF 'JBOD'
		Driver sends SYNCH_CACHE command to the FW
		FW sends SYNCH_CACHE to drive
		FW obtains status from drive and returns same status back to driver
	ELSEIF 'VirtualDisk'
		IF any FW which supports new API bit called canHandleSyncCache
			Driver sends SYNCH_CACHE command to the FW
			FW does not send SYNCH_CACHE to drives
			FW returns SUCCESS
		ELSE
			Driver does not send SYNCH_CACHE command to the FW.
			Driver returns SUCCESS for that command.
		ENDIF
	ENDIF
ENDIF

Signed-off-by: Kashyap Desai
Signed-off-by: Sumit Saxena
---
 drivers/scsi/megaraid/megaraid_sas.h        | 3 +++
 drivers/scsi/megaraid/megaraid_sas_base.c   | 7 ++-
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 5 +
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index ca86c88..43fd14f 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -1429,6 +1429,8 @@ enum FW_BOOT_CONTEXT {
 #define MR_MAX_REPLY_QUEUES_EXT_OFFSET_SHIFT	14
 #define MR_MAX_MSIX_REG_ARRAY			16
 #define MR_RDPQ_MODE_OFFSET			0X0080
+#define MR_CAN_HANDLE_SYNC_CACHE_OFFSET		0X0100
+
 /*
  * register set for both 1068 and 1078 controllers
  * structure extended for 1078 registers
@@ -2140,6 +2142,7 @@ struct megasas_instance {
 	u8 is_imr;
 	u8 is_rdpq;
 	bool dev_handle;
+	bool fw_sync_cache_support;
 };
 struct MR_LD_VF_MAP {
 	u32 size;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index c98d4f9..236b8ed 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1700,11 +1700,8 @@ megasas_queue_command(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 		goto out_done;
 	}

-	/*
-	 * FW takes care of flush cache on its own for Virtual Disk.
-	 * No need to send it down for VD. For JBOD send SYNCHRONIZE_CACHE to FW.
-	 */
-	if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd)) {
+	if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && MEGASAS_IS_LOGICAL(scmd) &&
+	    (!instance->fw_sync_cache_support)) {
 		scmd->result = DID_OK << 16;
 		goto out_done;
 	}
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index ec626fc..0edeeb1 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -748,6 +748,11 @@ megasas_ioc_init_fusion(struct megasas_instance *instance)
 		goto fail_fw_init;
 	}

+	instance->fw_sync_cache_support = (scratch_pad_2 &
+		MR_CAN_HANDLE_SYNC_CACHE_OFFSET) ? 1 : 0;
+	dev_info(&instance->pdev->dev, "FW supports sync cache\t: %s\n",
+		 instance->fw_sync_cache_support ? "Yes" : "No");
+
 	IOCInitMessage = dma_alloc_coherent(&instance->pdev-&
[PATCH v3 6/8] MAINTAINERS: Update megaraid maintainers list
Update MEGARAID drivers maintainers list.

Signed-off-by: Sumit Saxena
Reviewed-by: Hannes Reinecke
---
 MAINTAINERS | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index f0ee7a6..8b9117f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7612,12 +7612,12 @@ S:	Maintained
 F:	drivers/net/wireless/mediatek/mt7601u/

 MEGARAID SCSI/SAS DRIVERS
-M:	Kashyap Desai
-M:	Sumit Saxena
-M:	Uday Lingala
-L:	megaraidlinux@avagotech.com
+M:	Kashyap Desai
+M:	Sumit Saxena
+M:	Shivasharan S
+L:	megaraidlinux@broadcom.com
 L:	linux-scsi@vger.kernel.org
-W:	http://www.lsi.com
+W:	http://www.avagotech.com/support/
 S:	Maintained
 F:	Documentation/scsi/megaraid.txt
 F:	drivers/scsi/megaraid.*
--
2.4.11
[PATCH v3 8/8] megaraid_sas: driver version upgrade
Signed-off-by: Sumit Saxena
---
 drivers/scsi/megaraid/megaraid_sas.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index 43fd14f..74c7b44 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -35,8 +35,8 @@
 /*
  * MegaRAID SAS Driver meta data
  */
-#define MEGASAS_VERSION	"06.811.02.00-rc1"
-#define MEGASAS_RELDATE	"April 12, 2016"
+#define MEGASAS_VERSION	"06.812.07.00-rc1"
+#define MEGASAS_RELDATE	"August 22, 2016"

 /*
  * Device IDs
--
2.4.11
[PATCH v3 7/8] megaraid_sas: Do not set MPI2_TYPE_CUDA for JBOD FP path for FW which does not support JBOD sequence map
CC: sta...@vger.kernel.org
Signed-off-by: Sumit Saxena
Reviewed-by: Hannes Reinecke
Reviewed-by: Tomas Henzl
---
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 0edeeb1..1cf0e49 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2005,6 +2005,8 @@ megasas_build_syspd_fusion(struct megasas_instance *instance,
 		io_request->DevHandle = pd_sync->seq[pd_index].devHandle;
 		pRAID_Context->regLockFlags |=
 			(MR_RL_FLAGS_SEQ_NUM_ENABLE|MR_RL_FLAGS_GRANT_DESTINATION_CUDA);
+		pRAID_Context->Type = MPI2_TYPE_CUDA;
+		pRAID_Context->nseg = 0x1;
 	} else if (fusion->fast_path_io) {
 		pRAID_Context->VirtualDiskTgtId = cpu_to_le16(device_id);
 		pRAID_Context->configSeqNum = 0;
@@ -2040,12 +2042,10 @@ megasas_build_syspd_fusion(struct megasas_instance *instance,
 		pRAID_Context->timeoutValue =
 			cpu_to_le16((os_timeout_value > timeout_limit) ?
 				timeout_limit : os_timeout_value);
-	if (fusion->adapter_type == INVADER_SERIES) {
-		pRAID_Context->Type = MPI2_TYPE_CUDA;
-		pRAID_Context->nseg = 0x1;
+	if (fusion->adapter_type == INVADER_SERIES)
 		io_request->IoFlags |= cpu_to_le16(MPI25_SAS_DEVICE0_FLAGS_ENABLED_FAST_PATH);
-	}
+
 	cmd->request_desc->SCSIIO.RequestFlags =
 		(MPI2_REQ_DESCRIPT_FLAGS_FP_IO <<
 		 MEGASAS_REQ_DESCRIPT_FLAGS_TYPE_SHIFT);
--
2.4.11
RE: [PATCH v2 4/7] megaraid_sas: Send SYNCHRONIZE_CACHE command to firmware
> > > Can you please split this into two, so we can backport the bug fix
> > > without any of the feature addition junk?
> >
> > James - I am sending a new patch making two logical fixes as you
> > requested. The JBOD fix will be marked for CC:stable, and for virtual
> > disk I will post only for head (not for stable). Let me know if
> > resending the complete series is good or resending this patch is
> > better.
>
> This is up to Martin, because he has to sort it out, but I think we can
> cope with just the two patches. The bug fix one has to go into the fixes
> branch for immediate application, and the rest of the series will go
> into the misc branch for the 4.10 merge window.

I resent the patch set (v3 series). Martin, please use the latest resent
v3 for the 4.10 merge. A Reviewed-by tag is needed for the below patches,
as they were reworked from the original submission:

[PATCH v3 3/8] megaraid_sas: Do not fire DCMDs during PCI shutdown/detach
[PATCH v3 4/8] megaraid_sas: Send SYNCHRONIZE_CACHE for non-raid to firmware
[PATCH v3 5/8] megaraid_sas: Send SYNCHRONIZE_CACHE for VD to firmware

> James
RE: Device or HBA level QD throttling creates randomness in sequential workload
> On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote:
> > Hi -
> >
> > I found the below conversation, and it is along the same lines as the
> > input I wanted from the mailing list.
> >
> > http://marc.info/?l=linux-kernel&m=147569860526197&w=2
> >
> > I can do testing on any WIP item as Omar mentioned in the above
> > discussion.
> > https://github.com/osandov/linux/tree/blk-mq-iosched

I tried building a kernel using this repo, but it looks like it cannot
boot due to some changes in the block layer.

> Are you using blk-mq for this disk? If not, then the work there won't
> affect you.

Yes, I am using blk-mq for my test. I also confirm that if use_blk_mq is
disabled, the sequential workload issue is not seen and scheduling works
well.

> > Is there any workaround/alternative in the latest upstream kernel, if
> > a user wants to see a limited penalty for a sequential workload on HDD?
> >
> > ` Kashyap
RE: Device or HBA level QD throttling creates randomness in sequential workload
> -Original Message- > From: Omar Sandoval [mailto:osan...@osandov.com] > Sent: Monday, October 24, 2016 9:11 PM > To: Kashyap Desai > Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux- > bl...@vger.kernel.org; ax...@kernel.dk; Christoph Hellwig; > paolo.vale...@linaro.org > Subject: Re: Device or HBA level QD throttling creates randomness in sequetial > workload > > On Mon, Oct 24, 2016 at 06:35:01PM +0530, Kashyap Desai wrote: > > > > > > On Fri, Oct 21, 2016 at 05:43:35PM +0530, Kashyap Desai wrote: > > > > Hi - > > > > > > > > I found below conversation and it is on the same line as I wanted > > > > some input from mailing list. > > > > > > > > http://marc.info/?l=linux-kernel&m=147569860526197&w=2 > > > > > > > > I can do testing on any WIP item as Omar mentioned in above > > discussion. > > > > https://github.com/osandov/linux/tree/blk-mq-iosched > > > > I tried build kernel using this repo, but looks like it is not allowed > > to reboot due to some changes in layer. > > Did you build the most up-to-date version of that branch? I've been force > pushing to it, so the commit id that you built would be useful. > What boot failure are you seeing? Below is latest commit on repo. commit b077a9a5149f17ccdaa86bc6346fa256e3c1feda Author: Omar Sandoval Date: Tue Sep 20 11:20:03 2016 -0700 [WIP] blk-mq: limit bio queue depth I have latest repo from 4.9/scsi-next maintained by Martin which boots fine. Only Delta is " CONFIG_SBITMAP" is enabled in WIP blk-mq-iosched branch. I could not see any meaningful data on boot hang, so going to try one more time tomorrow. > > > > > > > Are you using blk-mq for this disk? If not, then the work there > > > won't > > affect you. > > > > YES. I am using blk-mq for my test. I also confirm if use_blk_mq is > > disable, Sequential work load issue is not seen and scheduling > > works well. > > Ah, okay, perfect. Can you send the fio job file you're using? Hard to tell exactly > what's going on without the details. 
> A sequential workload with just one submitter is about as easy as it
> gets, so this _should_ be behaving nicely.

; setup numa policy for each thread
; 'numactl --show' to determine the maximum numa nodes
[global]
ioengine=libaio
buffered=0
rw=write
bssplit=4K/100
iodepth=256
numjobs=1
direct=1
runtime=60s
allow_mounted_write=0

[job1]
filename=/dev/sdd
..
[job24]
filename=/dev/sdaa

When I tune /sys/module/scsi_mod/parameters/use_blk_mq = 1, below is the
io scheduler detail (it is in blk-mq mode):

/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:none

When I have set /sys/module/scsi_mod/parameters/use_blk_mq = 0, the io
scheduler picked by the SML is:

/sys/devices/pci:00/:00:02.0/:02:00.0/host10/target10:2:13/10:2:13:0/block/sdq/queue/scheduler:noop deadline [cfq]

I see that in blk-mq, performance is very low for the sequential write
workload, and I confirm that blk-mq converts the sequential workload into
a random stream due to the io-scheduler change in blk-mq vs the legacy
block layer.

> > > > Is there any workaround/alternative in latest upstream kernel, if
> > > > user wants to see limited penalty for Sequential Work load on HDD ?
> > > >
> > > > ` Kashyap
>
> P.S., your emails are being marked as spam by Gmail. Actually, Gmail
> seems to mark just about everything I get from Broadcom as spam due to
> failed DMARC.
>
> --
> Omar
RE: Device or HBA level QD throttling creates randomness in sequential workload
Jens - replied inline.

Omar - I tested your WIP repo and figured out that the system hangs only
if I pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works
fine, but I am looking for scsi_mod.use_blk_mq=Y. Also, below is a snippet
of blktrace. In case of a higher per-device QD, I see requeue ('R')
requests in blktrace.

 65,128 10  6268  2.432404509 18594  P   N [fio]
 65,128 10  6269  2.432405013 18594  U   N [fio] 1
 65,128 10  6270  2.432405143 18594  I  WS 148800 + 8 [fio]
 65,128 10  6271  2.432405740 18594  R  WS 148800 + 8 [0]
 65,128 10  6272  2.432409794 18594  Q  WS 148808 + 8 [fio]
 65,128 10  6273  2.432410234 18594  G  WS 148808 + 8 [fio]
 65,128 10  6274  2.432410424 18594  S  WS 148808 + 8 [fio]
 65,128 23  3626  2.432432595 16232  D  WS 148800 + 8 [kworker/23:1H]
 65,128 22  3279  2.432973482     0  C  WS 147432 + 8 [0]
 65,128  7  6126  2.433032637 18594  P   N [fio]
 65,128  7  6127  2.433033204 18594  U   N [fio] 1
 65,128  7  6128  2.433033346 18594  I  WS 148808 + 8 [fio]
 65,128  7  6129  2.433033871 18594  D  WS 148808 + 8 [fio]
 65,128  7  6130  2.433034559 18594  R  WS 148808 + 8 [0]
 65,128  7  6131  2.433039796 18594  Q  WS 148816 + 8 [fio]
 65,128  7  6132  2.433040206 18594  G  WS 148816 + 8 [fio]
 65,128  7  6133  2.433040351 18594  S  WS 148816 + 8 [fio]
 65,128  9  6392  2.433133729     0  C  WS 147240 + 8 [0]
 65,128  9  6393  2.433138166   905  D  WS 148808 + 8 [kworker/9:1H]
 65,128  7  6134  2.433167450 18594  P   N [fio]
 65,128  7  6135  2.433167911 18594  U   N [fio] 1
 65,128  7  6136  2.433168074 18594  I  WS 148816 + 8 [fio]
 65,128  7  6137  2.433168492 18594  D  WS 148816 + 8 [fio]
 65,128  7  6138  2.433174016 18594  Q  WS 148824 + 8 [fio]
 65,128  7  6139  2.433174282 18594  G  WS 148824 + 8 [fio]
 65,128  7  6140  2.433174613 18594  S  WS 148824 + 8 [fio]

CPU0 (sdy):
 Reads Queued:        0,      0KiB   Writes Queued:      79,    316KiB
 Read Dispatches:     0,      0KiB   Write Dispatches:   67, 18,446,744,073PiB
 Reads Requeued:      0               Writes Requeued:    86
 Reads Completed:     0,      0KiB   Writes Completed:   98,    392KiB
 Read Merges:         0,      0KiB   Write Merges:        0,      0KiB
 Read depth:          0               Write depth:         5
 IO unplugs:         79               Timer unplugs:       0

` Kashyap

> -----Original Message-----
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, October 31, 2016 10:54 PM
> To: Kashyap Desai; Omar Sandoval
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; linux-bl...@vger.kernel.org; Christoph Hellwig; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in sequential workload
>
> Hi,
>
> One guess would be that this isn't around a requeue condition, but rather the
> fact that we don't really guarantee any sort of hard FIFO behavior between the
> software queues. Can you try this test patch to see if it changes the behavior for
> you? Warning: untested...

Jens - I tested the patch, but I still see a random IO pattern for an expected sequential run. I am intentionally exercising the requeue case and see the issue at requeue time. If there is no requeue, I see no issue at the LLD.

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f3d27a6dee09..5404ca9c71b2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -772,6 +772,14 @@ static inline unsigned int queued_to_index(unsigned int queued)
>  	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
>  }
>
> +static int rq_pos_cmp(void *priv, struct list_head *a, struct list_head *b)
> +{
> +	struct request *rqa = container_of(a, struct request, queuelist);
> +	struct request *rqb = container_of(b, struct request, queuelist);
> +
> +	return blk_rq_pos(rqa) < blk_rq_pos(rqb);
> +}
> +
>  /*
>   * Run this hardware queue, pulling any software queues mapped to it in.
>   * Note that this function currently has various problems around ordering
> @@ -812,6 +820,14 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	}
>
>  	/*
> +	 * If the device is rotational, sort the list sanely to avoid
> +	 * unecessary seeks. The software queues are roughly FIFO, but
> +	 * only roughly, there are no hard guarantees.
> +	 */
> +	if (!blk_queue_nonrot(q))
> +		list_sort(NULL, &rq_list, rq_pos_cmp);
> +
> +	/*
>  	 * Start off with dptr being NULL, so we start the first request
>  	 * immediately, even if we have more pending.
>  	 */
>
> --
> Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, November 09, 2016 4:45 AM
> To: Jens Axboe
> Cc: linux-scsi; Kashyap Desai; Martin K. Petersen
> Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
>
> >>>>> "Jens" == Jens Axboe writes:
>
> Jens> I wasted half a day on this, thinking it was something in my
> Jens> 4.10 branches. But it turns out it is not, the regression is in
> Jens> mainline.

Jens - sorry for the trouble. I did not validate this single patch; I validated the complete patch set. The issue is improper usage of the MEGASAS_IS_LOGICAL macro, which yields an incorrect check condition in the qcmd path. Below is the proposed fix.

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index 74c7b44..0d2625b 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2236,7 +2236,7 @@ struct megasas_instance_template { };

 #define MEGASAS_IS_LOGICAL(scp) \
-	(scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1
+	((scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1)

 #define MEGASAS_DEV_INDEX(scp) \
 	(((scp->device->channel % 2) * MEGASAS_MAX_DEV_PER_CHANNEL) + \

> Kashyap, have you tested the stable fix without the remainder of the driver
> update in place?

Martin - I validated the whole series. Apologies for this. Please advise how to proceed: do I need to send just the fix (the macro definition change posted above) on top of the latest commit for this issue?

> --
> Martin K. Petersen	Oracle Linux Engineering

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [REGRESSION] 4.9-rc4+ doesn't boot on my test box
> -Original Message- > From: Jens Axboe [mailto:ax...@kernel.dk] > Sent: Wednesday, November 09, 2016 6:50 AM > To: Kashyap Desai; Martin K. Petersen > Cc: linux-scsi > Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box > > On 11/08/2016 05:20 PM, Kashyap Desai wrote: > >> -Original Message- > >> From: Martin K. Petersen [mailto:martin.peter...@oracle.com] > >> Sent: Wednesday, November 09, 2016 4:45 AM > >> To: Jens Axboe > >> Cc: linux-scsi; Kashyap Desai; Martin K. Petersen > >> Subject: Re: [REGRESSION] 4.9-rc4+ doesn't boot on my test box > >> > >>>>>>> "Jens" == Jens Axboe writes: > >> > >> Jens> I wasted half a day on this, thinking it was something in my > >> Jens> 4.10 branches. But it turns out it is not, the regression is in > >> Jens> mainline. > > > > Jens - > > > > Sorry for trouble. I did not validated this single patch. I validated > > complete patch set. > > Issue is improper MACRO usage MEGASAS_IS_LOGICAL, which gives > > incorrect check condition in qcmd Path. > > > > Below is proposed fix. > > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h > > b/drivers/scsi/megaraid/megaraid_sas.h > > index 74c7b44..0d2625b 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas.h > > +++ b/drivers/scsi/megaraid/megaraid_sas.h > > @@ -2236,7 +2236,7 @@ struct megasas_instance_template { }; > > > > #define MEGASAS_IS_LOGICAL(scp) > > \ > > - (scp->device->channel < MEGASAS_MAX_PD_CHANNELS) ? 0 : 1 > > Ugh... So we're completing everything immediately. > > > Martin - I validated whole series. Apologies for this. > > > > Please help me to know how to fix this ? Do I need to send only fix on > > top of latest commit (as posted above - MACRO definition) for this issue > > ? > > Send a fix on top of current -git asap. The current tree is completely > broken for > any megaraid user. -rc4 is no time to send in untested patches, especially > not > something that claims to fix a 9 year old bug and is marked for stable as > well. 
I will run some more tests (using only the patches from the last series that are marked for stable) and submit a fix ASAP.

> --
> Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Delivery Status Notification for linuxr...@lsi.com
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Paul Menzel
> Sent: Thursday, November 10, 2016 7:40 PM
> To: Martin K. Petersen
> Cc: linux-scsi@vger.kernel.org; James E.J. Bottomley
> Subject: Re: Delivery Status Notification for linuxr...@lsi.com
>
> Dear Martin,
>
> On 11/10/16 15:07, Martin K. Petersen wrote:
> >> "Paul" == Paul Menzel writes:
>
> Paul> Probably you know it already, but the listed email address of
> Paul> the 3WARE SCSI drivers maintainer linuxr...@lsi.com doesn’t work
> Paul> (for me).
>
> > Ownership of these products is now with Broadcom. To my knowledge the
> > 3ware product lines have been discontinued.
>
> Indeed. I forgot to actually formulate the intend of my message.
>
> What should happen to the entry in the file `MAINTAINERS`?

The 3ware product line is discontinued from a support point of view. We are discussing with the LSI/Broadcom management team how to clean this up and provide more clarity. Please wait for a reply while the internal communication completes.

> Kind regards,
>
> Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to avoid double lock
> -Original Message- > From: i...@itu.dk [mailto:i...@itu.dk] > Sent: Monday, October 17, 2016 1:00 PM > To: Jiri Kosina > Cc: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley; Martin K. > Petersen; megaraidlinux@avagotech.com; linux-scsi@vger.kernel.org; Iago > Abal > Subject: [PATCH] Fix: scsi: megaraid: reduce the scope of pending-list lock to > avoid double lock > > From: Iago Abal > > The EBA code analyzer (https://github.com/models-team/eba) reported the > following double lock: > > 1. In function `megaraid_reset_handler' at 2571; > 2. take `&adapter->pend_list_lock' for the first time at 2602: > >// FIRST >spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags); > > 3. enter the `list_for_each_entry_safe' loop at 2603; > 4. call `megaraid_mbox_mm_done' at 2616; > 5. call `megaraid_mbox_runpendq' at 3782; > 6. take `&adapter->pend_list_lock' for the second time at 1892: > >// SECOND: DOUBLE LOCK !!! >spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags); > > From my shallow understanding of the code (so please review carefully), I think > that it is not necessary to hold `PENDING_LIST_LOCK(adapter)' while executing > the body of the `list_for_each_entry_safe' loop. I assume this because both > `megaraid_mbox_mm_done' and `megaraid_dealloc_scb' are called from > several places where, as far as I can tell, this lock is not hold. In fact, as reported > by EBA, at some point `megaraid_mbox_mm_done' will acquire this lock again. 
> > Fixes: c005fb4fb2d2 ("[SCSI] megaraid_{mm,mbox}: fix a bug in reset handler")
> Signed-off-by: Iago Abal
> ---
>  drivers/scsi/megaraid/megaraid_mbox.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/scsi/megaraid/megaraid_mbox.c b/drivers/scsi/megaraid/megaraid_mbox.c
> index f0987f2..7f11898 100644
> --- a/drivers/scsi/megaraid/megaraid_mbox.c
> +++ b/drivers/scsi/megaraid/megaraid_mbox.c
> @@ -2603,6 +2603,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>  	list_for_each_entry_safe(scb, tmp, &adapter->pend_list, list) {
>  		list_del_init(&scb->list);	// from pending list
>
> +		spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);
>  		if (scb->sno >= MBOX_MAX_SCSI_CMDS) {
>  			con_log(CL_ANN, (KERN_WARNING
>  			"megaraid: IOCTL packet with %d[%d:%d] being reset\n",
> @@ -2630,6 +2631,7 @@ static DEF_SCSI_QCMD(megaraid_queue_command)
>
>  		megaraid_dealloc_scb(adapter, scb);
>  	}
> +	spin_lock_irqsave(PENDING_LIST_LOCK(adapter), flags);
>  }
>  spin_unlock_irqrestore(PENDING_LIST_LOCK(adapter), flags);

Sorry for the delay. We had an internal discussion and confirmed that it is safe to remove the mbox driver from mainline, as this product is discontinued. We are planning to post a patch removing the megaraid mbox driver from mainline.

> --
> 1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Reduced latency is killing performance
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Jens Axboe > Sent: Thursday, November 10, 2016 10:18 PM > To: Hannes Reinecke; Christoph Hellwig > Cc: SCSI Mailing List; linux-bl...@vger.kernel.org > Subject: Re: Reduced latency is killing performance > > On 11/10/2016 09:04 AM, Hannes Reinecke wrote: > > Hi all, > > > > this really feels like a follow-up to the discussion we've had in > > Santa Fe, but finally I'm able to substantiate it with some numbers. > > > > I've made a patch to enable the megaraid_sas driver for multiqueue. > > While this is pretty straightforward (I'll be sending the patchset > > later on), the results are ... interesting. > > > > I've run the 'ssd-test.fio' script from Jens' repository, and these > > results for MQ/SQ (- is mq, + is sq): > > > > Run status group 0 (all jobs): > > - READ: io=10641MB, aggrb=181503KB/s, minb=181503KB/s, > > maxb=181503KB/s, mint=60033msec, maxt=60033msec > > + READ: io=18370MB, aggrb=312572KB/s, minb=312572KB/s, > > maxb=312572KB/s, mint=60181msec, maxt=60181msec > > > > Run status group 1 (all jobs): > > - READ: io=441444KB, aggrb=7303KB/s, minb=7303KB/s, maxb=7303KB/s, > > mint=60443msec, maxt=60443msec > > + READ: io=223108KB, aggrb=3707KB/s, minb=3707KB/s, maxb=3707KB/s, > > mint=60182msec, maxt=60182msec > > > > Run status group 2 (all jobs): > > - WRITE: io=22485MB, aggrb=383729KB/s, minb=383729KB/s, > > maxb=383729KB/s, mint=60001msec, maxt=60001msec > > + WRITE: io=47421MB, aggrb=807581KB/s, minb=807581KB/s, > > maxb=807581KB/s, mint=60129msec, maxt=60129msec > > > > Run status group 3 (all jobs): > > - WRITE: io=489852KB, aggrb=8110KB/s, minb=8110KB/s, maxb=8110KB/s, > > mint=60399msec, maxt=60399msec > > + WRITE: io=489748KB, aggrb=8134KB/s, minb=8134KB/s, maxb=8134KB/s, > > mint=60207msec, maxt=60207msec > > > > Disk stats (read/write): > > - sda: ios=2834412/5878578, merge=0/0, ticks=86269292/48364836, > > in_queue=135345876, 
util=99.20%
> > + sda: ios=205278/2680329, merge=4552593/9580622, ticks=12539912/12965228, in_queue=25512312, util=99.59%
> >
> > As you can see, we're really losing performance in the multiqueue case.
> > And the main reason for that is that we submit about _10 times_ as
> > much I/O as we do for the single-queue case.
>
> What's the setup like? I'm going to need more details.
>
> The baseline test is using the legacy path, single queue. The new test is multiqueue, scsi-mq. What's sda?

Hannes - please share your setup/config details so that I can also validate and post my findings.

Kashyap

> > So I guess having an I/O scheduler is critical, even for the scsi-mq
> > case.
>
> Each of these sections is a single job. For some reason we are not merging as
> well as we should, that's the reason for the performance loss. In fact, we're not
> merging at all. That's not IO scheduling.
>
> --
> Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [GIT PULL] SCSI fixes for 4.9-rc3
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Gabriel C > Sent: Friday, November 11, 2016 9:40 AM > To: James Bottomley; Andrew Morton; Linus Torvalds > Cc: linux-scsi; linux-kernel; sta...@vger.kernel.org > Subject: Re: [GIT PULL] SCSI fixes for 4.9-rc3 > > > > On 11.11.2016 04:30, Gabriel C wrote: > > > > On 05.11.2016 14:29, James Bottomley wrote: > > > > > > ... > > > >> Kashyap Desai (1): > >> scsi: megaraid_sas: Fix data integrity failure for JBOD > >> (passthrough) devices > >> > >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > >> b/drivers/scsi/megaraid/megaraid_sas_base.c > >> index 9ff57de..d8b1fbd 100644 > >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c > >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > >> @@ -1700,16 +1700,13 @@ megasas_queue_command(struct Scsi_Host > *shost, struct scsi_cmnd *scmd) > >>goto out_done; > >>} > >> > >> - switch (scmd->cmnd[0]) { > >> - case SYNCHRONIZE_CACHE: > >> - /* > >> - * FW takes care of flush cache on its own > >> - * No need to send it down > >> - */ > >> + /* > >> + * FW takes care of flush cache on its own for Virtual Disk. > >> + * No need to send it down for VD. For JBOD send > SYNCHRONIZE_CACHE to FW. > >> + */ > >> + if ((scmd->cmnd[0] == SYNCHRONIZE_CACHE) && > >> +MEGASAS_IS_LOGICAL(scmd)) { > >>scmd->result = DID_OK << 16; > >>goto out_done; > >> - default: > >> - break; > >>} > >> > >>return instance->instancet->build_and_issue_cmd(instance, scmd); > > > > This patch breaks my box.. I'm not able to boot it anymore. > > It seems with this patch I have /dev/sda[a-z] to /dev/sdz[a-z] ?!? > > > > I'm not sure how to get an log since dracut times out and I'm dropped > > , after a very long time of probing 'ghost devices', in a emercency > > shell, > journalctl doesn't work also.. > > > > After reverting this one I can boot normal. > > > > Box is a FUJITSU PRIMERGY TX200 S5.. Please check now commit. 
The commit below has the complete fix.
http://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/?id=5e5ec1759dd663a1d5a2f10930224dd009e500e8

> > This is from an working kernel..
> >
> > [5.119371] megaraid_sas :01:00.0: FW now in Ready state
> > [5.119418] megaraid_sas :01:00.0: firmware supports msix : (0)
> > [5.119420] megaraid_sas :01:00.0: current msix/online cpus : (1/16)
> > [5.119422] megaraid_sas :01:00.0: RDPQ mode : (disabled)
> > [5.123100] ehci-pci :00:1a.7: cache line size of 32 is not supported
> > [5.123113] ehci-pci :00:1a.7: irq 18, io mem 0xb002
> >
> > ...
> >
> > [5.208063] megaraid_sas :01:00.0: controller type : MR(256MB)
> > [5.208065] megaraid_sas :01:00.0: Online Controller Reset(OCR) : Enabled
> > [5.208067] megaraid_sas :01:00.0: Secure JBOD support : No
> > [5.208070] megaraid_sas :01:00.0: megasas_init_mfi: fw_support_ieee=0
> > [5.208073] megaraid_sas :01:00.0: INIT adapter done
> > [5.208075] megaraid_sas :01:00.0: Jbod map is not supported megasas_setup_jbod_map 4967
> > [5.230163] megaraid_sas :01:00.0: MR_DCMD_PD_LIST_QUERY failed/not supported by firmware
> > [5.252080] megaraid_sas :01:00.0: DCMD not supported by firmware - megasas_ld_list_query 4369
> > [5.274086] megaraid_sas :01:00.0: pci id: (0x1000)/(0x0060)/(0x1734)/(0x10f9)
> > [5.274089] megaraid_sas :01:00.0: unevenspan support: no
> > [5.274090] megaraid_sas :01:00.0: firmware crash dump : no
> > [5.274092] megaraid_sas :01:00.0: jbod sync map : no
> > [5.274094] scsi host0: Avago SAS based MegaRAID driver
> > [5.280022] scsi 0:0:6:0: Direct-Access ATA WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
> > [5.282153] scsi 0:0:7:0: Direct-Access ATA WDC WD5002ABYS-5 3B06 PQ: 0 ANSI: 5
> > [5.285180] scsi 0:0:10:0: Direct-Access ATA ST500NM0011 FTM6 PQ: 0 ANSI: 5
> > [5.369885] scsi 0:2:0:0: Direct-Access LSI MegaRAID SAS RMB 1.40 PQ: 0 ANSI: 5
> >
> > ..
> > > > Please let me know if you need more infos and/or want me to test > > patches. > > > > > > I managed to get some parts of the broken dmesg. There it is : > > http://ftp.frugalware.org/pub/other/people/crazy/kernel/broken-dmesg > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of > a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] hpsa: scsi-mq support
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Sunday, November 13, 2016 5:29 PM
> To: Hannes Reinecke
> Cc: Christoph Hellwig; Hannes Reinecke; Christoph Hellwig; Martin K. Petersen; James Bottomley; Don Brace; linux-scsi@vger.kernel.org; Jens Axboe
> Subject: Re: [PATCH] hpsa: scsi-mq support
>
> On Sun, Nov 13, 2016 at 10:44:47AM +0100, Hannes Reinecke wrote:
> > One day to mark with bright red in the calendar.
> >
> > Christoph Hellwig is telling me _NOT_ to use scsi-mq.
>
> That's not what I'm doing.
>
> > This patch was done so see what would needed to be done to convert a
> > legacy driver.
> > As I was under the impression that scsi-mq is the way forward, seeing
> > that it should be enabled per default.
> > But I must have been mistaken. Apparently.
>
> What I am doing is to tell you you should not expose multiple queues unless the
> hardware actually has multiple submissions queues. The blk-mq and scsi-mq
> code works just fine with a single submission queue, and the hpsa driver in
> particular works really well with scsi-mq and a single submission queue. E.g. the
> RAID HBA on slide 19 of this presentations is an hpsa one:

I have similar results for MegaRAID, where the MR driver shows a significant improvement with a single submission queue and multiple completion queues. Having said that, scsi-mq enabled with a single queue is more than enough to maximize the improvement from SCSI-MQ. The major advantage was seen once the IO load crossed the boundary of a physical CPU socket. From this discussion I understood that similar logical changes were proposed for megaraid_sas, and we are not really going to gain from fake multiple submission queues exposed to the SML.
Kashyap

> http://events.linuxfoundation.org/sites/events/files/slides/scsi.pdf

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 4/5] megaraid_sas: scsi-mq support
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Friday, November 11, 2016 3:15 PM
> To: Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-s...@vger.kernel.org; Hannes Reinecke; Hannes Reinecke
> Subject: [PATCH 4/5] megaraid_sas: scsi-mq support
>
> This patch enables full scsi multiqueue support. But as I have been seeing
> performance regressions I've also added a module parameter 'use_blk_mq',
> allowing multiqueue to be switched off if required.

Hannes - as discussed in the hpsa driver thread about similar support, I don't think these code changes help MR either.

> Signed-off-by: Hannes Reinecke
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 22 +
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 38 +
>  2 files changed, 50 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
> index d580406..1af47e6 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -48,6 +48,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>  #include
> @@ -104,6 +105,9 @@
>  module_param(scmd_timeout, int, S_IRUGO);
>  MODULE_PARM_DESC(scmd_timeout, "scsi command timeout (10-90s), default 90s.
See megasas_reset_timer."); > > +bool use_blk_mq = true; > +module_param_named(use_blk_mq, use_blk_mq, bool, 0); > + > MODULE_LICENSE("GPL"); > MODULE_VERSION(MEGASAS_VERSION); > MODULE_AUTHOR("megaraidlinux@avagotech.com"); > @@ -1882,6 +1886,17 @@ static void megasas_slave_destroy(struct > scsi_device *sdev) > sdev->hostdata = NULL; > } > > +static int megasas_map_queues(struct Scsi_Host *shost) { > + struct megasas_instance *instance = (struct megasas_instance *) > + shost->hostdata; > + > + if (!instance->ctrl_context) > + return 0; > + > + return blk_mq_pci_map_queues(&shost->tag_set, instance->pdev, 0); } > + > /* > * megasas_complete_outstanding_ioctls - Complete outstanding ioctls after a > * kill adapter > @@ -2986,6 +3001,7 @@ struct device_attribute *megaraid_host_attrs[] = { > .slave_configure = megasas_slave_configure, > .slave_alloc = megasas_slave_alloc, > .slave_destroy = megasas_slave_destroy, > + .map_queues = megasas_map_queues, > .queuecommand = megasas_queue_command, > .eh_target_reset_handler = megasas_reset_target, > .eh_abort_handler = megasas_task_abort, @@ -5610,6 +5626,12 @@ > static int megasas_io_attach(struct megasas_instance *instance) > host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL; > host->max_lun = MEGASAS_MAX_LUN; > host->max_cmd_len = 16; > + host->nr_hw_queues = instance->msix_vectors ? 
> + instance->msix_vectors : 1; > + if (use_blk_mq) { > + host->can_queue = instance->max_scsi_cmds / host- > >nr_hw_queues; > + host->use_blk_mq = 1; > + } > > /* >* Notify the mid-layer about the new controller diff --git > a/drivers/scsi/megaraid/megaraid_sas_fusion.c > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > index eb3cb0f..aba53c0 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > @@ -1703,6 +1703,7 @@ static int megasas_create_sg_sense_fusion(struct > megasas_instance *instance) > struct IO_REQUEST_INFO io_info; > struct fusion_context *fusion; > struct MR_DRV_RAID_MAP_ALL *local_map_ptr; > + u16 msix_index; > u8 *raidLUN; > > device_id = MEGASAS_DEV_INDEX(scp); > @@ -1792,11 +1793,13 @@ static int megasas_create_sg_sense_fusion(struct > megasas_instance *instance) > fp_possible = io_info.fpOkForIo; > } > > - /* Use raw_smp_processor_id() for now until cmd->request->cpu is CPU > -id by default, not CPU group id, otherwise all MSI-X queues won't > -be utilized */ > - cmd->request_desc->SCSIIO.MSIxIndex = instance->msix_vectors ? > - raw_smp_processor_id() % instance->msix_vectors : 0; > + if (shost_use_blk_mq(instance->host)) { > + u32 blk_tag = blk_mq_unique_tag(scp->request); > + msix_index = blk_mq_unique_tag_to_hwq(blk_tag); > + } else > + msix_index = instance->msix_vectors ? > + raw_smp_processor_id() % instance->msix_vectors : 0; > + cmd->request_desc->SCSIIO.MSIxIndex = msix_index; > > if (fp_possible) { > megasas_set_pd_lba(io_request, scp->cmd_len, &io_info, scp, > @@ -1971,6 +1974,7 @@ static void megasas_build_ld_nonrw_fusion(struct > megasas_instance *instance, > struct RAID_CONTEXT *pRAID_Context; > struct MR_PD_CFG_SEQ_NUM_SYNC *pd_sync; > struct fusion_context
RE: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-ow...@vger.kernel.org] On Behalf Of Tomas Henzl
> Sent: Friday, November 18, 2016 9:23 PM
> To: Hannes Reinecke; Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-s...@vger.kernel.org; Hannes Reinecke
> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes
>
> On 11.11.2016 10:44, Hannes Reinecke wrote:
> > The megaraid_sas HBA only has a single register for I/O submission,
> > which will be hit pretty hard with scsi-mq. To ensure that the PCI
> > writes have made it across we need to add a mmio barrier after each
> > write; otherwise I've been seeing spurious command completions and I/O
> > stalls.
>
> Why is it needed that the PCI write reaches the hw exactly at this point?
> Is it possible that this is a hw deficiency like that the hw can't handle
> communication without tiny pauses, and so possible to remove in next
> generation?
> Thanks,
> Tomas

I think it is good to have mmiowb here, as we already do it for the writel() version of megasas_return_cmd_fusion. Maybe it is not needed for x86, but it is useful on some other CPU architectures. I think the issue became more evident once Hannes' scsi-mq patch exposed more than one submission queue. This patch is probably good: an intermediate PCI device (a PCI bridge?) may cache PCI packets, and out-of-order PCI packets reaching the MegaRAID HBA could cause this type of spurious completion.
> > > > > Signed-off-by: Hannes Reinecke > > --- > > drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > > index aba53c0..729a654 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > > @@ -196,6 +196,7 @@ inline void megasas_return_cmd_fusion(struct > megasas_instance *instance, > > le32_to_cpu(req_desc->u.low)); > > > > writeq(req_data, &instance->reg_set->inbound_low_queue_port); > > + mmiowb(); > > #else > > unsigned long flags; > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of > a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, November 30, 2016 12:50 AM
> To: Colin King; Kashyap Desai; Sumit Saxena; Shivasharan S; James E . J . Bottomley; Martin K . Petersen; megaraidlinux@broadcom.com; linux-s...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Subject: Re: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"
>
> On 11/29/2016 11:13 AM, Colin King wrote:
> > Trivial fix to spelling mistake "oustanding" to "outstanding" in
> > dev_info and scmd_printk messages. Also join wrapped literal string in
> > the scmd_printk.
>
> Reviewed-by: Bart Van Assche

Please hold this commit, as we have a patch set in progress for the MegaRAID driver that will conflict with this patch. We will re-submit this patch with the appropriate Signed-off-by and Submitted-by tags.

Thanks,
Kashyap

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Monday, November 21, 2016 9:27 PM > To: Kashyap Desai; Hannes Reinecke; Martin K. Petersen > Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux- > s...@vger.kernel.org; Hannes Reinecke > Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after register > writes > > On 18.11.2016 17:48, Kashyap Desai wrote: > >> -Original Message- > >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > >> ow...@vger.kernel.org] On Behalf Of Tomas Henzl > >> Sent: Friday, November 18, 2016 9:23 PM > >> To: Hannes Reinecke; Martin K. Petersen > >> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux- > >> s...@vger.kernel.org; Hannes Reinecke > >> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after > >> register > > writes > >> On 11.11.2016 10:44, Hannes Reinecke wrote: > >>> The megaraid_sas HBA only has a single register for I/O submission, > >>> which will be hit pretty hard with scsi-mq. To ensure that the PCI > >>> writes have made it across we need to add a mmio barrier after each > >>> write; otherwise I've been seeing spurious command completions and > >>> I/O stalls. > >> Why is it needed that the PCI write reaches the hw exactly at this > > point? > >> Is it possible that this is a hw deficiency like that the hw can't > > handle > >> communication without tiny pauses, and so possible to remove in next > >> generation? > >> Thanks, > >> Tomas > > I think this is good to have mmiowb as we are already doing for > > writel() version of megasas_return_cmd_fusion. > > May be not for x86, but for some other CPU arch it is useful. I think > > it become more evident while scs-mq support for more than one > > submission queue patch of Hannes expose this issue. > > > > Probably this patch is good. Intermediate PCI device (PCI bridge ?) > > may cache PCI packet and eventually out of order PCI packet to > > MegaRAID HBA can cause this type of spurious completion. 
> Usually drivers do not add a write barrier after every pci write, unless like here in
> megasas_fire_cmd_fusion in the 32bit part where are two paired writes and it
> must be ensured that the pair arrives without any other write in between.
>
> Why is it wrong when a pci write is overtaken by another write or when happens
> a bit later and if it is wrong - don't we need an additional locking too ?
> The execution of megasas_fire_cmd_fusion might be interrupted and a delay
> can happen at any time.

Since Hannes mentioned that his mq megaraid_sas experiment, which exposes more submission queues to the SML, caused invalid/spurious completions in his code, I am trying to understand whether "mmiowb" after the writeq() call is safe. My understanding is that "writeq" is atomic and its two 32-bit PCI writes reach the PCI end device in sequence, assuming there is no PCI-level caching on the Intel x86 platform. E.g. if two CPUs execute writeq(), the PCI writes should always reach the end device in the same sequence.

Assume Tag-1, Tag-2, Tag-3, Tag-4 is the expected sequence. If, due to some system-level optimization, the device sees Tag-1, Tag-3, Tag-2, Tag-4 arrive, then we may see the issue Hannes experienced. We have seen very rare instances of dual/spurious SMID completion with the released megaraid_sas driver, and they are not easy to reproduce... so along those lines, is it easier to reproduce such an issue by opening more submission queues to the SML (just to trigger the spurious completion more effectively)?

We will apply all the patches posted by Hannes *just* to understand this particular spurious completion issue and in what conditions it has an impact. We will report back on whether this mmiowb() call after writeq() is good to have.
~ Kashyap > > tomash > > > > >>> Signed-off-by: Hannes Reinecke > >>> --- > >>> drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 + > >>> 1 file changed, 1 insertion(+) > >>> > >>> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > >>> b/drivers/scsi/megaraid/megaraid_sas_fusion.c > >>> index aba53c0..729a654 100644 > >>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > >>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > >>> @@ -196,6 +196,7 @@ inline void megasas_return_cmd_fusion(struct > >> megasas_instance *instance, > >>> le32_to_cpu(req_desc->u.low)); > >>> > >>> writeq(req_data, &instance->re
RE: [PATCH] Update 3ware driver email addresses
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, December 08, 2016 4:43 AM
> To: adam radford
> Cc: linux-scsi; Kashyap Desai
> Subject: Re: [PATCH] Update 3ware driver email addresses
>
> >>>>> "Adam" == adam radford writes:
>
> Adam,
>
> Adam> This maintainers/email update patch didn't get picked up. Do I
> Adam> need to fix it or re-send ?
>
> I still have it in the queue. Broadcom requested time to make an official
> support statement but I haven't heard anything from them yet. Kashyap?

Martin - the official support statement from Broadcom: LSI/Broadcom has stopped supporting 3ware controllers. If Adam volunteers to keep the driver alive, we would like to remove the www.lsi.com references and make it purely his driver.

Adam - can you resend the patch with "www.lsi.com" removed from the 3ware drivers? Sumit and I will ack it, and Martin can pick it up for the next release.

> --
> Martin K. Petersen	Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the
> body of a message to majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html
RE: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Hannes Reinecke > Sent: Wednesday, December 14, 2016 9:07 PM > To: Sumit Saxena; linux-scsi > Subject: Re: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI > commands to be retried > > On 12/13/2016 02:19 PM, Sumit Saxena wrote: > > Hi all, > > > > I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET > > returned by LLD to SCSI mid layer. > > > > Let me give some background here. > > I am using megaraid_sas controller. megaraid_sas driver returns all > > outstanding SCSI commands back to SCSI layer with DID_RESET host_byte > > before resetting controller. > > The intent is- all these commands returned with DID_RESET before > > controller reset should be retried by SCSI. > > > > In few distros, I have observed that if SYNCHRONIZE_CACHE > > command(should be applicable for all non Read write commands) is > > outstanding before resetting controller and driver returns those > > commands back with DID_RESET then SYNCHRONIZE_CACHE command not > > retried but failed to upper layer but other READ/WRITE commands were > > not failed but retried. I was running filesystem IOs and > > SYNHRONIZE_CACHE returned with error cause filesystem to go in READ > > only mode. > > > > Later I checked and realized below commit was missing in that distro > > kernel and seems to cause the problem- > > > > a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS > > commands > > > > But distro kernel has below commit- > > > > 89fb4cd scsi: handle flush errors properly > > > > Issue does not hit on the kernels which don't have both commits but > > hits when commit- "89fb4cd scsi: handle flush errors properly " is > > there but commit- "a621bac scsi_lib: correctly retry failed zero > > length REQ_TYPE_FS commands" is missing. 
> > > > This issue is observed with mpt3sas driver as well and should be > > applicable to all LLDs returning non read write commands with DID_RESET. > > > > Returning DID_REQUEUE instead of DID_RESET from driver solves the > > problem irrespective of whether these above mentioned commits are > > there or not in kernel. > > So I am thinking to use DID_REQUEUE instead of DID_RESET in > > megaraid_sas driver for all SCSI commands(not only limited to > > SYNCHRONIZE_CACHE or non read write commands) before resetting > > controller. As I mentioned intent is those outstanding commands > > returned by driver before doing controller reset should be retried and > > as soon as reset is complete, driver will be processing those > > commands. There is no sense key associated with SCSI commands > > returned. > > > > I browsed SCSI code and get to know DID_REQUEUE causes command to be > > retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY). > > This seems to be good enough for our requirement of commands to be > > re-tried by SCSI layer. > > > > Please provide feedback if anyone forsee any issue with using > > DID_REQUEUE for this use case. > > I will be doing some testing with DID_REQUEUE in place of DID_RESET in > > megaraid_sas driver. > > > > I have attached here a proposed patch for megaraid_sas driver. > > If there are no concerns, I will send this patch to SCSI mailing list. > > > Hmm. > > DID_RESET is meant to be an indicator to the SCSI midlayer that the host / > device was reset, and the command _should_ be retried. > DID_REQUEUE OTOH is an indicator to the SCSI midlayer to retry the > command. > > The problem with DID_RESET is that it's slightly underspecified; there is > no > indication if the command has been processed (and if so, up to which > parts?) > DID_REQUEUE, OTOH, will always cause a retry. > > So yeah, I guess DID_REQUEUE is a better choice here. Thanks Hannes. 
We agree that DID_REQUEUE is the better choice and are planning to move to it with the proposed patch. Thanks for your feedback. We are queuing this change for internal testing now and will send the final patch upstream.

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke	Teamlead Storage & Networking
> h...@suse.de	+49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
[PATCH] preview - block layer help to detect sequential IO
The objective of this patch is to move the code the bcache module uses to detect IO streams into the block layer. Reference code: check_should_bypass() in drivers/md/bcache/request.c. This is a high-level patch for review, to understand whether it is worth pursuing.

As of now only the bcache module uses this logic, but it would be good to have it in the block layer with a function exposed for external use. In this patch, I move the sequential-IO search logic into the block layer and expose the function blk_queue_rq_seq_cutoff(); a low-level driver just needs to call it if it wants stream detection on its request queue. For my testing I added a blk_queue_rq_seq_cutoff(sdev->request_queue, 4) call in the megaraid_sas driver.

In general, the bcache module's code was used as the reference; it does almost the same thing we want to do in the megaraid_sas driver in the patch below:

http://marc.info/?l=linux-scsi&m=148245616108288&w=2

The bcache implementation uses a search algorithm (hashed on the bio start sector) and detects up to 128 streams. bcache uses it to keep sequential IO off the SSD and send it directly to the HDD.

Would it be a good design to keep this algorithm open in the block layer (as proposed in the patch)?
Signed-off-by: Kashyap Desai
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..2e93d14 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
 	int err;
+	struct seq_io_tracker *io;
 
 	q = kmem_cache_alloc_node(blk_requestq_cachep,
 				gfp_mask | __GFP_ZERO, node_id);
@@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (blkcg_init_queue(q))
 		goto fail_ref;
 
+	q->sequential_cutoff = 0;
+	spin_lock_init(&q->io_lock);
+	INIT_LIST_HEAD(&q->io_lru);
+
+	for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
+		list_add(&io->lru, &q->io_lru);
+		hlist_add_head(&io->hash, q->io_hash + BLK_RECENT_IO);
+	}
+
 	return q;
@@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
 	return 0;
 }
 
+static void add_sequential(struct task_struct *t)
+{
+#define blk_ewma_add(ewma, val, weight, factor)	\
+({						\
+	(ewma) *= (weight) - 1;			\
+	(ewma) += (val) << factor;		\
+	(ewma) /= (weight);			\
+	(ewma) >> factor;			\
+})
+
+	blk_ewma_add(t->sequential_io_avg,
+		     t->sequential_io, 8, 0);
+
+	t->sequential_io = 0;
+}
+
+static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
+{
+	return &q->io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -1884,6 +1914,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
 	int err = -EIO;
 	char b[BDEVNAME_SIZE];
 	struct hd_struct *part;
+	struct task_struct *task = current;
 
 	might_sleep();
@@ -1957,6 +1988,42 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
 	if (!blkcg_bio_issue_check(q, bio))
 		return false;
 
+	if (q->sequential_cutoff) {
+		struct seq_io_tracker *i;
+		unsigned sectors;
+
+		spin_lock(&q->io_lock);
+
+		hlist_for_each_entry(i, blk_iohash(q, bio->bi_iter.bi_sector), hash)
+			if (i->last == bio->bi_iter.bi_sector &&
+			    time_before(jiffies, i->jiffies))
+				goto found;
+
+		i = list_first_entry(&q->io_lru, struct seq_io_tracker, lru);
+
+		add_sequential(task);
+		i->sequential = 0;
+found:
+		if (i->sequential + bio->bi_iter.bi_size > i->sequential)
+			i->sequential += bio->bi_iter.bi_size;
+
+		i->last = bio_end_sector(bio);
+		i->jiffies = jiffies + msecs_to_jiffies(5000);
+		task->sequential_io = i->sequential;
+
+		hlist_del(&i->hash);
+		hlist_add_head(&i->hash, blk_iohash(q, i->last));
+		list_move_tail(&i->lru, &q->io_lru);
+
+		spin_unlock(&q->io_lock);
+
+		sectors = max(task->sequential_io,
+			      task->sequential_io_avg) >> 9;
+
RE: [PATCH] preview - block layer help to detect sequential IO
> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Thursday, January 12, 2017 1:18 AM
> To: Kashyap Desai
> Cc: kbuild-...@01.org; linux-scsi@vger.kernel.org; linux-bl...@vger.kernel.org;
> ax...@kernel.dk; martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> sumit.sax...@broadcom.com; Kashyap desai
> Subject: Re: [PATCH] preview - block layer help to detect sequential IO
>
> Hi Kashyap,
>
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to block/for-next linus/master linux/master next-20170111]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Kashyap-Desai/preview-block-layer-help-to-detect-sequential-IO/20170112-024228
> config: i386-randconfig-a0-201702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    block/blk-core.c: In function 'add_sequential':
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
> named 'sequential_io_avg'
>      blk_ewma_add(t->sequential_io_avg,

This error is fixable. For now, I just wanted to get a high-level review of the idea. The defines below are required to use sequential_io and sequential_io_avg; I had CONFIG_BCACHE enabled in the .config I used for testing.

#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
	unsigned int	sequential_io;
	unsigned int	sequential_io_avg;
#endif

Looking for high-level review comments.
` Kashyap >^ >block/blk-core.c:1893:10: note: in definition of macro 'blk_ewma_add' > (ewma) *= (weight) - 1; \ > ^~~~ > >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named > 'sequential_io_avg' > blk_ewma_add(t->sequential_io_avg, >^ >block/blk-core.c:1894:10: note: in definition of macro 'blk_ewma_add' > (ewma) += (val) << factor; \ > ^~~~ > >> block/blk-core.c:1900:5: error: 'struct task_struct' has no member named > 'sequential_io' >t->sequential_io, 8, 0); > ^ >block/blk-core.c:1894:20: note: in definition of macro 'blk_ewma_add' > (ewma) += (val) << factor; \ >^~~ > >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named > 'sequential_io_avg' > blk_ewma_add(t->sequential_io_avg, >^ >block/blk-core.c:1895:10: note: in definition of macro 'blk_ewma_add' > (ewma) /= (weight); \ > ^~~~ > >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named > 'sequential_io_avg' > blk_ewma_add(t->sequential_io_avg, >^ >block/blk-core.c:1896:10: note: in definition of macro 'blk_ewma_add' > (ewma) >> factor; \ > ^~~~ >block/blk-core.c:1902:3: error: 'struct task_struct' has no member named > 'sequential_io' > t->sequential_io = 0; > ^~ >block/blk-core.c: In function 'generic_make_request_checks': >block/blk-core.c:2012:7: error: 'struct task_struct' has no member named > 'sequential_io' > task->sequential_io = i->sequential; > ^~ >In file included from block/blk-core.c:14:0: >block/blk-core.c:2020:21: error: 'struct task_struct' has no member named > 'sequential_io' > sectors = max(task->sequential_io, > ^ >include/linux/kernel.h:747:2: note: in definition of macro '__max' > t1 max1 = (x); \ > ^~ >block/blk-core.c:2020:13: note: in expansion of macro 'max' > sectors = max(task->sequential_io, > ^~~ >block/blk-core.c:2020:21: error: 'struct task_struct' has no member named > 'sequential_io' > sectors = max(task->sequential_io, > ^ >include/linux/kernel.h:747:13: note: in definition of macro '__max' > t1 max1 = 
(x); \ > ^ >block/blk-core.c:2020:13: note: in expansion of macro 'max' > sectors = max(task->sequential_io, > ^~~ >block/blk-core.c:2021:14: error: 'struct task_struct' has no member named > 'sequential_io_avg' >
RE: [PATCH] preview - block layer help to detect sequential IO
> Hi, Kashyap,
>
> I'm CC-ing Kent, seeing how this is his code.

Hi Jeff and Kent, see my reply inline.

> Kashyap Desai writes:
>
> > Objective of this patch is -
> >
> > To move code used in bcache module in block layer which is used to
> > find IO stream. Reference code @drivers/md/bcache/request.c
> > check_should_bypass(). This is a high level patch for review and
> > understand if it is worth to follow ?
> >
> > As of now bcache module use this logic, but good to have it in block
> > layer and expose function for external use.
> >
> > In this patch, I move logic of sequential IO search in block layer and
> > exposed function blk_queue_rq_seq_cutoff. Low level driver just need
> > to call if they want stream detection per request queue. For my
> > testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> > 4) megaraid_sas driver.
> >
> > In general, code of bcache module was referred and they are doing
> > almost same as what we want to do in megaraid_sas driver below patch -
> >
> > http://marc.info/?l=linux-scsi&m=148245616108288&w=2
> >
> > bcache implementation use search algorithm (hashed based on bio start
> > sector) and detects 128 streams. wanted those implementation
> > to skip sequential IO to be placed on SSD and move it direct to the
> > HDD.
> >
> > Will it be good design to keep this algorithm open at block layer (as
> > proposed in patch.) ?
>
> It's almost always a good idea to avoid code duplication, but this patch
> definitely needs some work.

Jeff, I was not aware of the relevant block layer code, so I created just a working patch to explain my point. Please check the new patch; this version contains changes only in the driver.

1. The megaraid_sas driver patch below does something similar, but using an array-based linear lookup:
http://marc.info/?l=linux-scsi&m=148245616108288&w=2

2. I thought to improve on that using the appended patch. It is similar to what bcache is doing, so this patch duplicates the bcache code.
> I haven't looked terribly closely at the bcache implementation, so do let
> me know if I've misinterpreted something.
>
> We should track streams per io_context/queue pair. We already have a data
> structure for that, the io_cq. Right now that structure is tailored for
> use by the I/O schedulers, but I'm sure we could rework that. That would
> also get rid of the tremendous amount of bloat this patch adds to the
> request_queue. It will also allow us to remove the bcache-specific fields
> that were added to task_struct. Overall, it should be a good
> simplification, unless I've completely missed the point (which happens).

Your understanding of the requirement is correct. What we need is a tracker in the block layer, consulted for every request, to know whether a given request is random or sequential IO. As you explained, there is similar logic in the elevator; searching the kernel code, I found the code section below in block/elevator.c:

/*
 * See if our hash lookup can find a potential backmerge.
 */
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);

I am looking to have what elv_rqhash_find() does applied to all IOs, and to attach the information to the request, marking whether this particular request is a potential back-merge candidate (e.g. via a new req_flags_t such as RQF_SEQ). That is fine even if the request was not actually merged due to other checks in the IO path. To be on the safe side (to avoid any performance issues), we can opt for an API the low-level driver calls on a particular request queue/sdev when it is interested in such help for that queue. I need help (some level of patch to work from) or a pointer on whether this path is good. I can drive this, but I need to understand the direction.
For my learning, BTW - for such a post (when I only have a high-level patch), what should I do?

> You've added a member to struct bio that isn't referenced. It would have
> been nice of you to put enough work into this RFC so that we could at
> least see how the common code was used by bcache and your driver.

See my second patch appended here. I can work on generic block layer changes if some other area (e.g. the elevator/CFQ, as mentioned) already does the kind of thing I am looking for.

> EWMA (exponentially weighted moving average) is not an acronym I keep
> handy in my head. It would be nice to add documentation on the algorithm
> and design choices. More comments in the code would also be appreciated.
> CFQ does some
RE: Device or HBA level QD throttling creates randomness in sequetial workload
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch blk-mq-sched (commit 0efe27068ecf37ece2728a99b863763286049ab5) and confirm that the issue reported in this thread is resolved. Both MQ and SQ mode now result in a sequential IO pattern while IO is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it acceptable to pause IO for a few usec in the LLD? I mean, I want to avoid the driver asking the SML/block layer to re-queue the IO (when it is sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver: the driver exposes can_queue, but it internally consumes commands for RAID 1 fast path writes. In the worst case, can_queue/2 commands can consume all firmware resources, and the driver will re-queue further IOs to the SML as below -

	if (atomic_inc_return(&instance->fw_outstanding) >
			instance->host->can_queue) {
		atomic_dec(&instance->fw_outstanding);
		return SCSI_MLQUEUE_HOST_BUSY;
	}

I want to avoid the above SCSI_MLQUEUE_HOST_BUSY. Need your suggestion for the changes below -

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 
 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
 	u32 index;
-	struct fusion_context *fusion;
+	bool is_nonrot;
+	u32 safe_can_queue;
+	u32 num_cpus;
+	struct fusion_context *fusion;
+
+	fusion = instance->ctrl_context;
+
+	num_cpus = num_online_cpus();
+	safe_can_queue = instance->cur_can_queue - num_cpus;
 
 	fusion = instance->ctrl_context;
@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct megasas_instance *instance,
 		return SCSI_MLQUEUE_DEVICE_BUSY;
 	}
 
-	if (atomic_inc_return(&instance->fw_outstanding) >
-			instance->host->can_queue) {
-		atomic_dec(&instance->fw_outstanding);
-		return SCSI_MLQUEUE_HOST_BUSY;
-	}
+	if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+		is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+		/* For rotational device wait for sometime to get fusion command from pool.
+		 * This is just to reduce proactive re-queue at mid layer which is not
+		 * sending sorted IO in SCSI.MQ mode.
+		 */
+		if (!is_nonrot)
+			udelay(100);
+	}
 
 	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

` Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-ker...@vger.kernel.org'; 'linux-
> bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens - Replied inline.
>
> Omar - I tested your WIP repo and figured out the system hangs only if I
> pass "scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine,
> but I am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is a snippet of blktrace. In case of higher per-device QD, I
> see Requeue requests in blktrace.
> > 65,128 10 6268 2.432404509 18594 P N [fio] > 65,128 10 6269 2.432405013 18594 U N [fio] 1 > 65,128 10 6270 2.432405143 18594 I WS 148800 + 8 [fio] > 65,128 10 6271 2.432405740 18594 R WS 148800 + 8 [0] > 65,128 10 6272 2.432409794 18594 Q WS 148808 + 8 [fio] > 65,128 10 6273 2.432410234 18594 G WS 148808 + 8 [fio] > 65,128 10 6274 2.432410424 18594 S WS 148808 + 8 [fio] > 65,128 23 3626 2.432432595 16232 D WS 148800 + 8 > [kworker/23:1H] > 65,128 22 3279 2.432973482 0 C WS 147432 + 8 [0] > 65,128 7 6126 2.433032637 18594 P N [fio] > 65,128 7 6127 2.433033204 18594 U N [fio] 1 > 65,128 7 6128 2.433033346 18594 I WS 148808 + 8 [fio] > 65,128 7 6129 2.433033871 18594 D WS 148808 + 8 [fio] > 65,128 7 6130 2.433034559 18594 R WS 148808 + 8 [0] > 65,128 7 6131 2.433039796 18594 Q WS 148816 + 8 [fio] > 65,128 7 6132 2.433040206 18594 G WS 148816 + 8 [fio] > 65,128 7 6133 2.433040351 18594 S WS 148816 + 8 [fio] > 65,128 9 6392 2.433133729 0 C WS 147240 + 8 [0] > 65,128 9 6393
RE: Device or HBA level QD throttling creates randomness in sequetial workload
> -Original Message- > From: Jens Axboe [mailto:ax...@kernel.dk] > Sent: Monday, January 30, 2017 10:03 PM > To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com > Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; > h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org > Subject: Re: Device or HBA level QD throttling creates randomness in > sequetial workload > > On 01/30/2017 09:30 AM, Bart Van Assche wrote: > > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote: > >> - if (atomic_inc_return(&instance->fw_outstanding) > > >> - instance->host->can_queue) { > >> - atomic_dec(&instance->fw_outstanding); > >> - return SCSI_MLQUEUE_HOST_BUSY; > >> - } > >> + if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) { > >> + is_nonrot = blk_queue_nonrot(scmd->device->request_queue); > >> + /* For rotational device wait for sometime to get fusion > >> + command > >> from pool. > >> +* This is just to reduce proactive re-queue at mid layer > >> + which is > >> not > >> +* sending sorted IO in SCSI.MQ mode. > >> +*/ > >> + if (!is_nonrot) > >> + udelay(100); > >> + } > > > > The SCSI core does not allow to sleep inside the queuecommand() > > callback function. > > udelay() is a busy loop, so it's not sleeping. That said, it's obviously NOT a > great idea. We want to fix the reordering due to requeues, not introduce > random busy delays to work around it. Thanks for feedback. I do realize that udelay() is going to be very odd in queue_command call back. I will keep this note. Preferred solution is blk mq scheduler patches. > > -- > Jens Axboe -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 00/10] mpt3sas: full mq support
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Tuesday, January 31, 2017 4:47 PM
> To: Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> Sathya Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
> > On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
> >> Hi all,
> >>
> >> this is a patchset to enable full multiqueue support for the mpt3sas driver.
> >> While the HBA only has a single mailbox register for submitting
> >> commands, it does have individual receive queues per MSI-X interrupt
> >> and as such does benefit from converting it to full multiqueue support.
> >
> > Explanation and numbers on why this would be beneficial, please.
> > We should not need multiple submissions queues for a single register
> > to benefit from multiple completion queues.
> >
> Well, the actual throughput very strongly depends on the blk-mq-sched
> patches from Jens.
> As this is barely finished I didn't post any numbers yet.
>
> However:
> With multiqueue support:
> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt= 60021msec
> With scsi-mq on 1 queue:
> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
> So yes, there _is_ a benefit.
>
> (Which is actually quite cool, as these tests were done on a SAS3 HBA, so
> we're getting close to the theoretical maximum of 1.2GB/s).
> (Unlike the single-queue case :-)

Hannes -

Can you share details about the setup? How many drives do you have, and how are they connected (enclosure -> drives?)? To me it looks like the current mpt3sas driver might be taking a bigger hit in spinlock operations (the penalty on a NUMA architecture is larger compared to a single-socket server), unlike the megaraid_sas driver with its use of a shared blk tag.
I mean, the "[PATCH 08/10] mpt3sas: lockless command submission for scsi-mq" patch improves performance by removing spinlock overhead and getting the request via blk tags. Are you seeing a performance improvement if you hard-code nr_hw_queues = 1 in the code changes below, part of "[PATCH 10/10] mpt3sas: scsi-mq interrupt steering"?

@@ -9054,6 +9071,8 @@ static void sas_device_make_active(struct MPT3SAS_ADAPTER *ioc,
 	shost->max_lun = max_lun;
 	shost->transportt = mpt3sas_transport_template;
 	shost->unique_id = ioc->id;
+	if (shost->use_blk_mq)
+		shost->nr_hw_queues = num_online_cpus();

Thanks, Kashyap

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke	Teamlead Storage & Networking
> h...@suse.de	+49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
RE: [PATCH 00/10] mpt3sas: full mq support
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Wednesday, February 01, 2017 12:21 PM > To: Kashyap Desai; Christoph Hellwig > Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org; > Sathya > Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy > Subject: Re: [PATCH 00/10] mpt3sas: full mq support > > On 01/31/2017 06:54 PM, Kashyap Desai wrote: > >> -Original Message- > >> From: Hannes Reinecke [mailto:h...@suse.de] > >> Sent: Tuesday, January 31, 2017 4:47 PM > >> To: Christoph Hellwig > >> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org; > > Sathya > >> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com > >> Subject: Re: [PATCH 00/10] mpt3sas: full mq support > >> > >> On 01/31/2017 11:02 AM, Christoph Hellwig wrote: > >>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote: > >>>> Hi all, > >>>> > >>>> this is a patchset to enable full multiqueue support for the > >>>> mpt3sas > >> driver. > >>>> While the HBA only has a single mailbox register for submitting > >>>> commands, it does have individual receive queues per MSI-X > >>>> interrupt and as such does benefit from converting it to full > >>>> multiqueue > > support. > >>> > >>> Explanation and numbers on why this would be beneficial, please. > >>> We should not need multiple submissions queues for a single register > >>> to benefit from multiple completion queues. > >>> > >> Well, the actual throughput very strongly depends on the blk-mq-sched > >> patches from Jens. > >> As this is barely finished I didn't post any numbers yet. > >> > >> However: > >> With multiqueue support: > >> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt= > 60021msec > >> With scsi-mq on 1 queue: > >> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec > >> So yes, there _is_ a benefit. 
> >> (Which is actually quite cool, as these tests were done on a SAS3 HBA,
> >> so we're getting close to the theoretical maximum of 1.2GB/s).
> >> (Unlike the single-queue case :-)
> >
> > Hannes -
> >
> > Can you share details about the setup? How many drives do you have, and
> > how are they connected (enclosure -> drives?)? To me it looks like the
> > current mpt3sas driver might be taking a bigger hit in spinlock
> > operations (the penalty on a NUMA architecture is larger compared to a
> > single-socket server), unlike the megaraid_sas driver with its use of a
> > shared blk tag.
> >
> The tests were done with a single LSI SAS3008 connected to a NetApp
> E-series (2660), using 4 LUNs under MD-RAID0.
>
> Megaraid_sas is even worse here; due to the odd nature of the 'fusion'
> implementation we're ending up having _two_ sets of tags, making it really
> hard to use scsi-mq here.

The current megaraid_sas driver, with a single submission queue exposed to blk-mq, will not encounter a similar performance issue. We may not see a significant performance improvement if we attempt the same for the megaraid_sas driver. We had a similar discussion for megaraid_sas and hpsa:

http://www.spinics.net/lists/linux-scsi/msg101838.html

I see this patch series as a similar attempt for mpt3sas - am I missing anything? The megaraid_sas driver just indexes from the blk tag and fires the IO quickly enough, unlike mpt3sas where lock contention at the driver level is the bottleneck.

> (Not that I didn't try; but lacking a proper backend it's really hard to
> evaluate the benefit of those ... spinning HDDs simply don't cut it here)
>
> > I mean " [PATCH 08/10] mpt3sas: lockless command submission for scsi-mq"
> > patch is improving performance removing spinlock overhead and
> > attempting to get request using blk_tags.
> > Are you seeing performance improvement if you hard code nr_hw_queues
> > = 1 in below code changes part of "[PATCH 10/10] mpt3sas: scsi-mq
> > interrupt steering"
> >
> No.
The numbers posted above are generated with exactly that patch; the > first line is running with nr_hw_queues=32 and the second line with > nr_hw_queues=1. Thanks Hannes, that clarifies. Can you share the script you used? If my understanding is correct, you will only see the theoretical maximum of 1.2GB/s if you restrict your workload to a single NUMA node. This is just to understand whether driver spinlocks are adding overhead. We have seen such overhead on multi-socket server an
RE: [PATCH 03/39] megaraid_sas: raid 1 fast path code optimize
> > > > /** > > + * megasas_complete_r1_command - > > + * completes R1 FP write commands which has valid peer smid > > + * @instance: Adapter soft state > > + * @cmd_fusion:MPT command frame > > + * > > + */ > > +static inline void > > +megasas_complete_r1_command(struct megasas_instance *instance, > > + struct megasas_cmd_fusion *cmd) { > > + u8 *sense, status, ex_status; > > + u32 data_length; > > + u16 peer_smid; > > + struct fusion_context *fusion; > > + struct megasas_cmd_fusion *r1_cmd = NULL; > > + struct scsi_cmnd *scmd_local = NULL; > > + struct RAID_CONTEXT_G35 *rctx_g35; > > + > > + rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35; > > + fusion = instance->ctrl_context; > > + peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid); > > + > > + r1_cmd = fusion->cmd_list[peer_smid - 1]; > > + scmd_local = cmd->scmd; > > + status = rctx_g35->status; > > + ex_status = rctx_g35->ex_status; > > + data_length = cmd->io_request->DataLength; > > + sense = cmd->sense; > > + > > + cmd->cmd_completed = true; > > + > > + /* Check if peer command is completed or not*/ > > + if (r1_cmd->cmd_completed) { > > + if (rctx_g35->status != MFI_STAT_OK) { > > + status = rctx_g35->status; > > + ex_status = rctx_g35->ex_status; > > Both status + ex_status were already set to the same value, why is it > repeated here ? Tomas, this needs a fix. The RAID context should be switched to r1_cmd here, but that is not done. The intent is: if r1_cmd completed with a failure, read the status and extended status from r1_cmd to send the final status to the mid layer. We will fix this and resend the patch. It will look like this - if (r1_cmd->cmd_completed) { rctx_g35 = &r1_cmd->io_request->RaidContext.raid_context_g35; << - this line should be added if (rctx_g35->status != MFI_STAT_OK) { status = rctx_g35->status; ex_status = rctx_g35->ex_status; Thanks, Kashyap > > Tomas >
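The fix described above can be sketched outside the driver. The structure and helper names below are simplified, hypothetical stand-ins for the megaraid_sas ones; only the decision logic (read a peer failure from r1_cmd's RAID context, not the current command's) mirrors the proposed change.

```c
#include <assert.h>
#include <stdbool.h>

#define MFI_STAT_OK 0x00

/* Simplified stand-ins for the driver structures (names are illustrative). */
struct raid_ctx { unsigned char status, ex_status; };
struct fp_cmd   { bool cmd_completed; struct raid_ctx rctx; };

/*
 * Final status sent to the mid layer: once the peer (r1_cmd) has
 * completed, a peer failure must be taken from r1_cmd's RAID context.
 * The bug was re-reading the current command's already-captured status.
 */
static unsigned char final_status(const struct fp_cmd *cmd,
                                  const struct fp_cmd *r1_cmd)
{
    unsigned char status = cmd->rctx.status;

    if (r1_cmd->cmd_completed && r1_cmd->rctx.status != MFI_STAT_OK)
        status = r1_cmd->rctx.status;   /* switch to the peer's context */

    return status;
}
```

The two asserts below exercise the interesting case: the local write succeeded but the mirror-arm write failed, so the peer's status must win.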
RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Monday, February 06, 2017 9:35 PM > To: Shivasharan S; linux-scsi@vger.kernel.org > Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com; > kashyap.de...@broadcom.com; sumit.sax...@broadcom.com; > h...@suse.com > Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during > controller shutdown/detach > > On 6.2.2017 11:00, Shivasharan S wrote: > > Signed-off-by: Kashyap Desai > > Signed-off-by: Shivasharan S > > > --- > > drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > > index 04ef0a0..b29cfd3 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev, > pm_message_t state) > > if (instance->ev != NULL) { > > struct megasas_aen_event *ev = instance->ev; > > cancel_delayed_work_sync(&ev->hotplug_work); > > + flush_scheduled_work(); > > instance->ev = NULL; > > } > > > > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev > *pdev) > > if (instance->ev != NULL) { > > struct megasas_aen_event *ev = instance->ev; > > cancel_delayed_work_sync(&ev->hotplug_work); > > + flush_scheduled_work(); > > instance->ev = NULL; > > } > > > > Why is cancel_delayed_work_sync not good enough? The megaraid_sas driver queues certain work items on the global workqueue. They are listed below - if (instance->ctrl_context) { INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq); INIT_WORK(&instance->crash_init, megasas_fusion_crash_dump_wq); } else INIT_WORK(&instance->work_init, process_fw_state_change_wq) cancel_delayed_work_sync() targets only the hotplug AEN work. By calling flush_scheduled_work() we want the work items listed above to be completed as well. > > tomash
RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach
> -Original Message- > From: Kashyap Desai [mailto:kashyap.de...@broadcom.com] > Sent: Monday, February 06, 2017 10:48 PM > To: 'Tomas Henzl'; Shivasharan Srikanteshwara; 'linux-scsi@vger.kernel.org' > Cc: 'martin.peter...@oracle.com'; 'j...@linux.vnet.ibm.com'; Sumit Saxena; > 'h...@suse.com' > Subject: RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during > controller shutdown/detach > > > -Original Message- > > From: Tomas Henzl [mailto:the...@redhat.com] > > Sent: Monday, February 06, 2017 9:35 PM > > To: Shivasharan S; linux-scsi@vger.kernel.org > > Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com; > > kashyap.de...@broadcom.com; sumit.sax...@broadcom.com; > h...@suse.com > > Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work > > during controller shutdown/detach > > > > On 6.2.2017 11:00, Shivasharan S wrote: > > > Signed-off-by: Kashyap Desai > > > Signed-off-by: Shivasharan S > > > > > --- > > > drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > > b/drivers/scsi/megaraid/megaraid_sas_base.c > > > index 04ef0a0..b29cfd3 100644 > > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > > > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev, > > pm_message_t state) > > > if (instance->ev != NULL) { > > > struct megasas_aen_event *ev = instance->ev; > > > cancel_delayed_work_sync(&ev->hotplug_work); > > > + flush_scheduled_work(); > > > instance->ev = NULL; > > > } > > > > > > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev > > *pdev) > > > if (instance->ev != NULL) { > > > struct megasas_aen_event *ev = instance->ev; > > > cancel_delayed_work_sync(&ev->hotplug_work); > > > + flush_scheduled_work(); > > > instance->ev = NULL; > > > } > > > > > > > Why is cancel_delayed_work_sync not good enough? 
> > Megaraid_sas driver use certain work on global work queue. > > Below are the listed one - > > if (instance->ctrl_context) { > INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq); > INIT_WORK(&instance->crash_init, > megasas_fusion_crash_dump_wq); > } > else > INIT_WORK(&instance->work_init, > process_fw_state_change_wq) > > Cancel_delayed_work_sync() was mainly targeted for only hotplug AEN work. > Calling flush_scheduled_work() we want above listed work to be completed > as well. Tomas - Here is one more update. I agree with your assessment: we don't need this patch. Our local repo had the code below, and during the sync-up activity I did not realize that upstream uses cancel_delayed_work_sync(), which internally does the same thing. cancel_delayed_work(&ev->hotplug_work); flush_scheduled_work(); Just for info - a similar patch was posted for mpt2sas long ago to replace the above combination with cancel_delayed_work_sync(): https://lkml.org/lkml/2010/12/21/127 We will drop this patch in the V2 submission. > > > > > tomash
RE: [PATCH 13/39] megaraid_sas : set residual bytes count during IO compeltion
> -Original Message- > From: Martin K. Petersen [mailto:martin.peter...@oracle.com] > Sent: Tuesday, February 07, 2017 5:22 AM > To: Shivasharan S > Cc: linux-scsi@vger.kernel.org; martin.peter...@oracle.com; > the...@redhat.com; j...@linux.vnet.ibm.com; > kashyap.de...@broadcom.com; sumit.sax...@broadcom.com; > h...@suse.com > Subject: Re: [PATCH 13/39] megaraid_sas : set residual bytes count during IO > compeltion > > > "Shivasharan" == Shivasharan S > writes: > > Shivasharan> Fixing issue of not setting residual bytes correctly. > > @@ -1464,6 +1465,15 @@ map_cmd_status(struct fusion_context *fusion, > SCSI_SENSE_BUFFERSIZE); > scmd->result |= DRIVER_SENSE << 24; > } > + > + /* > + * If the IO request is partially completed, then MR FW will > + * update "io_request->DataLength" field with actual number > of > + * bytes transferred.Driver will set residual bytes count in > + * SCSI command structure. > + */ > + resid = (scsi_bufflen(scmd) - data_length); > + scsi_set_resid(scmd, resid); > > Is data_length guaranteed to be a multiple of the logical block size? > Otherwise you need to tweak the residual like we just did for mpt3sas. Martin, the data length is always guaranteed to be a multiple of the logical block size unless we have a firmware defect. In the past we have seen partial/complete DMA data lengths returned from firmware that were not aligned with the logical block size; those cases were eventually root-caused and fixed in firmware. > > -- > Martin K. Petersen Oracle Linux Engineering
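Martin's concern can be made concrete with a small standalone check. The helper names here are hypothetical; only the subtraction mirrors the `resid` computation in the hunk above, and the alignment test expresses the property being asked about.

```c
#include <assert.h>

typedef unsigned int u32;

/* Residual as computed in map_cmd_status(): requested minus transferred. */
static u32 calc_resid(u32 bufflen, u32 data_length)
{
    return bufflen - data_length;
}

/* The property under discussion: a trustworthy residual is a whole number
 * of logical blocks; anything else would point at a firmware defect. */
static int resid_block_aligned(u32 resid, u32 block_size)
{
    return (resid % block_size) == 0;
}
```

For example, an 8 KiB request that transferred 4 KiB leaves a 4096-byte residual, which is block-aligned for 512-byte sectors; a 4100-byte residual would not be.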
RE: [PATCH v2 03/39] megaraid_sas: raid 1 fast path code optimize
> > +static inline void > > +megasas_complete_r1_command(struct megasas_instance *instance, > > + struct megasas_cmd_fusion *cmd) { > > + u8 *sense, status, ex_status; > > + u32 data_length; > > + u16 peer_smid; > > + struct fusion_context *fusion; > > + struct megasas_cmd_fusion *r1_cmd = NULL; > > + struct scsi_cmnd *scmd_local = NULL; > > + struct RAID_CONTEXT_G35 *rctx_g35; > > + > > + rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35; > > + fusion = instance->ctrl_context; > > + peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid); > > + > > + r1_cmd = fusion->cmd_list[peer_smid - 1]; > > + scmd_local = cmd->scmd; > > + status = rctx_g35->status; > > + ex_status = rctx_g35->ex_status; > > + data_length = cmd->io_request->DataLength; > > + sense = cmd->sense; > > + > > + cmd->cmd_completed = true; > > Please help me understand how this works > - there are two peer commands sent to the controller > - both are completed and the later calls scsi_done and returns both r1_cmd > + cmd > - if both commands can be completed at the same time, is it possible that > the > above line is executed at the same moment for both completions ? > How is the code protected against a double completion when both > completed commands see the peer cmd_completed as set ? Tomas, cmd and r1_cmd (the two halves of the same RAID 1 fast-path IO) are always completed on the same reply queue by firmware. That is one of the key requirements for the RAID 1 fast path. What you describe would only be possible if FW completed cmd and r1_cmd on different reply queues. If you notice, when we clone r1_cmd we also clone the MSI-x index from the parent command, so FW is aware of the binding of both cmd and r1_cmd to the same reply queue index. Kashyap > > > +
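The serialization argument above reduces to the following sketch: because firmware completes both halves on one reply queue, the two completion handlers run sequentially, so exactly one of them observes its peer as already done and performs the final completion. Names are illustrative, not driver code.

```c
#include <assert.h>

/*
 * One half of a RAID 1 fast-path pair completes: mark ourselves done,
 * then report whether the peer finished first.  The caller performs the
 * final scsi_done() only when this returns nonzero.  Safe without locks
 * only under the same-reply-queue guarantee described above.
 */
static int complete_half(int *self_done, const int *peer_done)
{
    *self_done = 1;
    return *peer_done;
}

/* Run both completions in order, as a single reply queue would. */
static int run_pair(void)
{
    int a = 0, b = 0;
    int first  = complete_half(&a, &b);  /* peer not done: defer */
    int second = complete_half(&b, &a);  /* peer done: finish cmd */

    return first == 0 && second == 1;
}
```

Whichever order the two halves arrive in, only the second one sees `peer_done` set, so scsi_done() is called exactly once.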
RE: [PATCH v2 19/39] megaraid_sas: MR_TargetIdToLdGet u8 to u16 and avoid invalid raid-map access
> Signed-off-by: Shivasharan S > Signed-off-by: Kashyap Desai We are done with review for this patch series, but this particular patch is missing a Reviewed-by tag. Kashyap
RE: [PATCH v2 21/39] megaraid_sas: big endian support changes
> +static inline void set_num_sge(struct RAID_CONTEXT_G35 rctx_g35, > +u16 sge_count) > +{ > + rctx_g35.u.bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER); > + rctx_g35.u.bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER) > + & > NUM_SGE_MASK_UPPER); > +} This function and get_num_sge() below need a fix. We are supposed to pass a pointer to struct RAID_CONTEXT_G35 so that the setting is reflected in the IO frame; otherwise it is only set in stack-local memory, which is not the intent here. We will fix this patch and resend. Fixing and resending only this patch works fine with the complete series (no hunk failures observed), so we are going to push just this one patch with the title below. [PATCH v2 21/39 RESEND] megaraid_sas: big endian support changes > + > +static inline u16 get_num_sge(struct RAID_CONTEXT_G35 rctx_g35) { > + u16 sge_count; > + > + sge_count = (u16)(((rctx_g35.u.bytes[1] & NUM_SGE_MASK_UPPER) > + << NUM_SGE_SHIFT_UPPER) | (rctx_g35.u.bytes[0])); > + return sge_count; > +} > + > +#define SET_STREAM_DETECTED(rctx_g35) \ > + (rctx_g35.u.bytes[1] |= STREAM_DETECT_MASK) > + > +#define CLEAR_STREAM_DETECTED(rctx_g35) \ > + (rctx_g35.u.bytes[1] &= ~(STREAM_DETECT_MASK)) > + > +static inline bool is_stream_detected(struct RAID_CONTEXT_G35 > +*rctx_g35) { > + return ((rctx_g35->u.bytes[1] & STREAM_DETECT_MASK)); } > + > union RAID_CONTEXT_UNION { > struct RAID_CONTEXT raid_context; > struct RAID_CONTEXT_G35 raid_context_g35; > -- > 2.8.3
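The defect is the classic pass-by-value mistake. The standalone sketch below (mask and shift values are illustrative, not necessarily the driver's) shows why the fix must take a pointer: the by-value variant only writes a stack copy, leaving the real IO frame untouched.

```c
#include <assert.h>

typedef unsigned char u8;
typedef unsigned short u16;

/* Illustrative mask/shift values -- the driver's actual ones may differ. */
#define NUM_SGE_MASK_LOWER  0xffu
#define NUM_SGE_SHIFT_UPPER 8
#define NUM_SGE_MASK_UPPER  0x0fu

struct rctx { u8 bytes[2]; };

/* Buggy form: the struct arrives by value, so only a copy is modified. */
static void set_num_sge_by_value(struct rctx r, u16 sge_count)
{
    r.bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER);
    r.bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER) & NUM_SGE_MASK_UPPER);
}

/* Fixed form: a pointer reaches the caller's (the IO frame's) memory. */
static void set_num_sge_by_ptr(struct rctx *r, u16 sge_count)
{
    r->bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER);
    r->bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER) & NUM_SGE_MASK_UPPER);
}

static int demo(void)
{
    struct rctx r = { { 0, 0 } };

    set_num_sge_by_value(r, 0x123);
    if (r.bytes[0] != 0x00)          /* bug: caller's frame unchanged */
        return 0;
    set_num_sge_by_ptr(&r, 0x123);   /* fix: low byte 0x23, high nibble 1 */
    return r.bytes[0] == 0x23 && r.bytes[1] == 0x01;
}
```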
[PATCH] return valid data buffer length in scsi_bufflen() API using RQF_SPECIAL_PAYLOAD
Regression due to commit f9d03f96b988002027d4b28ea1b7a24729a4c9b5 ("block: improve handling of the magic discard payload"): HBA FW encounters a fault during a DMA operation while creating a file system on SSDs. The CDB below causes the FW fault. CDB: Write same(16) 93 08 00 00 00 00 00 00 00 00 00 00 80 00 00 00 Root cause is a mismatch between the SCSI buffer length and the DMA buffer length for the WRITE SAME command. Fix - return the valid data buffer length from the scsi_bufflen() API using RQF_SPECIAL_PAYLOAD. Signed-off-by: Kashyap Desai --- diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h index 9fc1aec..1f796fc 100644 --- a/include/scsi/scsi_cmnd.h +++ b/include/scsi/scsi_cmnd.h @@ -180,7 +180,8 @@ static inline struct scatterlist *scsi_sglist(struct scsi_cmnd *cmd) static inline unsigned scsi_bufflen(struct scsi_cmnd *cmd) { - return cmd->sdb.length; + return (cmd->request->rq_flags & RQF_SPECIAL_PAYLOAD) ? + cmd->request->special_vec.bv_len : cmd->sdb.length; } static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid)
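A userspace model of the fixed accessor follows. The structure layout and the flag's bit value are stand-ins (the kernel's differ); only the selection logic mirrors the patch: discard/write-same requests carry their payload in `special_vec`, so the buffer length must come from there rather than from the scatter-list descriptor.

```c
#include <assert.h>

typedef unsigned int u32;

#define RQF_SPECIAL_PAYLOAD (1u << 0)   /* illustrative flag value */

struct request   { u32 rq_flags; u32 special_vec_len; };
struct scsi_cmnd { struct request *request; u32 sdb_length; };

/* Mirrors the fixed scsi_bufflen(): special-payload requests report the
 * payload vector length, everything else reports the sdb length. */
static u32 scsi_bufflen_fixed(const struct scsi_cmnd *cmd)
{
    return (cmd->request->rq_flags & RQF_SPECIAL_PAYLOAD) ?
        cmd->request->special_vec_len : cmd->sdb_length;
}

static int demo(void)
{
    struct request rq = { RQF_SPECIAL_PAYLOAD, 4096 };
    struct scsi_cmnd cmd = { &rq, 0 };

    if (scsi_bufflen_fixed(&cmd) != 4096)    /* WRITE SAME path */
        return 0;
    rq.rq_flags = 0;
    cmd.sdb_length = 512;
    return scsi_bufflen_fixed(&cmd) == 512;  /* normal IO path */
}
```

With the old accessor the WRITE SAME case would have returned 0 (the sdb length), which is the SCSI-vs-DMA length mismatch that faulted the firmware.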
RE: [PATCH 00/10] mpt3sas: full mq support
> > > Hannes, > > Result I have posted last time is with merge operation enabled in block > layer. If I disable merge operation then I don't see much improvement > with > multiple hw request queues. Here is the result, > > fio results when nr_hw_queues=1, > 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905, > runt=150003msec > > fio results when nr_hw_queues=24, > 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393, > runt=150001msec Hannes - I worked with Sreekanth and now understand the pros and cons of patch #10, "[PATCH 10/10] mpt3sas: scsi-mq interrupt steering". In that patch, the HBA's can_queue is divided across logical CPUs; that is, we mimic multi-queue support on the mpt3sas HBA by distributing the actual resources of what is a single submission H/W queue. This approach badly impacts many performance areas. nr_hw_queues = 1 is the best-performing approach I observe, since it never throttles IO when sdev->queue_depth is set to the HBA queue depth. With nr_hw_queues = "CPUs" we throttle IO at the SCSI level, since we never allow more than the reduced can_queue in the LLD. The code below brings the actual HBA can_queue very low (e.g. on a 96-logical-core CPU the new can_queue drops to 42 if the HBA queue depth is 4K). It means we will see lots of IO throttling in the SCSI mid layer, because shost->can_queue reaches its limit very quickly for jobs with higher QD. if (ioc->shost->nr_hw_queues > 1) { ioc->shost->nr_hw_queues = ioc->msix_vector_count; ioc->shost->can_queue /= ioc->msix_vector_count; } I observe negative performance with 8 SSD drives attached to Ventura (the latest IT controller). 16 fio jobs at QD=128 give ~1600K IOPS, and the moment I switch to nr_hw_queues = "CPUs" it gives hardly ~850K IOPS. This is mainly because host_busy is stuck at a very low ~169 on my setup. As Sreekanth mentioned, the performance improvement you observed may be because nomerges=2 was not set and the OS will attempt soft back/front merges.
I debugged a live machine and understood that we never see parallel instances of "scsi_dispatch_cmd" as expected, because can_queue is low. If we really had a *very* large HBA QD, this patch #10 exposing multiple SQs might be useful. For now, we are looking for an updated version of the patch which only keeps the IT HBA in SQ mode (as we do in the driver today) and adds an interface to use blk_tag in both scsi-mq and !scsi-mq modes. Sreekanth has already started working on it, but we may need a full performance test run before posting the actual patch. Maybe we can cherry-pick a few patches from this series and get blk_tag support to improve performance later, without exposing nr_hw_queues as a tunable. Thanks, Kashyap > > Thanks, > Sreekanth
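The throttling arithmetic discussed above is easy to reproduce: a statically split 4K-deep HBA divided across 96 logical CPUs leaves each hw queue roughly 42 tags. A minimal model of the division patch #10 performs (function name is illustrative):

```c
#include <assert.h>

/* The static split from patch #10: per-queue depth is the HBA depth
 * divided by the number of MSI-X vectors (== nr_hw_queues). */
static int per_queue_can_queue(int hba_can_queue, int msix_vector_count)
{
    return hba_can_queue / msix_vector_count;
}
```

A job with QD=128 on one such queue therefore saturates its 42-tag budget long before the HBA itself is busy, which matches the low host_busy (~169) observed on the test setup.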
RE: [PATCH 00/10] mpt3sas: full mq support
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Wednesday, February 15, 2017 3:35 PM > To: Kashyap Desai; Sreekanth Reddy > Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux- > s...@vger.kernel.org; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX > Subject: Re: [PATCH 00/10] mpt3sas: full mq support > > On 02/15/2017 10:18 AM, Kashyap Desai wrote: > >> > >> > >> Hannes, > >> > >> Result I have posted last time is with merge operation enabled in > >> block layer. If I disable merge operation then I don't see much > >> improvement with multiple hw request queues. Here is the result, > >> > >> fio results when nr_hw_queues=1, > >> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905, > >> runt=150003msec > >> > >> fio results when nr_hw_queues=24, > >> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393, > >> runt=150001msec > > > > Hannes - > > > > I worked with Sreekanth and also understand pros/cons of Patch #10. > > " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering" > > > > In above patch, can_queue of HBA is divided based on logic CPU, it > > means we want to mimic as if mpt3sas HBA support multi queue > > distributing actual resources which is single Submission H/W Queue. > > This approach badly impact many performance areas. > > > > nr_hw_queues = 1 is what I observe as best performance approach since > > it never throttle IO if sdev->queue_depth is set to HBA queue depth. > > In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we > > never allow more than "updated can_queue" in LLD. > > > True. > And this was actually one of the things I wanted to demonstrate with this > patchset :-) ATM blk-mq really works best when having a distinct tag space > per port/device. As soon as the hardware provides a _shared_ tag space you > end up with tag starvation issues as blk-mq only allows you to do a static > split of the available tagspace. 
> While this patchset demonstrates that the HBA itself _does_ benefit from > using block-mq (especially on highly parallel loads), it also demonstrates > that > _block-mq_ has issues with singlethreaded loads on this HBA (or, rather, > type of HBA, as I doubt this issue is affecting mpt3sas only). > > > Below code bring actual HBA can_queue very low ( Ea on 96 logical core > > CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we > > will see lots of IO throttling in scsi mid layer due to > > shost->can_queue reach the limit very soon if you have jobs with > higher QD. > > > > if (ioc->shost->nr_hw_queues > 1) { > > ioc->shost->nr_hw_queues = ioc->msix_vector_count; > > ioc->shost->can_queue /= ioc->msix_vector_count; > > } > > I observe negative performance if I have 8 SSD drives attached to > > Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K > > IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly > > ~850K IOPs. This is mainly because of host_busy stuck at very low ~169 > > on > my setup. > > > Which actually might be an issue with the way scsi is hooked into blk-mq. > The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the > host is > capable of accepting more commands. > As we're limiting can_queue (to get the per-queue command depth > correctly) we should be using the _overall_ command depth for the > can_queue value itself to make the host_busy check work correctly. > > I've attached a patch for that; can you test if it makes a difference? Hannes - Attached patch works fine for me. FYI - We need to set device queue depth to can_queue as we are currently not doing in mpt3sas driver. With attached patch when I tried, I see ~2-3% improvement running multiple jobs. Single job profile no difference. So looks like we are good to reach performance with single nr_hw_queues. We have some patches to be send so want to know how to rebase this patch series as few patches coming from Broadcom. 
Can we consider the plan below? - Patches 1-7 will be reposted. Sreekanth will also complete review of the existing patches 1-7. - We need blk_tag support only for nr_hw_queues = 1. With that said, many code changes/functions can drop the "shost_use_blk_mq" check and assume a driver that supports a single nr_hw_queue. E.g. the function below can be simplified: just take the tag from scmd->request, with no need to check shost_use_blk_mq + nr_hw_queue etc. u16 mpt
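The simplification proposed above can be sketched as follows. This is an illustration of the idea, not the eventual driver patch: with a single hw queue the block layer hands out unique tags across the whole host, and mpt3sas smids are 1-based while blk-mq tags are 0-based, so the mapping is a simple offset.

```c
#include <assert.h>

typedef unsigned short u16;

/* With nr_hw_queues = 1, the driver-private command index (smid) can be
 * derived directly from the request's blk-mq tag -- no private tag
 * allocator and no shost_use_blk_mq/nr_hw_queues branching needed. */
static u16 blk_tag_to_smid(unsigned int blk_tag)
{
    return (u16)(blk_tag + 1);
}
```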
RE: [PATCH 00/10] mpt3sas: full mq support
> > - Later we can explore if nr_hw_queue more than one really add benefit. > > From current limited testing, I don't see major performance boost if > > we have nr_hw_queue more than one. > > > Well, the _actual_ code to support mq is rather trivial, and really serves > as a > good testbed for scsi-mq. > I would prefer to leave it in, and disable it via a module parameter. My thinking is that adding extra code for more than one nr_hw_queue will add maintenance and support overhead; in particular, the IO error handling code becomes complex in the nr_hw_queues > 1 case. If we really want to see that performance boost, we should attempt it and bear the other side effects. For the time being, my choice is to drop nr_hw_queue > 1 support entirely (not even behind a module parameter). > > But in either case, I can rebase the patches to leave any notions of > 'nr_hw_queues' to patch 8 for implementing full mq support. Thanks Hannes. That was just a heads-up; we are not sure when we can submit the upcoming patch set from Broadcom. Maybe we can sync up with you offline in case any rebase is required. > > And we need to discuss how to handle MPI2_FUNCTION_SCSI_IO_REQUEST; > the current method doesn't work with blk-mq. > I really would like to see that go, especially as sg/bsg supports the same > functionality ... >
[PATCH] megaraid_sas: enable intx only if msix request fails
Without this fix, the driver enables the INTx interrupt pin even though MSI-x vectors are enabled. See the lspci output below: DisINTx is unset for the MSI-x setup. lspci -s 85:00.0 -vvv |grep INT |grep Control Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- After applying this fix, the driver enables the INTx interrupt pin only if the legacy interrupt method is required. See the lspci output below: DisINTx is now set for the MSI-x setup. lspci -s 85:00.0 -vvv |grep INT |grep Control Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+ Signed-off-by: Kashyap Desai --- drivers/scsi/megaraid/megaraid_sas_base.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c index 7ac9a9e..82a8ec8 100644 --- a/drivers/scsi/megaraid/megaraid_sas_base.c +++ b/drivers/scsi/megaraid/megaraid_sas_base.c @@ -4990,6 +4990,7 @@ int megasas_set_crash_dump_params(struct megasas_instance *instance, struct pci_dev *pdev; pdev = instance->pdev; + pci_intx(pdev, 1); instance->irq_context[0].instance = instance; instance->irq_context[0].MSIxIndex = 0; if (request_irq(pci_irq_vector(pdev, 0), @@ -5277,10 +5278,6 @@ static int megasas_init_fw(struct megasas_instance *instance) MPI2_REPLY_POST_HOST_INDEX_OFFSET); } - i = pci_alloc_irq_vectors(instance->pdev, 1, 1, PCI_IRQ_LEGACY); - if (i < 0) - goto fail_setup_irqs; - dev_info(&instance->pdev->dev, "firmware supports msix\t: (%d)", fw_msix_count); dev_info(&instance->pdev->dev, @@ -5494,7 +5491,6 @@ static int megasas_init_fw(struct megasas_instance *instance) instance->instancet->disable_intr(instance); fail_init_adapter: megasas_destroy_irqs(instance); -fail_setup_irqs: if (instance->msix_vectors) pci_free_irq_vectors(instance->pdev); instance->msix_vectors = 0; -- 1.8.3.1
RE: megasas: failed to get PD list
Peter: We will check with a Falcon MR controller and get back to you. Can you quickly check using a stable kernel instead of upstream (just to check whether this is a regression)? ~ Kashyap > -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Peter Andersson > Sent: Friday, July 18, 2014 12:01 AM > To: linux-scsi@vger.kernel.org > Subject: megasas: failed to get PD list > > Hi. > I'm having major problems with my "MegaRAID SAS 9240-81" raid controller. > Basically the kernel doesn't find my raid drives. > > Bios version: 4.38.02.0 (January 16 2014) The bios of the raid controller shows > a fully working (and initialized) raid 0. > > Kernel: 3.15.1-031501-generic (ubuntu server 14.04) > > This is the full dmesg regarding the card: > > [0.902723] megasas: 06.803.01.00-rc1 Mon. Mar. 10 17:00:00 PDT 2014 > [0.902788] megasas: 0x1000:0x0073:0x1000:0x9240: bus 3:slot 0:func 0 > [0.903121] megasas: FW now in Ready state > [0.903180] megaraid_sas :03:00.0: irq 80 for MSI/MSI-X > [0.903188] megaraid_sas :03:00.0: [scsi0]: FW supports<0> MSIX > vector,Online CPUs: <12>,Current MSIX <1> > [0.924336] megasas_init_mfi: fw_support_ieee=67108864 > [0.924372] megasas: INIT adapter done > [ 72.855404] megasas: failed to get PD list > > lspci: > 03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 > [Falcon] (rev 03) > > megaraidsas-status: > -- Arrays informations -- > -- ID | Type | Size | Status > > -- Disks informations > -- ID | Model | Status | Warnings > > After the drives are initialized i also get this message: > Virtual drive handled by bios > > That's all i can think to be of interest to you. > Could anyone give me some pointers on what could be wrong? > > Thanks!
> > /Peter > -- > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the > body of a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2 RESEND 14/23] megaraid: Use pci_enable_msix_range() instead of pci_enable_msix()
> -Original Message- > From: Alexander Gordeev [mailto:agord...@redhat.com] > Sent: Monday, August 11, 2014 1:40 PM > To: Kashyap Desai; Neela Syam Kolli > Cc: linux-scsi@vger.kernel.org; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; Christoph Hellwig > Subject: Re: [PATCH v2 RESEND 14/23] megaraid: Use > pci_enable_msix_range() instead of pci_enable_msix() > > On Wed, Jul 16, 2014 at 08:05:18PM +0200, Alexander Gordeev wrote: > > As result of deprecation of MSI-X/MSI enablement functions > > pci_enable_msix() and pci_enable_msi_block() all drivers using these > > two interfaces need to be updated to use the new > > pci_enable_msi_range() or pci_enable_msi_exact() and > > pci_enable_msix_range() or pci_enable_msix_exact() interfaces. > > Kashyap, Neela, > > Could you please reveiw this patch? I have reviewed this patch; please consider it acked by me. Acked-by: Kashyap Desai > > Thanks! > > > Signed-off-by: Alexander Gordeev > > Cc: Neela Syam Kolli > > Cc: linux-scsi@vger.kernel.org > > Cc: linux-...@vger.kernel.org > > --- > > drivers/scsi/megaraid/megaraid_sas_base.c | 20 +++- > > 1 files changed, 7 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > > b/drivers/scsi/megaraid/megaraid_sas_base.c > > index ba06102..7a4e75e 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > > @@ -4105,17 +4105,11 @@ static int megasas_init_fw(struct > megasas_instance *instance) > > (unsigned int)num_online_cpus()); > > for (i = 0; i < instance->msix_vectors; i++) > > instance->msixentry[i].entry = i; > > - i = pci_enable_msix(instance->pdev, instance->msixentry, > > - instance->msix_vectors); > > - if (i >= 0) { > > - if (i) { > > - if (!pci_enable_msix(instance->pdev, > > -instance->msixentry, i)) > > - instance->msix_vectors = i; > > - else > > - instance->msix_vectors = 0; > > - } > > - } else > > + i = pci_enable_msix_range(instance->pdev, instance- > 
>msixentry, > > + 1, instance->msix_vectors); > > + if (i) > > + instance->msix_vectors = i; > > + else > > instance->msix_vectors = 0; > > > > dev_info(&instance->pdev->dev, "[scsi%d]: FW supports" > > @@ -5135,8 +5129,8 @@ megasas_resume(struct pci_dev *pdev) > > > > /* Now re-enable MSI-X */ > > if (instance->msix_vectors && > > - pci_enable_msix(instance->pdev, instance->msixentry, > > - instance->msix_vectors)) > > + pci_enable_msix_exact(instance->pdev, instance->msixentry, > > + instance->msix_vectors)) > > goto fail_reenable_msix; > > > > switch (instance->pdev->device) { > > -- > > 1.7.7.6 > > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
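The conversion above relies on `pci_enable_msix_range(pdev, entries, minvec, maxvec)` returning either the number of vectors actually allocated (positive, between minvec and maxvec) or a negative errno; with minvec >= 1 it never returns 0. A hedged sketch of folding that result into the driver's vector count; note that a strict `> 0` test is what the semantics call for, since a bare non-zero check would let a negative errno through as a vector count.

```c
#include <assert.h>

/*
 * Fold the pci_enable_msix_range() return value into msix_vectors:
 * a positive return is the allocated vector count, a negative errno
 * means fall back to 0 vectors (legacy interrupt path).
 */
static int msix_vectors_from_ret(int ret)
{
    return ret > 0 ? ret : 0;
}
```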
RE: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
> -Original Message- > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > Sent: Friday, July 18, 2014 3:43 PM > To: James Bottomley; linux-scsi@vger.kernel.org > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen; Robert > Elliott; Webb Scales; linux-ker...@vger.kernel.org > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path. > > This patch adds support for an alternate I/O path in the scsi midlayer which > uses the blk-mq infrastructure instead of the legacy request code. > > Use of blk-mq is fully transparent to drivers, although for now a host > template field is provided to opt out of blk-mq usage in case any unforseen > incompatibilities arise. > > In general replacing the legacy request code with blk-mq is a simple and > mostly mechanical transformation. The biggest exception is the new code > that deals with the fact the I/O submissions in blk-mq must happen from > process context, which slightly complicates the I/O completion handler. > The second biggest differences is that blk-mq is build around the concept of > preallocated requests that also include driver specific data, which in SCSI > context means the scsi_cmnd structure. This completely avoids dynamic > memory allocations for the fast path through I/O submission. > > Due the preallocated requests the MQ code path exclusively uses the host- > wide shared tag allocator instead of a per-LUN one. This only affects drivers > actually using the block layer provided tag allocator instead of their own. > Unlike the old path blk-mq always provides a tag, although drivers don't have > to use it. > > For now the blk-mq path is disable by defauly and must be enabled using the > "use_blk_mq" module parameter. Once the remaining work in the block > layer to make blk-mq more suitable for slow devices is complete I hope to > make it the default and eventually even remove the old code path. 
> > Based on the earlier scsi-mq prototype by Nicholas Bellinger. > > Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking and > various sugestions and code contributions. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Hannes Reinecke > Reviewed-by: Webb Scales > Acked-by: Jens Axboe > Tested-by: Bart Van Assche > Tested-by: Robert Elliott > --- > drivers/scsi/hosts.c | 35 +++- > drivers/scsi/scsi.c | 5 +- > drivers/scsi/scsi_lib.c | 464 > -- > drivers/scsi/scsi_priv.h | 3 + > drivers/scsi/scsi_scan.c | 5 +- > drivers/scsi/scsi_sysfs.c | 2 + > include/scsi/scsi_host.h | 18 +- > include/scsi/scsi_tcq.h | 28 ++- > 8 files changed, 488 insertions(+), 72 deletions(-) > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index 0632eee..6de80e3 > 100644 > --- a/drivers/scsi/hosts.c > +++ b/drivers/scsi/hosts.c > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host > *shost, struct device *dev, > goto fail; > } > > + if (shost_use_blk_mq(shost)) { > + error = scsi_mq_setup_tags(shost); > + if (error) > + goto fail; > + } > + > + /* > + * Note that we allocate the freelist even for the MQ case for now, > + * as we need a command set aside for scsi_reset_provider. Having > + * the full host freelist and one command available for that is a > + * little heavy-handed, but avoids introducing a special allocator > + * just for this. Eventually the structure of scsi_reset_provider > + * will need a major overhaul. > + */ > error = scsi_setup_command_freelist(shost); > if (error) > - goto fail; > + goto out_destroy_tags; > + > > if (!shost->shost_gendev.parent) > shost->shost_gendev.parent = dev ? 
dev : &platform_bus; > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, > struct device *dev, > > error = device_add(&shost->shost_gendev); > if (error) > - goto out; > + goto out_destroy_freelist; > > pm_runtime_set_active(&shost->shost_gendev); > pm_runtime_enable(&shost->shost_gendev); > @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host > *shost, struct device *dev, > device_del(&shost->shost_dev); > out_del_gendev: > device_del(&shost->shost_gendev); > - out: > + out_destroy_freelist: > scsi_destroy_command_freelist(shost); > + out_destroy_tags: > + if (shost_use_blk_mq(shost)) > + scsi_mq_destroy_tags(shost); > fail: > return error; > } > @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device > *dev) > } > > scsi_destroy_command_freelist(shost); > - if (shost->bqt) > - blk_free_tags(shost->bqt); > + if (shost_use_blk_mq(shost)) { > + if (shost->tag_set.tags) > + scsi_mq_destroy_tag
Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
On Tue, Aug 19, 2014 at 3:51 AM, Kashyap Desai wrote: > > > -Original Message- > > From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > > ow...@vger.kernel.org] On Behalf Of Christoph Hellwig > > Sent: Friday, July 18, 2014 3:43 PM > > To: James Bottomley; linux-scsi@vger.kernel.org > > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen; > Robert > > Elliott; Webb Scales; linux-ker...@vger.kernel.org > > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path. > > > > This patch adds support for an alternate I/O path in the scsi midlayer > which > > uses the blk-mq infrastructure instead of the legacy request code. > > > > Use of blk-mq is fully transparent to drivers, although for now a host > > template field is provided to opt out of blk-mq usage in case any > unforseen > > incompatibilities arise. > > > > In general replacing the legacy request code with blk-mq is a simple and > > mostly mechanical transformation. The biggest exception is the new code > > that deals with the fact the I/O submissions in blk-mq must happen from > > process context, which slightly complicates the I/O completion handler. > > The second biggest differences is that blk-mq is build around the > concept of > > preallocated requests that also include driver specific data, which in > SCSI > > context means the scsi_cmnd structure. This completely avoids dynamic > > memory allocations for the fast path through I/O submission. > > > > Due the preallocated requests the MQ code path exclusively uses the > host- > > wide shared tag allocator instead of a per-LUN one. This only affects > drivers > > actually using the block layer provided tag allocator instead of their > own. > > Unlike the old path blk-mq always provides a tag, although drivers don't > have > > to use it. > > > > For now the blk-mq path is disable by defauly and must be enabled using > the > > "use_blk_mq" module parameter. 
Once the remaining work in the block > > layer to make blk-mq more suitable for slow devices is complete I hope > to > > make it the default and eventually even remove the old code path. > > > > Based on the earlier scsi-mq prototype by Nicholas Bellinger. > > > > Thanks to Bart Van Assche and Robert Elliot for testing, benchmarking > and > > various sugestions and code contributions. > > > > Signed-off-by: Christoph Hellwig > > Reviewed-by: Hannes Reinecke > > Reviewed-by: Webb Scales > > Acked-by: Jens Axboe > > Tested-by: Bart Van Assche > > Tested-by: Robert Elliott > > --- > > drivers/scsi/hosts.c | 35 +++- > > drivers/scsi/scsi.c | 5 +- > > drivers/scsi/scsi_lib.c | 464 > > -- > > drivers/scsi/scsi_priv.h | 3 + > > drivers/scsi/scsi_scan.c | 5 +- > > drivers/scsi/scsi_sysfs.c | 2 + > > include/scsi/scsi_host.h | 18 +- > > include/scsi/scsi_tcq.h | 28 ++- > > 8 files changed, 488 insertions(+), 72 deletions(-) > > > > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index > 0632eee..6de80e3 > > 100644 > > --- a/drivers/scsi/hosts.c > > +++ b/drivers/scsi/hosts.c > > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host > > *shost, struct device *dev, > > goto fail; > > } > > > > + if (shost_use_blk_mq(shost)) { > > + error = scsi_mq_setup_tags(shost); > > + if (error) > > + goto fail; > > + } > > + > > + /* > > + * Note that we allocate the freelist even for the MQ case for > now, > > + * as we need a command set aside for scsi_reset_provider. Having > > + * the full host freelist and one command available for that is a > > + * little heavy-handed, but avoids introducing a special allocator > > + * just for this. Eventually the structure of scsi_reset_provider > > + * will need a major overhaul. > > + */ > > error = scsi_setup_command_freelist(shost); > > if (error) > > - goto fail; > > + goto out_destroy_tags; > > + > > > > if (!shost->shost_gendev.parent) > > shost->shost_gendev.parent = dev ? 
dev : &platform_bus; > > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, > > struct device *dev, > > > > error = device_add(&shost->shost_gendev); > > if (erro
Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
On Tue, Aug 19, 2014 at 9:36 PM, Christoph Hellwig wrote: > On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote: >> I read this comment and find that very few drivers are using this >> cmd_list. I think if we remove this cmd_list, performance will scale as I >> am seeing major contention in this lock. >> Just thought to ping you to see if this is known limitation for now or any >> plan to change this lock in near future ? > > Removing the lock entirely and pushing the list into the two drivers > using it is on my TODO list. Bart actually suggested keeping the code in the > SCSI core and having a flag to enabled. Given that I'm too busy to get my > full > version done in time, it might be a good idea if someone picks up Barts > idea. Can you send me a patch to add a enable_cmd_list flag to the host > template and only enable it for aacraid and dpt_i2o? > Sure. I will work on the relevant code change and post a patch for review. -- Device Driver Developer @ Avagotech Kashyap D. Desai Note - my new email address kashyap.de...@avagotech.com -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to reduce lock contention
> -Original Message- > From: Bart Van Assche [mailto:bvanass...@acm.org] > Sent: Wednesday, August 20, 2014 4:18 PM > To: kashyap.de...@avagotech.com; linux-scsi@vger.kernel.org > Cc: aacr...@adaptec.com; elli...@hp.com; jbottom...@parallels.com; > h...@infradead.org > Subject: Re: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to > reduce > lock contention > > On 08/19/14 20:17, kashyap.de...@avagotech.com wrote: > > + if (shost->hostt->enable_cmd_list) { > > This code is in the hot path which means that caching "enable_cmd_list" > in struct Scsi_Host (as is done for many other SCSI host parameters) > probably > will (slightly) improve performance further. Otherwise this patch looks > fine to > me. I will send an updated patch that caches the host template field "enable_cmd_list" in struct Scsi_Host for faster access. Kashyap > > Bart.
RE: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to reduce lock contention
> -Original Message- > From: Christoph Hellwig [mailto:h...@infradead.org] > Sent: Wednesday, August 20, 2014 6:11 PM > To: Kashyap Desai > Cc: Bart Van Assche; linux-scsi@vger.kernel.org; aacr...@adaptec.com; > elli...@hp.com; jbottom...@parallels.com; h...@infradead.org > Subject: Re: [PATCH] scsi.mq:Added enable_cmd_list flags in hostt to reduce > lock contention > > On Wed, Aug 20, 2014 at 06:08:37PM +0530, Kashyap Desai wrote: > > > This code is in the hot path which means that caching "enable_cmd_list" > > > in struct Scsi_Host (as is done for many other SCSI host parameters) > > > probably will (slightly) improve performance further. Otherwise this > > > patch looks fine to me. > > I will send updated patch which will cache host template field > > "enable_cmd_list" for faster access in Scsi Host. > > Thanks. It might be worth to only set in in the host in fact. Fine. I will remove the host template entry and add code in the aacraid and dpt_i2o drivers to set that value directly in struct Scsi_Host. > > Also please just remove the code about lock contention in scsi_mq_prep_fn > - the XXX really doesn't apply anymore and I think the code should be self- > explaining enough to not need a comment. I will do this. > > Otherwise the patch looks good to me.
RE: [PATCH 02/11] megaraid_sas : Use writeq for 64bit pci write to avoid spinlock overhead
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Tuesday, September 09, 2014 7:01 PM > To: sumit.sax...@avagotech.com; linux-scsi@vger.kernel.org > Cc: martin.peter...@oracle.com; h...@infradead.org; > jbottom...@parallels.com; kashyap.de...@avagotech.com; > aradf...@gmail.com > Subject: Re: [PATCH 02/11] megaraid_sas : Use writeq for 64bit pci write to > avoid spinlock overhead > > On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: > > Use writeq() for 64bit PCI write instead of writel() to avoid additional lock > overhead. > > > > Signed-off-by: Sumit Saxena > > Signed-off-by: Kashyap Desai > > --- > > drivers/scsi/megaraid/megaraid_sas_fusion.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > > index 57b47fe..c69c1ac 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > > @@ -1065,6 +1065,13 @@ megasas_fire_cmd_fusion(struct > megasas_instance *instance, > > u32 req_desc_hi, > > struct megasas_register_set __iomem *regs) > > Hi Sumit, > the fn params are a bit confusing req_desc_lo is of type dma_addr_t and > req_desc_hi is u32, is it possible to unite it in the future? Agree. We should make changes here. We will do it in separate patch. Originally fire_cmd() was written for MFI controller and carry forward for all generation. In MFI it use second argument as 32 bit address and third argument as frame count, but later in Fusion adapter it started using differently. > > > { > > +#if defined(writeq) && defined(CONFIG_64BIT) > > On a similar place mpt2sas(_base_writeq) uses only "#ifndef writeq" > if it's incorrect fix it there too or remove the CONFIG_64 here We would like to change at mpt2sas as we have all the code with below check for writeq() Original discuss was started when we submitted this change in mpt2sas, but we have delta from day-1. 
The LSI/Avago internal source has the "#if defined(writeq) && defined(CONFIG_64BIT)" check in mpt2sas. I think writeq() is now implemented on all architectures, so we could safely remove the #if defined(writeq) check. But we can keep the check as it is so that older distributions can take direct advantage without our maintaining a separate patch. > > > + u64 req_data = 0; > > + > > + req_data = req_desc_hi; > > + req_data = ((req_data << 32) | (u32)req_desc_lo); > > This seems to be critical path (you are removing an spinlock to avoid > overhead), so why do you have three consecutive assignments to the same > variable? > (~(u64 req_data = r_hi << 32 | r_lo)) Agreed. We will make this change and re-submit the patch to address it. > > Cheers, > Tomas > > > + writeq(le64_to_cpu(req_data), &(regs)- > >inbound_low_queue_port); > > +#else > > unsigned long flags; > > > > spin_lock_irqsave(&instance->hba_lock, flags); @@ -1072,6 +1079,7 > @@ > > megasas_fire_cmd_fusion(struct megasas_instance *instance, > > writel(le32_to_cpu(req_desc_lo), &(regs)- > >inbound_low_queue_port); > > writel(le32_to_cpu(req_desc_hi), &(regs)- > >inbound_high_queue_port); > > spin_unlock_irqrestore(&instance->hba_lock, flags); > > +#endif > > }
Re: [PATCH 08/11] megaraid_sas : Add module parameter to disable IRQ-CPU affinity hint
On Wed, Sep 10, 2014 at 7:46 PM, Tomas Henzl wrote: > On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: >> For certain deployment, we may need to disable irq cpu affinity hint. >> This module parameter provides option for use to disable irq cpu affinity >> hint >> and allow irqbalancer to handle the rest. > > Only curious , in which environments causes this which issues? This is for deployment cases where CPUs are managed via cgroups and the user knows how they are going to use their system resources. For those cases the driver's affinity hinting may not be useful, and for simplicity customers may prefer this kind of module parameter. Kashyap > > Thanks, > Tomas > >> >> Signed-off-by: Sumit Saxena >> Signed-off-by: Kashyap Desai >> --- >> drivers/scsi/megaraid/megaraid_sas_base.c | 60 >> +++ >> 1 file changed, 38 insertions(+), 22 deletions(-) >> >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c >> b/drivers/scsi/megaraid/megaraid_sas_base.c >> index 07255c1..086beee 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c >> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c >> @@ -89,6 +89,10 @@ module_param(resetwaittime, int, S_IRUGO); >> MODULE_PARM_DESC(resetwaittime, "Wait time in seconds after I/O timeout " >>"before resetting adapter. 
Default: 180"); >> >> +int smp_affinity_enable = 1; >> +module_param(smp_affinity_enable, int, S_IRUGO); >> +MODULE_PARM_DESC(smp_affinity_enable, "SMP affinity feature enable/disbale >> Default: enable(1)"); >> + >> MODULE_LICENSE("GPL"); >> MODULE_VERSION(MEGASAS_VERSION); >> MODULE_AUTHOR("megaraidli...@lsi.com"); >> @@ -5160,8 +5164,9 @@ retry_irq_register: >> printk(KERN_DEBUG "megasas: Failed to " >> "register IRQ for vector %d.\n", i); >> for (j = 0; j < i; j++) { >> - irq_set_affinity_hint( >> - instance->msixentry[j].vector, >> NULL); >> + if (smp_affinity_enable) >> + irq_set_affinity_hint( >> + >> instance->msixentry[j].vector, NULL); >> free_irq( >> instance->msixentry[j].vector, >> &instance->irq_context[j]); >> @@ -5170,11 +5175,14 @@ retry_irq_register: >> instance->msix_vectors = 0; >> goto retry_irq_register; >> } >> - if >> (irq_set_affinity_hint(instance->msixentry[i].vector, >> - get_cpu_mask(cpu))) >> - dev_err(&instance->pdev->dev, "Error setting" >> - "affinity hint for cpu %d\n", cpu); >> - cpu = cpumask_next(cpu, cpu_online_mask); >> + if (smp_affinity_enable) { >> + if >> (irq_set_affinity_hint(instance->msixentry[i].vector, >> + get_cpu_mask(cpu))) >> + dev_err(&instance->pdev->dev, >> + "Error setting affinity hint " >> + "for cpu %d\n", cpu); >> + cpu = cpumask_next(cpu, cpu_online_mask); >> + } >> } >> } else { >> instance->irq_context[0].instance = instance; >> @@ -5233,8 +5241,9 @@ retry_irq_register: >> instance->instancet->disable_intr(instance); >> if (instance->msix_vectors) >> for (i = 0; i < instance->msix_vectors; i++) { >> - irq_set_affinity_hint( >> - instance->msixentry[i].vector, NULL); >> + if (smp_affinity_enable) >> + irq_set_affinity_hint( >> + instance->msixentry[i].vector, NULL); >> free_irq(instance->msixentry[i].vector, >>&instance->irq_context[i]); >> } >&
Re: [PATCH 05/11] megaraid_sas : Extended VD support
On Wed, Sep 10, 2014 at 7:20 PM, Tomas Henzl wrote: > > On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: > > Current MegaRAID firmware and hence the driver only supported 64VDs. > > E.g: If the user wants to create more than 64VD on a controller, > > it is not possible on current firmware/driver. > > > > New feature and requirement to support upto 256VD, firmware/driver/apps > > need changes. > > In addition to that there must be a backward compatibility of the new > > driver with the > > older firmware and vice versa. > > > > RAID map is the interface between Driver and FW to fetch all required > > fields(attributes) for each Virtual Drives. > > In the earlier design driver was using the FW copy of RAID map where as > > in the new design the Driver will keep the RAID map copy of its own; on > > which > > it will operate for any raid map access in fast path. > > > > Local driver raid map copy will provide ease of access through out the code > > and provide generic interface for future FW raid map changes. > > > > For the backward compatibility driver will notify FW that it supports 256VD > > to the FW in driver capability field. > > Based on the controller properly returned by the FW, the Driver will know > > whether it supports 256VD or not and will copy the RAID map accordingly. > > > > At any given time, driver will always have old or new Raid map. > > So with this changes, driver can also work in host lock less mode. Please > > see next patch which enable host lock less mode for megaraid_sas driver. 
> > > > Signed-off-by: Sumit Saxena > > Signed-off-by: Kashyap Desai > > --- > > drivers/scsi/megaraid/megaraid_sas.h| 73 +++--- > > drivers/scsi/megaraid/megaraid_sas_base.c | 205 > > > > drivers/scsi/megaraid/megaraid_sas_fp.c | 195 > > ++ > > drivers/scsi/megaraid/megaraid_sas_fusion.c | 118 > > drivers/scsi/megaraid/megaraid_sas_fusion.h | 95 - > > 5 files changed, 502 insertions(+), 184 deletions(-) > > > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h > > b/drivers/scsi/megaraid/megaraid_sas.h > > index e0f03e2..5dedf09 100644 > > --- a/drivers/scsi/megaraid/megaraid_sas.h > > +++ b/drivers/scsi/megaraid/megaraid_sas.h > > @@ -390,7 +390,6 @@ enum MR_LD_QUERY_TYPE { > > #define MR_EVT_FOREIGN_CFG_IMPORTED 0x00db > > #define MR_EVT_LD_OFFLINE 0x00fc > > #define MR_EVT_CTRL_HOST_BUS_SCAN_REQUESTED 0x0152 > > -#define MAX_LOGICAL_DRIVES 64 > > > > enum MR_PD_STATE { > > MR_PD_STATE_UNCONFIGURED_GOOD = 0x00, > > @@ -468,14 +467,14 @@ struct MR_LD_LIST { > > u8 state; > > u8 reserved[3]; > > u64 size; > > - } ldList[MAX_LOGICAL_DRIVES]; > > + } ldList[MAX_LOGICAL_DRIVES_EXT]; > > } __packed; > > > > struct MR_LD_TARGETID_LIST { > > u32 size; > > u32 count; > > u8 pad[3]; > > - u8 targetId[MAX_LOGICAL_DRIVES]; > > + u8 targetId[MAX_LOGICAL_DRIVES_EXT]; > > }; > > > > > > @@ -941,6 +940,15 @@ struct megasas_ctrl_info { > > * HA cluster information > > */ > > struct { > > +#if defined(__BIG_ENDIAN_BITFIELD) > > + u32 reserved:26; > > + u32 premiumFeatureMismatch:1; > > + u32 ctrlPropIncompatible:1; > > + u32 fwVersionMismatch:1; > > + u32 hwIncompatible:1; > > + u32 peerIsIncompatible:1; > > + u32 peerIsPresent:1; > > +#else > > u32 peerIsPresent:1; > > u32 peerIsIncompatible:1; > > u32 hwIncompatible:1; > > @@ -948,6 +956,7 @@ struct megasas_ctrl_info { > > u32 ctrlPropIncompatible:1; > > u32 premiumFeatureMismatch:1; > > u32 reserved:26; > > +#endif > > } cluster; > > > > char clusterId[16]; /*7D4h */ > > @@ -962,9 +971,17 @@ struct megasas_ctrl_info { > > #if 
defined(__BIG_ENDIAN_BITFIELD) > > u32 reserved:25; > > u32 supportCrashDump:1; > > - u32 reserved1:6; > > + u32 supportMaxExtLDs:1; > > +
Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix
On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl wrote: > On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: >> Problem statement: >> MFI link list in megaraid_sas driver is used from mfi-mpt pass-through >> commands. >> This list can be corrupted due to many possible race conditions in driver and >> eventually we may see kernel panic. >> >> One example - >> MFI frame is freed from calling process as driver send command via polling >> method and interrupt >> for that command comes after driver free mfi frame (actually even after some >> other context reuse >> the mfi frame). When driver receive MPT frame in ISR, driver will be using >> the index of MFI and >> access that MFI frame and finally in-used MFI frame’s list will be corrupted. >> >> High level description of new solution - >> Free MFI and MPT command from same context. >> Free both the command either from process (from where mfi-mpt pass-through >> was called) or from >> ISR context. Do not split freeing of MFI and MPT, because it creates the >> race condition which >> will do MFI/MPT list corruption. >> >> Renamed the cmd_pool_lock which is used in instance as well as fusion with >> below name. >> mfi_pool_lock and mpt_pool_lock to add more code readability. 
>> >> Signed-off-by: Sumit Saxena >> Signed-off-by: Kashyap Desai >> --- >> drivers/scsi/megaraid/megaraid_sas.h| 25 +++- >> drivers/scsi/megaraid/megaraid_sas_base.c | 196 >> >> drivers/scsi/megaraid/megaraid_sas_fusion.c | 95 ++ >> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 +- >> 4 files changed, 235 insertions(+), 83 deletions(-) >> >> diff --git a/drivers/scsi/megaraid/megaraid_sas.h >> b/drivers/scsi/megaraid/megaraid_sas.h >> index 156d4b9..f99db18 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas.h >> +++ b/drivers/scsi/megaraid/megaraid_sas.h >> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info { >> >> #define VD_EXT_DEBUG 0 >> >> +enum MR_MFI_MPT_PTHR_FLAGS { >> + MFI_MPT_DETACHED = 0, >> + MFI_LIST_ADDED = 1, >> + MFI_MPT_ATTACHED = 2, >> +}; >> + >> /* Frame Type */ >> #define IO_FRAME 0 >> #define PTHRU_FRAME 1 >> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info { >> #define MEGASAS_IOCTL_CMD0 >> #define MEGASAS_DEFAULT_CMD_TIMEOUT 90 >> #define MEGASAS_THROTTLE_QUEUE_DEPTH 16 >> - >> +#define MEGASAS_BLOCKED_CMD_TIMEOUT 60 >> /* >> * FW reports the maximum of number of commands that it can accept (maximum >> * commands that can be outstanding) at any time. 
The driver must report a >> @@ -1652,7 +1658,7 @@ struct megasas_instance { >> struct megasas_cmd **cmd_list; >> struct list_head cmd_pool; >> /* used to sync fire the cmd to fw */ >> - spinlock_t cmd_pool_lock; >> + spinlock_t mfi_pool_lock; >> /* used to sync fire the cmd to fw */ >> spinlock_t hba_lock; >> /* used to synch producer, consumer ptrs in dpc */ >> @@ -1839,6 +1845,11 @@ struct megasas_cmd { >> >> struct list_head list; >> struct scsi_cmnd *scmd; >> + >> + void *mpt_pthr_cmd_blocked; >> + atomic_t mfi_mpt_pthr; >> + u8 is_wait_event; >> + >> struct megasas_instance *instance; >> union { >> struct { >> @@ -1927,4 +1938,14 @@ int megasas_set_crash_dump_params(struct >> megasas_instance *instance, >> void megasas_free_host_crash_buffer(struct megasas_instance *instance); >> void megasas_fusion_crash_dump_wq(struct work_struct *work); >> >> +void megasas_return_cmd_fusion(struct megasas_instance *instance, >> + struct megasas_cmd_fusion *cmd); >> +int megasas_issue_blocked_cmd(struct megasas_instance *instance, >> + struct megasas_cmd *cmd, int timeout); >> +void __megasas_return_cmd(struct megasas_instance *instance, >> + struct megasas_cmd *cmd); >> + >> +void megasas_return_mfi_mpt_pthr(struct megasas_instance *instance, >> + struct megasas_cmd *cmd_mfi, struct megasas_cmd_fusion *cmd_fusion); >> + >> #endif /*LSI_MEGARAID_SAS_H */ >> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c >> b/drivers/scsi/megaraid/megaraid_sas_base.c >> index 086beee..50d69eb 100644 >> --- a/drivers/scsi/megaraid/megaraid_sas_base.c >> +++ b/drivers/scsi/meg
RE: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support
> -Original Message- > From: Vivek Goyal [mailto:vgo...@redhat.com] > Sent: Wednesday, September 10, 2014 9:17 PM > To: Tomas Henzl > Cc: Elliott, Robert (Server Storage); Sumit Saxena; linux-scsi@vger.kernel.org; > martin.peter...@oracle.com; h...@infradead.org; > jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal > Schmidt; am...@mellanox.com; Jens Axboe > (ax...@kernel.dk); scame...@beardog.cce.hp.com > Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature > support > > On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote: > > On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote: > > >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- > > >> ow...@vger.kernel.org] On Behalf Of Sumit Saxena > > >> > > >>> From: Tomas Henzl [mailto:the...@redhat.com] > > >>> > > >>> With several controllers in a system this may take a lot memory, > > >>> could you also in case when a kdump kernel is running lower it, by > > >>> not using this feature? > > >>> > > >> Agreed, we will disable this feature for kdump kernel by adding > > >> "reset_devices" global varaiable. > > >> That check is required for only one place, throughout the code, > > >> this feature will remain disabled. Code snippet for the same- > > >> > > >> instance->crash_dump_drv_support = (!reset_devices) && > > >> crashdump_enable && > > >> instance->crash_dump_fw_support && > > >> instance->crash_dump_buf); > > >> if(instance->crash_dump_drv_support) { > > >> printk(KERN_INFO "megaraid_sas: FW Crash dump is > > >> supported\n"); > > >> megasas_set_crash_dump_params(instance, > > >> MR_CRASH_BUF_TURN_OFF); > > >> > > >> } else { > > >> .. > > >> } > > > Network drivers have been running into similar problems. > > > > > > There's a new patch from Amir coming through net-next to make > > > is_kdump_kernel() (in crash_dump.h) accessible to modules. > > > That may be a better signal than reset_devices that the driver > > > should use minimal resources. 
> > > > > > http://comments.gmane.org/gmane.linux.network/324737 > > > > > > I'm not sure about the logistics of a SCSI patch depending on a > > > net-next patch. > > > > Probably better to start with reset_devices and switch to > > is_kdump_kernel() later. > > This is not a discussion about reset_devices versus is_kdump_kernel, > > but while it looks good to have it distinguished - is the > > reset_devices actually used anywhere else than in kdump kernel? > > I think usage of reset_devices for lowering memory footprint of driver is > plain wrong. It tells driver to only reset the device as BIOS might not have > done it right or we skipped BIOS completely. > > Using is_kdump_kernel() is also not perfect either but atleast better than > reset_devices. We will use is_kdump_kernel() to keep this feature disabled in the kdump kernel. The MegaRAID driver will not allocate host memory (which is used to collect the complete FW crash image) unless the associated FW has crashed or an I/O has timed out; the driver does the allocation only when required. Also, it breaks out of the function immediately if a memory allocation fails and operates on the memory that is available. To be on the safer side, we can disable this feature in kdump mode. ~ Kashyap > > Thanks > Vivek
Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support
On Thu, Sep 11, 2014 at 4:50 PM, Tomas Henzl wrote: > On 09/11/2014 11:02 AM, Kashyap Desai wrote: >>> -Original Message- >>> From: Vivek Goyal [mailto:vgo...@redhat.com] >>> Sent: Wednesday, September 10, 2014 9:17 PM >>> To: Tomas Henzl >>> Cc: Elliott, Robert (Server Storage); Sumit Saxena; >> linux-scsi@vger.kernel.org; >>> martin.peter...@oracle.com; h...@infradead.org; >>> jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal >>> Schmidt; am...@mellanox.com; Jens Axboe >>> (ax...@kernel.dk); scame...@beardog.cce.hp.com >>> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature >>> support >>> >>> On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote: >>>> On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote: >>>>>> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- >>>>>> ow...@vger.kernel.org] On Behalf Of Sumit Saxena >>>>>> >>>>>>> From: Tomas Henzl [mailto:the...@redhat.com] >>>>>>> >>>>>>> With several controllers in a system this may take a lot memory, >>>>>>> could you also in case when a kdump kernel is running lower it, by >>>>>>> not using this feature? >>>>>>> >>>>>> Agreed, we will disable this feature for kdump kernel by adding >>>>>> "reset_devices" global varaiable. >>>>>> That check is required for only one place, throughout the code, >>>>>> this feature will remain disabled. Code snippet for the same- >>>>>> >>>>>> instance->crash_dump_drv_support = (!reset_devices) && >>>>>> crashdump_enable && >>>>>> instance->crash_dump_fw_support && >>>>>> instance->crash_dump_buf); >>>>>> if(instance->crash_dump_drv_support) { >>>>>> printk(KERN_INFO "megaraid_sas: FW Crash dump is >>>>>> supported\n"); >>>>>> megasas_set_crash_dump_params(instance, >>>>>> MR_CRASH_BUF_TURN_OFF); >>>>>> >>>>>> } else { >>>>>> .. >>>>>> } >>>>> Network drivers have been running into similar problems. 
>>>>> >>>>> There's a new patch from Amir coming through net-next to make >>>>> is_kdump_kernel() (in crash_dump.h) accessible to modules. >>>>> That may be a better signal than reset_devices that the driver >>>>> should use minimal resources. >>>>> >>>>> http://comments.gmane.org/gmane.linux.network/324737 >>>>> >>>>> I'm not sure about the logistics of a SCSI patch depending on a >>>>> net-next patch. >>>> Probably better to start with reset_devices and switch to >>>> is_kdump_kernel() later. >>>> This is not a discussion about reset_devices versus is_kdump_kernel, >>>> but while it looks good to have it distinguished - is the >>>> reset_devices actually used anywhere else than in kdump kernel? >>> I think usage of reset_devices for lowering memory footprint of driver >> is >>> plain wrong. It tells driver to only reset the device as BIOS might not >> have >>> done it right or we skipped BIOS completely. >>> >>> Using is_kdump_kernel() is also not perfect either but atleast better >> than >>> reset_devices. >> We will use is_kdump_kernel() not to enable this feature in Kdump kernel. > > OK, just keep in mind that the is_kdump_kernel() is not yet in mainline. Sure, I read that part, but when I checked the kernel source I found that is_kdump_kernel() is available: http://lxr.free-electrons.com/source/include/linux/crash_dump.h#L55 Let's do one thing - we will submit a separate patch for this to avoid any confusion. > >> >> MegaRaid Driver will not allocate Host memory (which is used to collect >> complete FW crash image) until and unless associated FW is crashed/IO >> timeout. > > This is new for the driver? A previous reply stated that the driver will > allocate 1MB for each MR controller at the 'start of the day'. > If I misunderstood it somehow then nothing is needed. This is correct. A 1MB DMA buffer per controller will be allocated irrespective of whether the FW has crashed or not. When the FW actually crashes, the driver will go and try to allocate up to 512MB of memory. 
I thought you were asking about the 512 MB of memory. > >> Driver do allocation only if required. Also, it will break the function >> immediately after memory allocation is failed and operate on available >> memory. >> >> To be on safer side, we can disable this feature in Kdump mode. >> >> ~ Kashyap >>> Thanks >>> Vivek > -- Device Driver Developer @ Avagotech Kashyap D. Desai Note - my new email address kashyap.de...@avagotech.com
Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix
On Thu, Sep 11, 2014 at 5:53 PM, Tomas Henzl wrote: > On 09/11/2014 04:48 AM, Kashyap Desai wrote: >> On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl wrote: >>> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: >>>> Problem statement: >>>> MFI link list in megaraid_sas driver is used from mfi-mpt pass-through >>>> commands. >>>> This list can be corrupted due to many possible race conditions in driver >>>> and >>>> eventually we may see kernel panic. >>>> >>>> One example - >>>> MFI frame is freed from calling process as driver send command via polling >>>> method and interrupt >>>> for that command comes after driver free mfi frame (actually even after >>>> some other context reuse >>>> the mfi frame). When driver receive MPT frame in ISR, driver will be using >>>> the index of MFI and >>>> access that MFI frame and finally in-used MFI frame’s list will be >>>> corrupted. >>>> >>>> High level description of new solution - >>>> Free MFI and MPT command from same context. >>>> Free both the command either from process (from where mfi-mpt pass-through >>>> was called) or from >>>> ISR context. Do not split freeing of MFI and MPT, because it creates the >>>> race condition which >>>> will do MFI/MPT list corruption. >>>> >>>> Renamed the cmd_pool_lock which is used in instance as well as fusion with >>>> below name. >>>> mfi_pool_lock and mpt_pool_lock to add more code readability. 
>>>> >>>> Signed-off-by: Sumit Saxena >>>> Signed-off-by: Kashyap Desai >>>> --- >>>> drivers/scsi/megaraid/megaraid_sas.h| 25 +++- >>>> drivers/scsi/megaraid/megaraid_sas_base.c | 196 >>>> >>>> drivers/scsi/megaraid/megaraid_sas_fusion.c | 95 ++ >>>> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 +- >>>> 4 files changed, 235 insertions(+), 83 deletions(-) >>>> >>>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h >>>> b/drivers/scsi/megaraid/megaraid_sas.h >>>> index 156d4b9..f99db18 100644 >>>> --- a/drivers/scsi/megaraid/megaraid_sas.h >>>> +++ b/drivers/scsi/megaraid/megaraid_sas.h >>>> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info { >>>> >>>> #define VD_EXT_DEBUG 0 >>>> >>>> +enum MR_MFI_MPT_PTHR_FLAGS { >>>> + MFI_MPT_DETACHED = 0, >>>> + MFI_LIST_ADDED = 1, >>>> + MFI_MPT_ATTACHED = 2, >>>> +}; >>>> + >>>> /* Frame Type */ >>>> #define IO_FRAME 0 >>>> #define PTHRU_FRAME 1 >>>> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info { >>>> #define MEGASAS_IOCTL_CMD0 >>>> #define MEGASAS_DEFAULT_CMD_TIMEOUT 90 >>>> #define MEGASAS_THROTTLE_QUEUE_DEPTH 16 >>>> - >>>> +#define MEGASAS_BLOCKED_CMD_TIMEOUT 60 >>>> /* >>>> * FW reports the maximum of number of commands that it can accept >>>> (maximum >>>> * commands that can be outstanding) at any time. 
The driver must report a >>>> @@ -1652,7 +1658,7 @@ struct megasas_instance { >>>> struct megasas_cmd **cmd_list; >>>> struct list_head cmd_pool; >>>> /* used to sync fire the cmd to fw */ >>>> - spinlock_t cmd_pool_lock; >>>> + spinlock_t mfi_pool_lock; >>>> /* used to sync fire the cmd to fw */ >>>> spinlock_t hba_lock; >>>> /* used to synch producer, consumer ptrs in dpc */ >>>> @@ -1839,6 +1845,11 @@ struct megasas_cmd { >>>> >>>> struct list_head list; >>>> struct scsi_cmnd *scmd; >>>> + >>>> + void *mpt_pthr_cmd_blocked; >>>> + atomic_t mfi_mpt_pthr; >>>> + u8 is_wait_event; >>>> + >>>> struct megasas_instance *instance; >>>> union { >>>> struct { >>>> @@ -1927,4 +1938,14 @@ int megasas_set_crash_dump_params(struct >>>> megasas_instance *instance, >>>> void megasas_free_host_crash_buffer(struct megasas_instance *instance); >>>> void megasas_fusion_crash_dump_wq(struct work
Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature support
On Fri, Sep 12, 2014 at 12:28 AM, Tomas Henzl wrote: > On 09/11/2014 06:39 PM, Kashyap Desai wrote: >> On Thu, Sep 11, 2014 at 4:50 PM, Tomas Henzl wrote: >>> On 09/11/2014 11:02 AM, Kashyap Desai wrote: >>>>> -Original Message- >>>>> From: Vivek Goyal [mailto:vgo...@redhat.com] >>>>> Sent: Wednesday, September 10, 2014 9:17 PM >>>>> To: Tomas Henzl >>>>> Cc: Elliott, Robert (Server Storage); Sumit Saxena; >>>> linux-scsi@vger.kernel.org; >>>>> martin.peter...@oracle.com; h...@infradead.org; >>>>> jbottom...@parallels.com; Kashyap Desai; aradf...@gmail.com; Michal >>>>> Schmidt; am...@mellanox.com; Jens Axboe >>>>> (ax...@kernel.dk); scame...@beardog.cce.hp.com >>>>> Subject: Re: [PATCH 04/11] megaraid_sas : Firmware crash dump feature >>>>> support >>>>> >>>>> On Wed, Sep 10, 2014 at 05:28:40PM +0200, Tomas Henzl wrote: >>>>>> On 09/10/2014 05:06 PM, Elliott, Robert (Server Storage) wrote: >>>>>>>> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi- >>>>>>>> ow...@vger.kernel.org] On Behalf Of Sumit Saxena >>>>>>>> >>>>>>>>> From: Tomas Henzl [mailto:the...@redhat.com] >>>>>>>>> >>>>>>>>> With several controllers in a system this may take a lot memory, >>>>>>>>> could you also in case when a kdump kernel is running lower it, by >>>>>>>>> not using this feature? >>>>>>>>> >>>>>>>> Agreed, we will disable this feature for kdump kernel by adding >>>>>>>> "reset_devices" global varaiable. >>>>>>>> That check is required for only one place, throughout the code, >>>>>>>> this feature will remain disabled. Code snippet for the same- >>>>>>>> >>>>>>>> instance->crash_dump_drv_support = (!reset_devices) && >>>>>>>> crashdump_enable && >>>>>>>> instance->crash_dump_fw_support && >>>>>>>> instance->crash_dump_buf); >>>>>>>> if(instance->crash_dump_drv_support) { >>>>>>>> printk(KERN_INFO "megaraid_sas: FW Crash dump is >>>>>>>> supported\n"); >>>>>>>> megasas_set_crash_dump_params(instance, >>>>>>>> MR_CRASH_BUF_TURN_OFF); >>>>>>>> >>>>>>>> } else { >>>>>>>> .. 
>>>>>>>> } >>>>>>> Network drivers have been running into similar problems. >>>>>>> >>>>>>> There's a new patch from Amir coming through net-next to make >>>>>>> is_kdump_kernel() (in crash_dump.h) accessible to modules. >>>>>>> That may be a better signal than reset_devices that the driver >>>>>>> should use minimal resources. >>>>>>> >>>>>>> http://comments.gmane.org/gmane.linux.network/324737 >>>>>>> >>>>>>> I'm not sure about the logistics of a SCSI patch depending on a >>>>>>> net-next patch. >>>>>> Probably better to start with reset_devices and switch to >>>>>> is_kdump_kernel() later. >>>>>> This is not a discussion about reset_devices versus is_kdump_kernel, >>>>>> but while it looks good to have it distinguished - is the >>>>>> reset_devices actually used anywhere else than in kdump kernel? >>>>> I think usage of reset_devices for lowering memory footprint of driver >>>> is >>>>> plain wrong. It tells driver to only reset the device as BIOS might not >>>> have >>>>> done it right or we skipped BIOS completely. >>>>> >>>>> Using is_kdump_kernel() is also not perfect either but atleast better >>>> than >>>>> reset_devices. >>>> We will use is_kdump_kernel() not to enable this feature in Kdump kernel. >>> OK, just keep in mind that the is_kdump_kernel() is not yet in mainline. >> Sure I read that part, but when I check kernel source I found >> is_kdump_kernel() is available. >&
RE: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption fix
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Friday, September 12, 2014 12:50 AM > To: Kashyap Desai > Cc: Sumit Saxena; linux-scsi@vger.kernel.org; Martin K. Petersen; > Christoph > Hellwig; jbottom...@parallels.com; aradford > Subject: Re: [PATCH 10/11] megaraid_sas : MFI MPT linked list corruption > fix > > On 09/11/2014 08:41 PM, Kashyap Desai wrote: > > On Thu, Sep 11, 2014 at 5:53 PM, Tomas Henzl > wrote: > >> On 09/11/2014 04:48 AM, Kashyap Desai wrote: > >>> On Wed, Sep 10, 2014 at 8:36 PM, Tomas Henzl > wrote: > >>>> On 09/06/2014 03:25 PM, sumit.sax...@avagotech.com wrote: > >>>>> Problem statement: > >>>>> MFI link list in megaraid_sas driver is used from mfi-mpt > >>>>> pass-through > commands. > >>>>> This list can be corrupted due to many possible race conditions in > >>>>> driver and eventually we may see kernel panic. > >>>>> > >>>>> One example - > >>>>> MFI frame is freed from calling process as driver send command via > >>>>> polling method and interrupt for that command comes after driver > >>>>> free mfi frame (actually even after some other context reuse the > >>>>> mfi frame). When driver receive MPT frame in ISR, driver will be > >>>>> using > the index of MFI and access that MFI frame and finally in-used MFI frame’s > list will be corrupted. > >>>>> > >>>>> High level description of new solution - Free MFI and MPT command > >>>>> from same context. > >>>>> Free both the command either from process (from where mfi-mpt > >>>>> pass-through was called) or from ISR context. Do not split freeing > >>>>> of MFI and MPT, because it creates the race condition which will do > MFI/MPT list corruption. > >>>>> > >>>>> Renamed the cmd_pool_lock which is used in instance as well as > fusion with below name. > >>>>> mfi_pool_lock and mpt_pool_lock to add more code readability. 
> >>>>> > >>>>> Signed-off-by: Sumit Saxena > >>>>> Signed-off-by: Kashyap Desai > >>>>> --- > >>>>> drivers/scsi/megaraid/megaraid_sas.h| 25 +++- > >>>>> drivers/scsi/megaraid/megaraid_sas_base.c | 196 > > >>>>> drivers/scsi/megaraid/megaraid_sas_fusion.c | 95 ++ > >>>>> drivers/scsi/megaraid/megaraid_sas_fusion.h | 2 +- > >>>>> 4 files changed, 235 insertions(+), 83 deletions(-) > >>>>> > >>>>> diff --git a/drivers/scsi/megaraid/megaraid_sas.h > >>>>> b/drivers/scsi/megaraid/megaraid_sas.h > >>>>> index 156d4b9..f99db18 100644 > >>>>> --- a/drivers/scsi/megaraid/megaraid_sas.h > >>>>> +++ b/drivers/scsi/megaraid/megaraid_sas.h > >>>>> @@ -1016,6 +1016,12 @@ struct megasas_ctrl_info { > >>>>> > >>>>> #define VD_EXT_DEBUG 0 > >>>>> > >>>>> +enum MR_MFI_MPT_PTHR_FLAGS { > >>>>> + MFI_MPT_DETACHED = 0, > >>>>> + MFI_LIST_ADDED = 1, > >>>>> + MFI_MPT_ATTACHED = 2, > >>>>> +}; > >>>>> + > >>>>> /* Frame Type */ > >>>>> #define IO_FRAME 0 > >>>>> #define PTHRU_FRAME 1 > >>>>> @@ -1033,7 +1039,7 @@ struct megasas_ctrl_info { > >>>>> #define MEGASAS_IOCTL_CMD0 > >>>>> #define MEGASAS_DEFAULT_CMD_TIMEOUT 90 > >>>>> #define MEGASAS_THROTTLE_QUEUE_DEPTH 16 > >>>>> - > >>>>> +#define MEGASAS_BLOCKED_CMD_TIMEOUT 60 > >>>>> /* > >>>>> * FW reports the maximum of number of commands that it can > accept (maximum > >>>>> * commands that can be outstanding) at any time. The driver must > >>>>> report a @@ -1652,7 +1658,7 @@ struct megasas_instance { > >>>>> struct megasas_cmd **cmd_list; > >>>>> struct list_head cmd_pool; > >>>>> /* used to sync fire the cmd to fw */ > >>>>> - spinlock_t cmd_pool_lock; > >>>>> + spinlock_t mfi_pool_lock; > >>>&
RE: [PATCH] scsi: megaraid_sas: Prevent future %p disaster
Acked-by: Kashyap Desai > -Original Message- > From: Rasmus Villemoes [mailto:li...@rasmusvillemoes.dk] > Sent: Friday, February 06, 2015 8:04 PM > To: Kashyap Desai; Sumit Saxena; Uday Lingala; James E.J. Bottomley > Cc: Rasmus Villemoes; megaraidlinux@avagotech.com; linux- > s...@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: [PATCH] scsi: megaraid_sas: Prevent future %p disaster > > There is currently no %po format extension, so currently the letters "on" are > simply skipped and the pointer is printed as expected (while missing the word > on). However, it is probably only a matter of time before someone comes up > with a %po extension, at which point this is likely to fail spectacularly. > > Signed-off-by: Rasmus Villemoes > --- > drivers/scsi/megaraid/megaraid_sas_base.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index ff283d23788a..e7c6b9c946d6 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -3105,7 +3105,7 @@ megasas_internal_reset_defer_cmds(struct > megasas_instance *instance) > for (i = 0; i < max_cmd; i++) { > cmd = instance->cmd_list[i]; > if (cmd->sync_cmd == 1 || cmd->scmd) { > - printk(KERN_NOTICE "megasas: moving > cmd[%d]:%p:%d:%p" > + printk(KERN_NOTICE "megasas: moving > cmd[%d]:%p:%d:%p " > "on the defer queue as internal\n", > defer_index, cmd, cmd->sync_cmd, cmd- > >scmd); > > -- > 2.1.3 -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Scsi Error handling query
Hi Hannes, I was going through one of the slide decks posted at the link below. http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.pdf Slide #59 has the data below. I was trying to correlate it with the latest upstream code, but do not understand a few things. Does Linux handle blocking I/O to the device and target before it actually starts legacy EH recovery ? Also, how does the Linux SCSI stack achieve task set abort ? Proposed SCSI EH strategy • Send command aborts after timeout • EH Recovery starts: ‒ Block I/O to the device ‒ Issue 'Task Set Abort' ‒ Block I/O to the target ‒ Issue I_T Nexus Reset ‒ Complete outstanding command on success ‒ Engage current EH strategy ‒ LUN Reset, Target Reset etc Thanks, Kashyap
RE: Scsi Error handling query
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Thursday, March 26, 2015 9:28 PM > To: Kashyap Desai; linux-scsi@vger.kernel.org > Subject: Re: Scsi Error handling query > > On 03/26/2015 02:38 PM, Kashyap Desai wrote: > > Hi Hannes, > > > > I was going through one of the slide posted at below link. > > > > http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH.pd > > f > > > > Slide #59 has below data. I was trying to correlate with latest > > upstream code, but do not understand few things. Does Linux handle > > blocking I/O to the device and target before it actually start legacy EH > recovery ? > > Yes. This is handled by 'scsi_eh_scmd_add()', which adds the command to > the > internal 'eh_entry' list and starts recovery once all remaining > outstanding > commands are completed. Thanks, Hannes..! scsi_eh_scmd_add() moves the shost state to recovery, so it blocks further I/O to the whole host and is not really limited to the device/target whose command timed out. Right ? I understood that the new improvement to SCSI error handling will allow I/Os to the other devices attached to the host, except the I/O belonging to the specific target. Also, one more thing to clarify... In the presentation, the term "task set aborts" was used. Does this mean task set abort is handled by traversing the complete list of timed-out commands and sending individual TASK ABORTs ? > > > Also, how does linux scsi stack achieve task set abort ? > > > Currently we don't :-) > The presentation was a roadmap about future EH updates.
> > > Proposed SCSI EH strategy > > • Send command aborts after timeout > > • EH Recovery starts: > > ‒ Block I/O to the device > >‒ Issue 'Task Set Abort' > > ‒ Block I/O to the target > >‒ Issue I_T Nexus Reset > >‒ Complete outstanding command on success ‒ Engage current EH > > strategy > >‒ LUN Reset, Target Reset etc > > > The current plans for EH updates are: > > - Convert eh_host_reset_handler() to take Scsi_Host as argument > - Convert EH host reset to do a host rescan after try_host_reset() > succeeded > - Terminate failed scmds prior to calling try_host_reset() > => with that we should be able to instantiate a quick failover > when running under multipathing, as then I/Os will be returned > prior to the host reset (which is know to take quite a long > time) > > - Convert the remaining eh_XXX_reset_handler() to take the > appropriate structure as argument. > This will require some work, as some EH handler implementation > re-use the command tag (or even the actual command) for sending > TMFs. > > - Implementing a 'transport reset' EH function; to be called > after the current EH LUN Reset > > - Investigating the possibilty for an asynchronous 'task set abort', > and make the 'transport reset' EH function asynchronous, too. > > I've got a patchset for the first step, but the others still require some > work ... > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > h...@suse.de +49 911 74053 688 > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 > (AG > Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Scsi Error handling query
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Friday, March 27, 2015 9:32 PM > To: Kashyap Desai; linux-scsi@vger.kernel.org > Subject: Re: Scsi Error handling query > > On 03/26/2015 07:43 PM, Kashyap Desai wrote: > >> -Original Message- > >> From: Hannes Reinecke [mailto:h...@suse.de] > >> Sent: Thursday, March 26, 2015 9:28 PM > >> To: Kashyap Desai; linux-scsi@vger.kernel.org > >> Subject: Re: Scsi Error handling query > >> > >> On 03/26/2015 02:38 PM, Kashyap Desai wrote: > >>> Hi Hannes, > >>> > >>> I was going through one of the slide posted at below link. > >>> > >>> http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH. > >>> pd > >>> f > >>> > >>> Slide #59 has below data. I was trying to correlate with latest > >>> upstream code, but do not understand few things. Does Linux handle > >>> blocking I/O to the device and target before it actually start > >>> legacy EH > >> recovery ? > >> > >> Yes. This is handled by 'scsi_eh_scmd_add()', which adds the command > >> to the internal 'eh_entry' list and starts recovery once all > >> remaining outstanding commands are completed. > > > > Thanks Hannes..! Scsi_eh_scmd_add() move shost state to recovery, so > > it means blocking further IO to the Host and not really a limited to > > Device/Target for which command was timed out. Right ? > > I understood that, new improvement of scsi error handling will allow > > IOs to the other Devices attached to the host except the IO belongs to > > specific target. > > > > Also, one more thing to clarify... In presentation, term "task set > > aborts" > > was used. Does this mean task set abort is handled as traversing > > complete list of timed out command and sending individual TASK ABORT ? > > > No. The idea was to send 'task set aborts' as a single TMF. Thanks Hannes.! OK so idea was to have single TMF for "Task set abort." I am not sure how to frame my next question.. 
But what if the Linux SCSI layer traverses each I/O of one particular target and issues an individual task abort? Don’t we call that "task set aborts" ? How should an LLD interface for "task set aborts" as a single TMF ? My understanding is that "task set abort" will be internally converted into individual task aborts either by the SCSI layer or by the HBA FW. > However, I'm not sure if I'll be going ahead with that one; once we've > issued a > 'transport reset' the commands will be gone anyway. > > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > h...@suse.de +49 911 74053 688 > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 > (AG > Nürnberg)
RE: Scsi Error handling query
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Monday, March 30, 2015 8:43 PM > To: Kashyap Desai; linux-scsi@vger.kernel.org > Subject: Re: Scsi Error handling query > > On 03/30/2015 01:45 PM, Kashyap Desai wrote: > >> -Original Message- > >> From: Hannes Reinecke [mailto:h...@suse.de] > >> Sent: Friday, March 27, 2015 9:32 PM > >> To: Kashyap Desai; linux-scsi@vger.kernel.org > >> Subject: Re: Scsi Error handling query > >> > >> On 03/26/2015 07:43 PM, Kashyap Desai wrote: > >>>> -Original Message- > >>>> From: Hannes Reinecke [mailto:h...@suse.de] > >>>> Sent: Thursday, March 26, 2015 9:28 PM > >>>> To: Kashyap Desai; linux-scsi@vger.kernel.org > >>>> Subject: Re: Scsi Error handling query > >>>> > >>>> On 03/26/2015 02:38 PM, Kashyap Desai wrote: > >>>>> Hi Hannes, > >>>>> > >>>>> I was going through one of the slide posted at below link. > >>>>> > >>>>> http://events.linuxfoundation.org/sites/events/files/slides/SCSI-EH. > >>>>> pd > >>>>> f > >>>>> > >>>>> Slide #59 has below data. I was trying to correlate with latest > >>>>> upstream code, but do not understand few things. Does Linux handle > >>>>> blocking I/O to the device and target before it actually start > >>>>> legacy EH > >>>> recovery ? > >>>> > >>>> Yes. This is handled by 'scsi_eh_scmd_add()', which adds the > >>>> command to the internal 'eh_entry' list and starts recovery once > >>>> all remaining outstanding commands are completed. > >>> > >>> Thanks Hannes..! Scsi_eh_scmd_add() move shost state to recovery, so > >>> it means blocking further IO to the Host and not really a limited > >>> to Device/Target for which command was timed out. Right ? > >>> I understood that, new improvement of scsi error handling will allow > >>> IOs to the other Devices attached to the host except the IO belongs > >>> to specific target. > >>> > >>> Also, one more thing to clarify... In presentation, term "task set > >>> aborts" > >>> was used. 
Does this mean task set abort is handled as traversing > >>> complete list of timed out command and sending individual TASK ABORT ? > >>> > >> No. The idea was to send 'task set aborts' as a single TMF. > > > > Thanks Hannes.! OK so idea was to have single TMF for "Task set abort." > > I > > am not sure how to frame my next question.. But what if Linux SCSI > > layer traverse each IO of one particular target and issue individual > > Task abort? > > Don’t we call that as "task set aborts" ? How LLD Driver should > > interface for "task set aborts" as single TMF ? My understanding is > > "Task set > abort" > > will be internally converted to single Task abort either by SCSI layer > > or HBA FW. > > > Why? There _is_ a 'task set abort' TMF defined in SAM. > If the firmware doesn't implement it I'd thought the respective command to > be > failed? Understood this part. It is a single TMF of "task set abort" that is being addressed here. Is there any harm in doing the following in the low-level driver for LUN/target reset ? Once the LLD enters the eh_abort_handler callback, try to do "Task Set Abort" from the LLD as a single TMF (most likely converting the single "task set abort" into the FW-specific format). Wait for completion and see if things are really resolved. If not, issue a Target Reset (from the FW), which will be a SAS link hard reset. This way the driver can abort the commands related to the associated I_T nexus. I initially thought Linux SCSI error handling actually does this, because it completely traverses all timed-out commands from the error-handling thread (which would be almost equivalent to the single TMF "Task set abort"). > > However, at this point I'm not sure if 'task set abort' is actually > required; it > _should_ be superseded by the new 'transport reset' EH handler. > On the FC side this will translate into a relogin, which will > automatically abort all > outstanding tasks. > SAS even has a dedicated TMF IT NEXUS LOSS, which looks like it could be > used > here. Not sure if my above comment/query is on the same line.
> > Cheers, > > Hannes > -- > Dr. Hannes Reinecke zSeries & Storage > h...@suse.de +49 911 74053 688 > SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg > GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 > (AG > Nürnberg)
RE: [PATCH] megaraid:Make the function megasas_alloc_cmd_fusion static
> -int > +static int > megasas_alloc_cmds_fusion(struct megasas_instance *instance) { > int i, j, count; Good catch. However, let's not include this patch, to avoid further confusion with the series of acked patches; it does not make a significant functional difference. NACK, as it makes no functional difference. We can make many more such changes in one shot; let's address those later, as we have a few patches ready to send in a few days. > -- > 2.5.0
RE: [PATCH] megaraid:Make various functions static in megaraid_sas_fusion.c
> -Original Message- > From: Nicholas Krause [mailto:xerofo...@gmail.com] > Sent: Monday, July 06, 2015 10:06 PM > To: kashyap.de...@avagotech.com > Cc: sumit.sax...@avagotech.com; uday.ling...@avagotech.com; > jbottom...@odin.com; megaraidlinux@avagotech.com; linux- > s...@vger.kernel.org; linux-ker...@vger.kernel.org > Subject: [PATCH] megaraid:Make various functions static in > megaraid_sas_fusion.c > > This makes various functions that have no external callers outside of their > definition/declaration in the file megaraid_sas_fusion.c to be properly > declared as static functions now. > > Signed-off-by: Nicholas Krause Nicholas - You posted a patch with the subject line "megaraid:Make the function megasas_alloc_cmd_fusion static" for similar code changes. Let's have a consolidated patch for this, and Avago can post it with the next change set (to keep the git commits clean without any failure). Just to be clear - you can send us the patch and we can add it with your name in the Submitted-by signature. For now, NACK. ` Kashyap
RE: megaraid_sas: add an i/o barrier
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Friday, January 29, 2016 11:05 PM > To: 'linux-scsi@vger.kernel.org' > Cc: sumit.sax...@avagotech.com; Desai, Kashyap; Uday Lingala > Subject: megaraid_sas: add an i/o barrier > > A barrier should be added to ensure proper ordering of memory mapped > writes. > > Signed-off-by: Tomas Henzl > --- > drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > b/drivers/scsi/megaraid/megaraid_sas_fusion.c > index d9d0029fb1..98a848bdfd 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > @@ -204,6 +204,7 @@ megasas_fire_cmd_fusion(struct > megasas_instance *instance, > &instance->reg_set->inbound_low_queue_port); > writel(le32_to_cpu(req_desc->u.high), > &instance->reg_set->inbound_high_queue_port); > + mmiowb(); > spin_unlock_irqrestore(&instance->hba_lock, flags); #endif } Tomas- We may need similar changes around the functions below as well, because there is no associated readX() or mmiowb() call. megasas_fire_cmd_xscale() megasas_fire_cmd_ppc() megasas_fire_cmd_skinny() megasas_fire_cmd_gen2() Also, the writeq() path in the same function, megasas_fire_cmd_fusion(), needs an i/o barrier. > -- > 2.4.3
RE: megaraid_sas: add an i/o barrier
> -Original Message- > From: Tomas Henzl [mailto:the...@redhat.com] > Sent: Monday, February 01, 2016 6:23 PM > To: Kashyap Desai; linux-scsi@vger.kernel.org > Cc: Sumit Saxena; Uday Lingala > Subject: Re: megaraid_sas: add an i/o barrier > > On 1.2.2016 05:45, Kashyap Desai wrote: > >> -Original Message- > >> From: Tomas Henzl [mailto:the...@redhat.com] > >> Sent: Friday, January 29, 2016 11:05 PM > >> To: 'linux-scsi@vger.kernel.org' > >> Cc: sumit.sax...@avagotech.com; Desai, Kashyap; Uday Lingala > >> Subject: megaraid_sas: add an i/o barrier > >> > >> A barrier should be added to ensure proper ordering of memory mapped > >> writes. > >> > >> Signed-off-by: Tomas Henzl > >> --- > >> drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 + > >> 1 file changed, 1 insertion(+) > >> > >> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c > >> b/drivers/scsi/megaraid/megaraid_sas_fusion.c > >> index d9d0029fb1..98a848bdfd 100644 > >> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c > >> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c > >> @@ -204,6 +204,7 @@ megasas_fire_cmd_fusion(struct > megasas_instance > >> *instance, > >>&instance->reg_set->inbound_low_queue_port); > >>writel(le32_to_cpu(req_desc->u.high), > >>&instance->reg_set->inbound_high_queue_port); > >> + mmiowb(); > >>spin_unlock_irqrestore(&instance->hba_lock, flags); #endif } > > Tomas- > > > > We may need similar changes around below Functions as well, because > > there is no associated readX or mmiowb() call. > > megasas_fire_cmd_xscale() > > megasas_fire_cmd_ppc() > > megasas_fire_cmd_skinny() > > megasas_fire_cmd_gen2() > > > > Also, wrireq() routine in same function megasas_fire_cmd_fusion() > > need i/o barrier. > > I don't think so (with the exception of megasas_fire_cmd_skinny - I missed > this one). 
> When two threads try to use a fire_cmd there is no protection of certain > ordering, that had to be done in a caller of fire_cmd (for example in > megasas_build_and_issue_cmd_fusion) > and it seems to me that there is nothing like that. Likely is, that this - > a > strict ordering of commands - is not needed. > The protection which I'm adding is needed when a command consist of a > sequence of more than one write, see memory-barriers.txt. > > (It looks to me that > megasas_fire_cmd_ -xscale -ppc -skiny -gen2 do not need the hba_lock > unless there is another i/o sequence protected with the same lock (note > also that there is no such lock in megasas_fire_cmd_fusion).) Agreed, we don't need the hba lock in the older fire_cmd wrappers. We updated the fusion code because fusion adapters were in active shipment and their product life cycle was active, so the code changes were tested by Avago. We did not clean up the code for the MFI-based controllers since they are only in the critical-support cycle. > > > I'll add the mmiowb to megasas_fire_cmd_skinny and send a new patch - > agreed? Got your point. You are right. Yes, the additional change in megasas_fire_cmd_skinny along with the current patch will be good to go. > > --tms > > > > >> -- > >> 2.4.3 > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v2] megaraid_sas: add an i/o barrier
> -Original Message- > From: Martin K. Petersen [mailto:martin.peter...@oracle.com] > Sent: Wednesday, February 10, 2016 10:55 PM > To: Tomas Henzl > Cc: 'linux-scsi@vger.kernel.org'; sumit.sax...@avagotech.com; Desai, > Kashyap; Uday Lingala; sta...@vger.kernel.org > Subject: Re: [PATCH v2] megaraid_sas: add an i/o barrier > > >>>>> "Tomas" == Tomas Henzl writes: > > Tomas> A barrier should be added to ensure proper ordering of memory > Tomas> mapped writes. > > Tomas> V2: - added the barrier also to megasas_fire_cmd_skinny, as > Tomas> suggested by Kashyap Desai Reviewed-by: Kashyap Desai Acked-by: Kashyap Desai > > Sumit, Kashyap: Please review this patch. > > -- > Martin K. PetersenOracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 1/3] megaraid_sas: Convert dev_printk to dev_
> > + dev_dbg(&instance->pdev->dev, "Error copying out > > cmd_status\n"); > > error = -EFAULT; > > } > > > > Reviewed-by: Johannes Thumshirn We will consider all three patches for a future submission. As of now we have two patch sets pending to be committed. We are working on a few more patches for megaraid_sas which will do cleanup in the driver module plus proper error handling of DCMD command timeouts. That will cover the patch posted with the subject below - [PATCH 3/3] megaraid_sas: return -ENOMEM when create DMA pool for cmd frames failed James - We will be resending these patch sets on top of the latest outstanding megaraid_sas driver patches, so that we can avoid any conflicts in commits.
RE: [PATCH 02/12] megaraid_sas : Code optimization- remove PCI Ids based checks
> > > -Original Message- > > From: Martin K. Petersen [mailto:martin.peter...@oracle.com] > > Sent: Wednesday, October 28, 2015 8:03 AM > > To: sumit.sax...@avagotech.com > > Cc: linux-scsi@vger.kernel.org; the...@redhat.com; > > martin.peter...@oracle.com; h...@infradead.org; > > jbottom...@parallels.com; kashyap.de...@avagotech.com; > > kiran-kumar.kast...@avagotech.com; > > uday.ling...@avagotech.com > > Subject: Re: [PATCH 02/12] megaraid_sas : Code optimization- remove > > PCI > Ids > > based checks > > > > > "Sumit" == sumit saxena writes: > > > > Sumit> Code optimization: remove PCI id based checks and instead of > > Sumit> those use instance->ctrl_context to make call whether > > Sumit> controller is MFI based OR fusion adapter. fusion adapters > > Sumit> further are also divided in two categories- 1)THUNDERBOLT > > Sumit> SERIES and 2)INVADER SERIES. > > > > Does not apply against current upstream. Please fix and repost. > > Martin, This patch set is not based on current upstream but rebased on top > of few patches sent by me prior to this patch series. Patches sent prior > to > this patch series are not yet accepted. Martin - Here is the last outstanding patch series. http://marc.info/?l=linux-scsi&m=144102204225400&w=2 The last patch series was missed because of a small rebase issue. James suggested not to ACK anything which is not critical, to avoid such issues, and probably to piggy-back those patches. Both the last series and the latest series have Reviewed-by tags. Can we have the outstanding patches committed in the next window ? > > Thanks, > Sumit > > > > -- > > Martin K. Petersen Oracle Linux Engineering
RE: [patch 2/2] megaraid_sas: missing curly braces in megasas_detach_one()
> -Original Message- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Saturday, November 07, 2015 9:47 PM > To: Kashyap Desai; Sumit Saxena > Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com; > linux-scsi@vger.kernel.org; kernel-janit...@vger.kernel.org > Subject: [patch 2/2] megaraid_sas: missing curly braces in > megasas_detach_one() > > The indenting indicates that there are supposed to be some curly braces > here. Presumably it means we free something unintentionally leading to a > use after free. NACK. The indentation is wrong; we don't need to add braces. We will fix the indentation and resend the patch. Since this is not critical, let's ignore it for now. > > Fixes: 3761cb4cf65e ('megaraid_sas: JBOD sequence number support') > Signed-off-by: Dan Carpenter > --- > Not tested. > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index 829e9e9..91e200d 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -5932,7 +5932,7 @@ static void megasas_detach_one(struct pci_dev > *pdev) > fusion->max_map_sz, > fusion->ld_map[i], > fusion->ld_map_phys[i]); > - if (fusion->ld_drv_map[i]) > + if (fusion->ld_drv_map[i]) { > free_pages((ulong)fusion->ld_drv_map[i], > fusion->drv_map_pages); > if (fusion->pd_seq_sync) > @@ -5940,6 +5940,7 @@ static void megasas_detach_one(struct pci_dev > *pdev) > pd_seq_map_sz, > fusion->pd_seq_sync[i], > fusion->pd_seq_phys[i]); > + } > } > free_pages((ulong)instance->ctrl_context, > instance->ctrl_context_pages)
RE: [patch 1/2] megaraid_sas: remove a stray tab
> -Original Message- > From: Dan Carpenter [mailto:dan.carpen...@oracle.com] > Sent: Saturday, November 07, 2015 9:46 PM > To: Kashyap Desai; Sumit Saxena > Cc: Uday Lingala; James E.J. Bottomley; megaraidlinux@avagotech.com; > linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org; kernel- > janit...@vger.kernel.org > Subject: [patch 1/2] megaraid_sas: remove a stray tab > > One line was indented more than the others. > > Signed-off-by: Dan Carpenter Let us combine your next patch (since that is also an indentation issue) with this one; we will resubmit the patch. For now, please ignore it, to avoid any issue for the pending patches to be committed. > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c > b/drivers/scsi/megaraid/megaraid_sas_base.c > index cc95372..829e9e9 100644 > --- a/drivers/scsi/megaraid/megaraid_sas_base.c > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c > @@ -5923,9 +5923,9 @@ static void megasas_detach_one(struct pci_dev > *pdev) > > if (instance->ctrl_context) { > megasas_release_fusion(instance); > - pd_seq_map_sz = sizeof(struct > MR_PD_CFG_SEQ_NUM_SYNC) + > + pd_seq_map_sz = sizeof(struct > MR_PD_CFG_SEQ_NUM_SYNC) + > (sizeof(struct MR_PD_CFG_SEQ) * > - (MAX_PHYSICAL_DEVICES - 1)); > + (MAX_PHYSICAL_DEVICES - 1)); > for (i = 0; i < 2 ; i++) { > if (fusion->ld_map[i]) > dma_free_coherent(&instance->pdev->dev,
RE: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN Raid Action request on SAS2 HBA's
> -Original Message- > From: Hannes Reinecke [mailto:h...@suse.de] > Sent: Wednesday, November 11, 2015 6:27 PM > To: Sreekanth Reddy; j...@kernel.org > Cc: martin.peter...@oracle.com; linux-scsi@vger.kernel.org; > jbottom...@parallels.com; sathya.prak...@avagotech.com; > kashyap.de...@avagotech.com; linux-ker...@vger.kernel.org; > h...@infradead.org; chaitra.basa...@avagotech.com; suganath- > prabu.subram...@avagotech.com > Subject: Re: [PATCH RESEND 09/25] mpt3sas: Don't send PHYDISK_HIDDEN > Raid Action request on SAS2 HBA's > > On 11/11/2015 01:00 PM, Sreekanth Reddy wrote: > > From: Sreekanth Reddy > > > > Don't send PHYDISK_HIDDEN Raid Action request for SAS2 HBA's. > > Since these HBA's doesn't support this Raid Action. > > > > Also enable fast_path only for SAS3 HBA's. > > > > Signed-off-by: Sreekanth Reddy > > --- > > drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +-- > > 1 file changed, 17 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > b/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > index a638920..80469d0 100644 > > --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c > > @@ -1165,8 +1165,10 @@ scsih_target_alloc(struct scsi_target *starget) > > if (test_bit(sas_device->handle, ioc->pd_handles)) > > sas_target_priv_data->flags |= > > MPT_TARGET_FLAGS_RAID_COMPONENT; > > +#ifndef SCSI_MPT2SAS > > if (sas_device->fast_path) > > sas_target_priv_data->flags |= > MPT_TARGET_FASTPATH_IO; > > +#endif > > } > > spin_unlock_irqrestore(&ioc->sas_device_lock, flags); > > > > @@ -3719,11 +3721,13 @@ scsih_qcmd(struct Scsi_Host *shost, struct > scsi_cmnd *scmd) > > ioc->build_zero_len_sge(ioc, &mpi_request->SGL); > > > > if (likely(mpi_request->Function == > MPI2_FUNCTION_SCSI_IO_REQUEST)) > > { > > +#ifndef SCSI_MPT2SAS > > if (sas_target_priv_data->flags & > MPT_TARGET_FASTPATH_IO) { > > mpi_request->IoFlags = cpu_to_le16(scmd- > >cmd_len | > > MPI25_SCSIIO_IOFLAGS_FAST_PATH); > > 
mpt3sas_base_put_smid_fast_path(ioc, smid, > handle); > > } else > > +#endif > > mpt3sas_base_put_smid_scsi_io(ioc, smid, handle); > > } else > > mpt3sas_base_put_smid_default(ioc, smid); @@ -5031,8 > +5035,10 @@ > > _scsih_add_device(struct MPT3SAS_ADAPTER *ioc, u16 handle, u8 > phy_num, > > sas_device->device_info = device_info; > > sas_device->sas_address = sas_address; > > sas_device->phy = sas_device_pg0.PhyNum; > > +#ifndef SCSI_MPT2SAS > > sas_device->fast_path = (le16_to_cpu(sas_device_pg0.Flags) & > > MPI25_SAS_DEVICE0_FLAGS_FAST_PATH_CAPABLE) ? 1 : 0; > > +#endif > > > > if (sas_device_pg0.Flags & > MPI2_SAS_DEVICE0_FLAGS_ENCL_LEVEL_VALID) { > > sas_device->enclosure_level = > > @@ -5731,6 +5737,7 @@ _scsih_sas_discovery_event(struct > MPT3SAS_ADAPTER *ioc, > > } > > } > > > > +#ifndef SCSI_MPT2SAS > > /** > > * _scsih_ir_fastpath - turn on fastpath for IR physdisk > > * @ioc: per adapter object > > @@ -5750,7 +5757,6 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER > *ioc, u16 handle, u8 phys_disk_num) > > u16 ioc_status; > > u32 log_info; > > > > - > > mutex_lock(&ioc->scsih_cmds.mutex); > > > > if (ioc->scsih_cmds.status != MPT3_CMD_NOT_USED) { @@ - > 5825,6 > > +5831,8 @@ _scsih_ir_fastpath(struct MPT3SAS_ADAPTER *ioc, u16 > handle, u8 phys_disk_num) > > FORCE_BIG_HAMMER); > > return rc; > > } > > +/* End of not defined SCSI_MPT2SAS */ #endif > > > > /** > > * _scsih_reprobe_lun - reprobing lun @@ -6017,8 +6025,10 @@ > > _scsih_sas_pd_hide(struct MPT3SAS_ADAPTER *ioc, > > if (!sas_device) > > return; > > > > +#ifndef SCSI_MPT2SAS > > /* hiding raid component */ > > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > if (starget) > > starget_for_each_device(starget, (void *)1, > _scsih_reprobe_lun); } > > @@ -6067,7 +6077,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER > *ioc, > > sas_device = _scsih_sas_device_find_by_handle(ioc, handle); > > spin_unlock_irqrestore(&ioc->sas_device_lock, flags); > > if (sas_device) { > > +#ifndef 
SCSI_MPT2SAS > > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > return; > > } > > > > @@ -6091,7 +6103,9 @@ _scsih_sas_pd_add(struct MPT3SAS_ADAPTER > *ioc, > > mpt3sas_transport_update_links(ioc, sas_address, handle, > > sas_device_pg0.PhyNum, > MPI2_SAS_NEG_LINK_RATE_1_5); > > > > +#ifndef SCSI_MPT2SAS > > _scsih_ir_fastpath(ioc, handle, element->PhysDiskNum); > > +#endif > > _scsih_add_device(ioc, handle, 0, 1); } > > > > @@ -6202,13 +6216,14 @@ _scsih_sas_ir_config_change_event(struct > > MPT3SAS_ADAPTER *ioc, > > > >
RE: How to get more sequential IO merged at elevator
> -Original Message- > From: Jens Axboe [mailto:ax...@kernel.dk] > Sent: Tuesday, May 06, 2014 7:47 PM > To: Desai, Kashyap > Cc: linux-scsi@vger.kernel.org > Subject: Re: How to get more sequential IO merged at elevator > > On 05/06/2014 04:06 AM, Desai, Kashyap wrote: > > I got some clue on what was going on while doing 4K sequential read using > fio. > > > > If I use ioengine in fio script as "libaio", I see " do_io_submit" call > plug/unplug the queue before submitting the IO. > > It means after every IO, we expect to send IO immediately to the next > layer. If at all there are any pending IO do merge.. but not due to plugging. > > > > This is what happens on my test. Every time IO comes from application > > with libaio engine, it send down to the elevator/io-scheduler because > queue was unplugged in do_io_submit(). Moment I reduce the queue depth > of the block device, merge start because of congestion at scsi mid layer. > > > > If I use, mmap engine, I see merged IO coming to the device driver > > because of plugging. I really don't know how it works, but gave a try > > and found merge happen because of plugging. ( I confirm using > > blktrace) > > > > Is there any ioengine in (or any other parameter setting), which can > use plugging mechanism of block layer to merge more IO other than mmap ? > > O_DIRECT IO is sync by nature, which is why it is sent off immediately instead > of held for potentially merging. mmap is not. You should be able to provoke > merging by submitting more than 1 IO at the time. See the iodepth_batch > settings for fio. Thanks Jens. I got your point about O_DIRECT IO. The fio documentation says the default iodepth_batch and iodepth_low will be the same as the iodepth value, but in my case I had to provide those parameters explicitly; it looks like the effective default for iodepth_batch/low is 1. Once I provided the fio parameters "iodepth_batch=32" and "iodepth_low=32", I see IO submitted in batches between plug and unplug from the application.
I am now able to get the maximum merged IO. Thanks for helping me on this. ~ Kashyap > > -- > Jens Axboe > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
V1 -> V2: Added a fix in __blk_mq_finish_request around blk_mq_put_tag() for non-internal tags. Problem statement: Whenever we try to get an outstanding request via scsi_host_find_tag, the block layer can return stale entries instead of the actual outstanding request, causing a kernel panic if the stale entry is inaccessible or its memory has been reused. Fix: Undo the request mapping in blk_mq_put_driver_tag once the request is returned. More detail: Whenever an SDEV entry is created, the block layer allocates separate tags and static requests. Those requests are no longer valid after the SDEV is deleted from the system. On the fly, the block layer maps static rqs to rqs as below from blk_mq_get_driver_tag(): data.hctx->tags->rqs[rq->tag] = rq; The above mapping covers the active in-use requests, and it is the same mapping which is referenced by scsi_host_find_tag(). After running some IOs, "data.hctx->tags->rqs[rq->tag]" will hold some entries which are never reset in the block layer. A kernel panic occurs if a request pointed to by "data.hctx->tags->rqs[rq->tag]" belongs to an "sdev" which has been removed, because all the memory allocated for requests associated with that sdev may have been reused or made inaccessible to the driver.
Kernel panic snippet - BUG: unable to handle kernel paging request at ff800010 IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas] PGD aa4414067 PUD 0 Oops: [#1] SMP Call Trace: [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas] [] scsih_shutdown+0x55/0x100 [mpt3sas] Cc: Signed-off-by: Kashyap Desai Signed-off-by: Sreekanth Reddy ---
 block/blk-mq.c | 4 +++-
 block/blk-mq.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..88d1e92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;

 	blk_pm_mark_last_busy(rq);
-	if (rq->tag != -1)
+	if (rq->tag != -1) {
+		hctx->tags->rqs[rq->tag] = NULL;
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	}
 	if (sched_tag != -1)
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
+	hctx->tags->rqs[rq->tag] = NULL;
 	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = -1;
--
1.8.3.1
RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> > > > Other block drivers (e.g. ib_srp, skd) do not need this to work > > reliably. > > It has been explained to you that the bug that you reported can be > > fixed by modifying the mpt3sas driver. So why to fix this by modifying > > the block layer? Additionally, what prevents that a race condition > > occurs between the block layer clearing hctx->tags->rqs[rq->tag] and > > scsi_host_find_tag() reading that same array element? I'm afraid that > > this is an attempt to paper over a real problem instead of fixing the > > root > cause. > > I have to agree with Bart here, I just don't see how the mpt3sas use case > is > special. The change will paper over the issue in any case. Hi Jens, Bart One of the key requirements for iterating the whole tagset using scsi_host_find_tag is to block the scsi host first. Once we have done that, we should be good; no race condition is possible if that part is taken care of. Without this patch, the driver can still receive a scsi command from hctx->tags->rqs which is really not outstanding. I am finding this is a common issue for many scsi low-level drivers. Just for example - the fnic_is_abts_pending() function has the below code - for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) { sc = scsi_host_find_tag(fnic->lport->host, tag); /* * ignore this lun reset cmd or cmds that do not belong to * this lun */ if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc))) continue; The above code has a similar exposure to kernel panic when the driver accesses sc->device. The panic is more likely if scsi devices are added/removed before looping through scsi_host_find_tag(). Avoiding block layer changes was also attempted in but our problem is to make that code common for non-mq and mq. As a temporary measure to unblock this issue, we have fixed it using the driver-internal scsiio_tracker() instead of piggy-backing on scsi_command. Kashyap > > -- > Jens Axboe
RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> > On 12/18/18 10:08 AM, Kashyap Desai wrote: > >>> > >>> Other block drivers (e.g. ib_srp, skd) do not need this to work > >>> reliably. > >>> It has been explained to you that the bug that you reported can be > >>> fixed by modifying the mpt3sas driver. So why to fix this by > >>> modifying the block layer? Additionally, what prevents that a race > >>> condition occurs between the block layer clearing > >>> hctx->tags->rqs[rq->tag] and > >>> scsi_host_find_tag() reading that same array element? I'm afraid > >>> that this is an attempt to paper over a real problem instead of > >>> fixing the root > >> cause. > >> > >> I have to agree with Bart here, I just don't see how the mpt3sas use > >> case is special. The change will paper over the issue in any case. > > > > Hi Jens, Bart > > > > One of the key requirement for iterating whole tagset using > > scsi_host_find_tag is to block scsi host. Once we are done that, we > > should be good. No race condition is possible if that part is taken > > care. > > Without this patch, if driver still receive scsi command from the > > hctx->tags->rqs which is really not outstanding. I am finding this is > > common issue for many scsi low level drivers. > > > > Just for example - fnic_is_abts_pending() function has below > > code - > > > > for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) { > > sc = scsi_host_find_tag(fnic->lport->host, tag); > > /* > > * ignore this lun reset cmd or cmds that do not belong > > to > > * this lun > > */ > > if (!sc || (lr_sc && (sc->device != lun_dev || sc == > > lr_sc))) > > continue; > > > > Above code also has similar exposure of kernel panic like > > driver while accessing sc->device. > > > > Panic is more obvious if we have add/removal of scsi device before > > looping through scsi_host_find_tag(). > > > > Avoiding block layer changes is also attempted in but our > > problem is to convert that code common for non-mq and mq. 
> > Temporary to unblock this issue, We have fixed using driver > > internals scsiio_tracker() instead of piggy back in scsi_command. > > For mq, the requests never go out of scope, they are always valid. So the > key > question here is WHY they have been freed. If the queue gets killed, then > one > potential solution would be to clear pointers in the tag map belonging to > that > queue. That also takes it out of the hot path. At driver load, whenever the driver calls scsi_add_host_with_dma(), it follows the below code path in the block layer: scsi_mq_setup_tags ->blk_mq_alloc_tag_set -> blk_mq_alloc_rq_maps -> __blk_mq_alloc_rq_maps The SML creates two sets of request pools: one per HBA and the other per SDEV. I was confused about why the SML creates a request pool per HBA. Example - if the HBA queue depth is 1K and there are 8 devices behind that HBA, the total request pool created is 1K + 8 * the scsi_device queue depth. The 1K pool is always static, but the other request pools are managed whenever a scsi device is added/removed. I never observed the requests allocated per HBA being used in the IO path; it is always the requests allocated per scsi device that are active. Also, what I observed is that whenever a scsi_device is deleted, the associated requests are also deleted. What is missing is: "Deleted requests are still available in hctx->tags->rqs[rq->tag]." If there is an assurance that all the requests will remain valid as long as the hctx is available, this patch is not correct. I posted the patch based on the assumption that requests per hctx can be removed whenever a scsi device is removed. Kashyap > > -- > Jens Axboe
RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> On 12/18/18 10:48 AM, Kashyap Desai wrote: > >> > >> On 12/18/18 10:08 AM, Kashyap Desai wrote: > >>>>> > >>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work > >>>>> reliably. > >>>>> It has been explained to you that the bug that you reported can be > >>>>> fixed by modifying the mpt3sas driver. So why to fix this by > >>>>> modifying the block layer? Additionally, what prevents that a race > >>>>> condition occurs between the block layer clearing > >>>>> hctx->tags->rqs[rq->tag] and > >>>>> scsi_host_find_tag() reading that same array element? I'm afraid > >>>>> that this is an attempt to paper over a real problem instead of > >>>>> fixing the root > >>>> cause. > >>>> > >>>> I have to agree with Bart here, I just don't see how the mpt3sas > >>>> use case is special. The change will paper over the issue in any > >>>> case. > >>> > >>> Hi Jens, Bart > >>> > >>> One of the key requirement for iterating whole tagset using > >>> scsi_host_find_tag is to block scsi host. Once we are done that, we > >>> should be good. No race condition is possible if that part is taken > >>> care. > >>> Without this patch, if driver still receive scsi command from the > >>> hctx->tags->rqs which is really not outstanding. I am finding this > >>> hctx->tags->is > >>> common issue for many scsi low level drivers. > >>> > >>> Just for example - fnic_is_abts_pending() function has below > >>> code - > >>> > >>> for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) { > >>> sc = scsi_host_find_tag(fnic->lport->host, tag); > >>> /* > >>> * ignore this lun reset cmd or cmds that do not > >>> belong to > >>> * this lun > >>> */ > >>> if (!sc || (lr_sc && (sc->device != lun_dev || sc == > >>> lr_sc))) > >>> continue; > >>> > >>> Above code also has similar exposure of kernel panic like > >>> driver while accessing sc->device. > >>> > >>> Panic is more obvious if we have add/removal of scsi device before > >>> looping through scsi_host_find_tag(). 
> >>> > >>> Avoiding block layer changes is also attempted in but our > >>> problem is to convert that code common for non-mq and mq. > >>> Temporary to unblock this issue, We have fixed using > >>> driver internals scsiio_tracker() instead of piggy back in > >>> scsi_command. > >> > >> For mq, the requests never go out of scope, they are always valid. So > >> the key question here is WHY they have been freed. If the queue gets > >> killed, then one potential solution would be to clear pointers in the > >> tag map belonging to that queue. That also takes it out of the hot > >> path. > > > > At driver load whenever driver does scsi_add_host_with_dma(), it > > follows below code path in block layer. > > > > scsi_mq_setup_tags > > ->blk_mq_alloc_tag_set > > -> blk_mq_alloc_rq_maps > > -> __blk_mq_alloc_rq_maps > > > > SML create two set of request pool. One is per HBA and other is per > > SDEV. I was confused why SML creates request pool per HBA. > > > > Example - IF HBA queue depth is 1K and there are 8 device behind that > > HBA, total request pool is created is 1K + 8 * scsi_device queue > > depth. 1K will be always static, but other request pool is managed > > whenever scsi device is added/removed. > > > > I never observe requests allocated per HBA is used in IO path. It is > > always request allocated per scsi device is what active. > > Also, what I observed is whenever scsi_device is deleted, associated > > request is also deleted. What is missing is - "Deleted request still > > available in > > hctx->tags->rqs[rq->tag]." > > So that sounds like the issue. If the device is deleted and its requests > go away, > those pointers should be cleared. That's what your patch should do, not do > it > for each IO. At the time of device removal, it requires reverse traversing. Find out if each requests associated with sdev is part of hctx->tags->rqs() and clear that entry. Not sure about atomic traverse if more than one device removal is happening in parallel. 
Maybe that is more error-prone? Just wondering - either way we will be removing invalid requests from the array. Are you suspecting a performance issue if we do it per IO? Kashyap > > > -- > Jens Axboe
RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> > > > At the time of device removal, it requires reverse traversing. Find > > out if each requests associated with sdev is part of hctx->tags->rqs() > > and clear that entry. > > Not sure about atomic traverse if more than one device removal is > > happening in parallel. May be more error prone. ? > > > > Just wondering both the way we will be removing invalid request from > > array. > > Are you suspecting any performance issue if we do it per IO ? > > It's an extra store, and it's a store to an area that's then now shared > between > issue and completion. Those are never a good idea. Besides, it's the kind > of > issue you solve in the SLOW path, not in the fast path. Since that's > doable, it > would be silly to do it for every IO. > > This might not matter on mpt3sas, but on more efficient hw it definitely > will. Understood - your primary concern is to avoid the per-IO store, and to take that route only if there is no better way. > I'm still trying to convince myself that this issue even exists. I can see > having > stale entries, but those should never be busy. Why are you finding them > with > the tag iteration? It must be because the tag is reused, and you are > finding it > before it's re-assigned? Stale entries stay around forever once we remove scsi devices; it is not a timing issue. If the memory associated with a request (freed due to device removal) is reused, a kernel panic occurs. We have 24 drives behind an expander and perform an expander reset, which removes all 24 drives and adds them back. The removal and re-addition of all the drives happens quickly. As part of the expander reset, the driver processes a broadcast primitive event, and that requires finding all outstanding scsi commands. In some cases we need a firmware restart, and that path also requires tag iteration. > > -- > Jens Axboe
RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> > I actually took a look at scsi_host_find_tag() - what I think needs fixing > here is > that it should not return a tag that isn't allocated. > You're just looking up random stuff, that is a recipe for disaster. > But even with that, there's no guarantee that the tag isn't going away. Got your point. Let us fix it in the driver. > > The mpt3sas use case is crap. It's iterating every tag, just in case it > needs to do > something to it. Many drivers in the scsi layer have similar trouble; maybe they are just less exposed. That was the main reason I thought to provide a common fix in the block layer. > > My suggestion would be to scrap that bad implementation and have > something available for iterating busy tags instead. That'd be more > appropriate and a lot more efficient that a random loop from 0..depth. > If you are flushing running commands, looking up tags that aren't even > active > is silly and counterproductive. We will address this issue through driver changes in two steps. 1. I can use the driver's internal memory and not rely on the request/scsi command. The tag 0...depth loop is not in the main IO path, so what we need is contention-free access to the list; keeping the driver's own memory and array will provide that control. 2. As you suggested, the best way is to use busy-tag iteration (only for the blk-mq stack). Thanks for your feedback. Kashyap > > -- > Jens Axboe
RE: [PATCH] megaraid_sas: enable blk-mq for fusion
> Fusion adapters can steer completions to individual queues, so we can enable > blk-mq for those adapters. > And in doing so we can rely on the interrupt affinity from the block layer and > drop the hand-crafted 'reply_map' construct. Hannes, I understand the intent of this patch, as we have discussed this same topic a couple of times in the past. We really don't need such an interface in MR/IT HBAs because we have a single submission queue (h/w registers). Mimicking multiple h/w submission queues really does not help performance. Relevant discussion is below - https://marc.info/?l=linux-scsi&m=153078454611933&w=2 We can hit the max h/w and storage performance limit using nr_hw_queues = 1. We could not find any reason to really set up more than one hw queue. Kashyap
Re: [PATCH v1] mpt3sas: Use driver scsi lookup to track outstanding IOs
On Tue, Feb 26, 2019 at 8:23 PM Hannes Reinecke wrote: > > On 2/26/19 3:33 PM, Christoph Hellwig wrote: > > On Tue, Feb 26, 2019 at 02:49:30PM +0100, Hannes Reinecke wrote: > >> Attached is a patch to demonstrate my approach. > >> I am aware that it'll only be useful for latest upstream where the legacy > >> I/O path has been dropped completely, so we wouldn't need to worry about > >> it. > >> For older releases indeed you would need to with something like your > >> proposed patch, but for upstream I really would like to switch to > >> blk_mq_tagset_busy() iter. > > > > While this is better than the driver private tracking we really should > > not have to iterate all outstanding command, because if we have any > > there is a bug we need to fix in the higher layers instead of working > > around it in the drivers. Hi Chris, Looking at other drivers' code, I think a similar issue impacts many scsi HBAs (like fnic, qla4xxx, snic, etc.). They also need similar logic to traverse outstanding scsi commands. One example: at the time of an HBA reset, the driver wants to release scsi commands back to the SML for retry, and for that the driver needs to loop over all possible smids to figure out which scsi commands are outstanding at the Firmware/SML level. > > > Ah-ha. > > But what else should we be doing here? > The driver needs to abort all outstanding commands; I somewhat fail to > see how we could be doing it otherwise ... Hannes, The primary drawback of using "blk_mq_tagset_busy_iter" is that the API is blk-mq only and is not available in all kernels with blk-mq support. We have seen multiple failures from customers, and those kernels do not support blk_mq_tagset_busy_iter. In fact, both the blk-mq and non-mq stacks are alive in many Linux distributions, and customers are using them. If we scope the fix to just the current 5.x kernels (which do not have non-mq support), we can opt for blk_mq_tagset_busy_iter().
Earlier I requested that upstream accept the driver changes without blk_mq_tagset_busy_iter() because the issue is hitting many customers, so we are looking for a generic fix which can serve both blk-mq as well as non-mq. We will certainly enhance and optimize this area (the driver trying to find outstanding scsi commands) with the scsi mailing list as an incremental approach. Kashyap > > Cheers, > > Hannes >
[scsi] write same with NDOB (no data-out buffer) support
Hi Martin/Chris, I was going through the below discussion as well as the linux scsi code to find out whether the linux scsi stack supports NDOB (no data-out buffer). It looks like the current scsi stack does not have NDOB support. From the below thread, it looks like NDOB was not considered a good feature to support, given that device-level support for it was not reliable. The discussion below is a bit old, so I want to understand whether the current situation is good enough to introduce NDOB support (assuming the underlying device works as expected with NDOB)? https://lore.kernel.org/patchwork/patch/673046/ One more question: in the current scsi stack, what happens if a WS w/ UNMAP command is passed to the device without a zeroed data-out buffer? Will it permanently disable WS on that device? Thanks, Kashyap