Re: Kernel panic w/ message request_threaded_irq -> qla2x00_request_irqs -> qla2x00_probe_one -> mod_timer
On 4/15/2019 10:26 PM, TomK wrote: On 4/15/2019 3:35 PM, Laurence Oberman wrote: On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote: On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote: On Sun, 2019-04-14 at 23:25 -0400, TomK wrote:

Hey All,

I'm getting a kernel panic on a Gigabyte GA-890XA-UD3 motherboard that has a QLE2464 card in it, configured as an FC target. The kernel has been crashing / panicking about once a week for the last 1-2 months. Before that, it was rock solid for 4-5 years. I've upgraded to kernel 4.18.19, but that hasn't made much of a difference. Since the message includes qla2x00_request_irqs, I thought I would try here first. I tried to get more info on this, but:

1) The keyboard doesn't work and locks up when the panic occurs. No USB ports work. Tried the PS/2 port, but nothing.
2) Unable to capture a kdump. Can't get to the kdump vmcore because of 1).

The two screenshots are pretty much all I can capture. I tried things like clocksource=rtc in the kernel parameters and disabling hpet1, but apparently I haven't disabled it everywhere since it still shows up. Does anyone recognize these messages, or have any idea what could be the issue here? Even a hint would be appreciated.

Hello Tom,

I have had similar issues and reported them to Himanshu@Cavium. I have kept all my target servers at kernel 4.5, as it has been the only version that has always been stable. If your motherboard has an NMI (virtual or physical), set all of these in /etc/sysctl.conf, then run sysctl -p; dracut -f and reboot:

kernel.nmi_watchdog = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1

When the issue shows up, press the virtual/physical NMI. This is with the assumption that generic kdump is properly set up, that dmesg | grep crash shows memory reserved by the crashkernel, and that you have tested kdump manually. Another option is to use a USB serial port to capture the full log if you cannot get kdump to work.
That approach may provide further evidence about kernel bugs, but it is not guaranteed to lead to a solution. It would help if either or both of you could do the following on a test system:

* Check out the qla2xxx-for-next branch of my kernel repo on GitHub (https://github.com/bvanassche/linux/tree/qla2xxx-for-next).
* Enable lockdep and KASAN in the kernel config (CONFIG_PROVE_LOCKING and CONFIG_KASAN).
* Build and install that kernel.
* Run your favorite workload.

Please note that the qla2xxx-for-next branch is based on the v5.1-rc1 kernel and hence should not be installed on any production system.

Thanks, Bart.

Hello Bart,

OK, I will get to this by Thursday; I won't be able to change the target server kernel until then. Regards, Laurence

Same. I'll try this out closer to the weekend. Not an NMI motherboard; this is a 9-10 year old AMD board meant as a desktop or home server. I'll have to read more about the USB serial port to capture further info. That's interesting. For the time being, I've disabled HPET in the BIOS (it appears the kernel boot parameter method wasn't enough).

Hey Guys,

Did some of what you suggested, including the USB serial setup:

1) One DB9 RS232 serial null modem cable (F/F)
2) Two USB to RS232 serial port (DB9 9-pin) adapters

However, when the kernel came down it took the USB support with it, and so minicom went offline:

CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2 | VT102 | Offline

But I did enable full logging for the qla2xxx module:

echo 0x7fff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

Did all that, minus the kernel v5.1-rc1 installation, and this is what was picked up from the minicom USB-to-serial capture before things went south:

1235905 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[9c9ea758] qp 0
1235906 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00096ae11b7] qp 0
1235907 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00001738f793] qp 0
1235908 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[0000e8160a90] qp 0
1235909 Detected MISCOMPARE for addr: 33045258 buf: f9849912
1235910 Target/fileio: Send MISCOMPARE check condition and sense
1235911 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[363ae214] qp 0
1235912 qla2xxx [:04:00.0]-e817: Skipping EXPLICIT_CONFORM and CTIO7_FLAGS_CONFORM_REQ for FCP READ w/ non GOOD status
1235913 qla2xxx [:04:00.0]-e874:2: qlt_free_cmd: se_cmd[1db805fd] ox_id 00c8
1235914 qla2xxx [:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00db
1235915 qla2xxx [:04:00.0]
[PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list
Hi,

Since the conversion to blk-mq, a big pre-allocation for the sg list has been introduced, which is very unfriendly wrt. memory consumption. There were Red Hat internal reports that some scsi_debug based tests can't be run any more because of the too-big pre-allocation. Also, lpfc users complained that 1GB+ of RAM is pre-allocated for a single HBA.

sg_alloc_table_chained() is improved in the 1st patch to support a variable size for the 1st pre-allocated SGL, as suggested by Christoph. The other two patches try to address this issue by allocating the sg list at runtime, meanwhile pre-allocating one or two inline sg entries for small IO. This approach follows NVMe's approach wrt. sg list allocation.

V4:
	- add parameter to sg_alloc_table_chained()/sg_free_table_chained() directly, and update current callers
V3:
	- improve sg_alloc_table_chained() to accept a variable size for the 1st pre-allocated SGL
	- apply the improved sg API to address the big pre-allocation issue
V2:
	- move inline sg table initialization into one helper
	- introduce new helper for getting inline sg
	- commit log fix

Ming Lei (3):
  lib/sg_pool.c: improve APIs for allocating sg pool
  scsi: core: avoid to pre-allocate big chunk for protection meta data
  scsi: core: avoid to pre-allocate big chunk for sg list

 drivers/nvme/host/fc.c            |  7 ---
 drivers/nvme/host/rdma.c          |  7 ---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 31 ++-
 include/linux/scatterlist.h       | 11 +++
 lib/scatterlist.c                 | 36 +++-
 lib/sg_pool.c                     | 37 +++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 92 insertions(+), 46 deletions(-)

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D. Milne
Cc: Hannes Reinecke
Cc: Sagi Grimberg
Cc: Chuck Lever
Cc: net...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
-- 
2.9.5
[PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
sg_alloc_table_chained() now allows the user to provide one pre-allocated SGL, and returns early if the requested number of entries isn't bigger than the size of that SGL. This is nice for inlining a small SGL into a small IO request. However, the scatter-gather code only allows the size of the 1st pre-allocated SGL to be SG_CHUNK_SIZE (128), and this isn't flexible or useful, because it may take too much memory (4KB) to pre-allocate one SGL of that size for each IO request, especially since the block layer always pre-allocates the IO request structure. Instead, it is friendlier to pre-allocate one small inline SGL just for small IO.

Introduce one extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying the size of the pre-allocated SGL, so that the 'first_chunk' SGL can include any number of entries. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one; change the code to allow both to accept a variable size for the 1st pre-allocated SGL.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D.
Milne
Cc: Hannes Reinecke
Cc: Sagi Grimberg
Cc: Chuck Lever
Cc: net...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
Suggested-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/nvme/host/fc.c            |  7 ---
 drivers/nvme/host/rdma.c          |  7 ---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 10 ++
 include/linux/scatterlist.h       | 11 +++
 lib/scatterlist.c                 | 36 +++-
 lib/sg_pool.c                     | 37 +++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 31637f8ef22e..9b42ef82a273 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2112,7 +2112,8 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	freq->sg_table.sgl = freq->first_sgl;
 	ret = sg_alloc_table_chained(&freq->sg_table,
-			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -2122,7 +2123,7 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 	freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl,
 				op->nents, dir);
 	if (unlikely(freq->sg_cnt <= 0)) {
-		sg_free_table_chained(&freq->sg_table, true);
+		sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 		freq->sg_cnt = 0;
 		return -EFAULT;
 	}
@@ -2148,7 +2149,7 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	nvme_cleanup_cmd(rq);
 
-	sg_free_table_chained(&freq->sg_table, true);
+	sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 
 	freq->sg_cnt = 0;
 }
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 11a5ecae78c8..2022a196d2a7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1155,7 +1155,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 			WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 
 	nvme_cleanup_cmd(rq);
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 }
 
 static int nvme_rdma_set_sg_null(struct nvme_command *c)
@@ -1270,7 +1270,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 	req->sg_table.sgl = req->first_sgl;
 	ret = sg_alloc_table_chained(&req->sg_table,
-			blk_rq_nr_phys_segments(rq), req->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -1310,7 +1311,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 			req->nents, rq_data_dir(rq) == WRITE ?
 			DMA_TO_DEVICE : DMA_FROM_DEVICE);
 out_free_table:
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 	return ret;
 }
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index b9f623ab01f3..b3e2f42a3644 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -77,7 +77,7 @@ static void nvme_loop_complete_rq(struct request *req)
 	struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(req);
 
 	nvme_cleanup_cmd(req);
-	sg_free_table_chained(&iod->sg_table, true);
+	sg_free_table_chained(&iod->sg_table, SG_CHUNK_SIZE);
 	nvme_complete_rq(req);
 }
 
@@ -171,7 +171,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	iod->sg_table.sgl = iod->fir
Re: [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
Looks good: Reviewed-by: Christoph Hellwig
[PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data
Now scsi_mq_setup_tags() pre-allocates a big buffer for protection sg entries, and the buffer size is scsi_mq_sgl_size(). This isn't correct: scsi_mq_sgl_size() is used to pre-allocate sg entries for IO data, and the protection data buffer is much smaller. For example, one 512-byte sector needs 8 bytes of protection data, and the max sector number for one request is 2560 (BLK_DEF_MAX_SECTORS), so the max protection data size is just 20KB.

The usual case is that one bio builds one single bip segment. Thanks to bio splitting, bio merging is seldom done for big IO, and is only done for small bios. The protection data segment number is usually the same as the bio count in the request, so the number won't be very big, and allocating from the slab is fast enough.

Reduce the pre-allocation to one sg entry for protection data, and switch to runtime allocation in case the protection data segment number is bigger than 1. Then we can save a huge pre-allocation; for example, 500+MB is saved on a single lpfc HBA.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D. Milne
Cc: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c37263c123eb..2eaba41655de 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -39,6 +39,12 @@
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
+/*
+ * Size of integrity metadata is usually small, 1 inline sg should
+ * cover normal cases.
+ */
+#define SCSI_INLINE_PROT_SG_CNT	1
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -543,7 +549,8 @@ static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 	if (cmd->sdb.table.nents)
 		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
 	if (scsi_prot_sg_count(cmd))
-		sg_free_table_chained(&cmd->prot_sdb->table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->prot_sdb->table,
+				SCSI_INLINE_PROT_SG_CNT);
 }
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
@@ -1032,7 +1039,7 @@ blk_status_t scsi_init_io(struct scsi_cmnd *cmd)
 		if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
 				prot_sdb->table.sgl,
-				SG_CHUNK_SIZE)) {
+				SCSI_INLINE_PROT_SG_CNT)) {
 			ret = BLK_STS_RESOURCE;
 			goto out_free_sgtables;
 		}
@@ -1820,7 +1827,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	sgl_size = scsi_mq_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
-		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+		cmd_size += sizeof(struct scsi_data_buffer) +
+			sizeof(struct scatterlist) * SCSI_INLINE_PROT_SG_CNT;
 
 	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
 	shost->tag_set.ops = &scsi_mq_ops;
-- 
2.9.5
[PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
Now scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg list, and the buffer size is scsi_mq_sgl_size(), which depends on the smaller value of shost->sg_tablesize and SG_CHUNK_SIZE. Modern HBA DMA is often capable of dealing with a very big segment number, so scsi_mq_sgl_size() is often big. If the max sg number of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.

Then if one HBA has lots of queues, and each hw queue's depth is high, the pre-allocation for the sg list can consume huge memory. For example, on lpfc, nr_hw_queues can be 70 and each queue's depth can be 3781, so the pre-allocation for the data sg list is 70 * 3781 * 2KB = 517MB for a single HBA. There is a Red Hat internal report that scsi_debug based tests can't be run any more, since the legacy io path has been killed, because of the too-big pre-allocation.

So switch to runtime allocation for the sg list, meanwhile pre-allocating 2 inline sg entries. This approach has been applied to NVMe PCI for a while, so it should be fine for SCSI too. Also, runtime sg entry allocation has been verified before and always ran in the original legacy io path; I see no performance effect in my big-BS test on scsi_debug.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D.
Milne
Cc: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 15 +-
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2eaba41655de..472d848f1778 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@
  */
 #define SCSI_INLINE_PROT_SG_CNT	1
 
+#define SCSI_INLINE_SG_CNT	2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -547,7 +549,8 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->sdb.table,
+				SCSI_INLINE_SG_CNT);
 	if (scsi_prot_sg_count(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table,
 				SCSI_INLINE_PROT_SG_CNT);
@@ -984,7 +987,7 @@ static blk_status_t scsi_init_sgtable(struct request *req,
 	 */
 	if (unlikely(sg_alloc_table_chained(&sdb->table,
 			blk_rq_nr_phys_segments(req), sdb->table.sgl,
-			SG_CHUNK_SIZE)))
+			SCSI_INLINE_SG_CNT)))
 		return BLK_STS_RESOURCE;
 
 	/*
@@ -1550,9 +1553,9 @@ static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1730,7 +1733,7 @@ static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1824,7 +1827,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +
-- 
2.9.5
[PATCH V8 2/7] blk-mq: move cancel of requeue_work into blk_mq_release
While the queue's kobject refcount is held, it is safe for the driver to schedule a requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done, because of concurrent requeue activity, so the requeue work may not be completed when the queue is freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the race between requeue and freeing the queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..93dc588fabe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc60ed7e940e..89781309a108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2634,6 +2634,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5
[PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
Just like aio/io_uring, we need to grab 2 refcounts for queuing one request: one for submission, another for completion. If the request isn't queued from the plug code path, the refcount grabbed in generic_make_request() serves for submission. In theory, this refcount should be released after the submission (async run queue) is done. blk_freeze_queue() works together with blk_sync_queue() to avoid a race between cleaning up the queue and IO submission: given that async run-queue activity is canceled, and hctx->run_work is scheduled with the refcount held, it is fine not to hold the refcount when running the run-queue work function for dispatching IO.

However, if the request is staged into the plug list and finally queued from the plug code path, the refcount on the submission side is actually missing. And we may start to run the queue after the queue is removed, because the queue's kobject refcount isn't guaranteed to be grabbed in the flush-plug-list context, so a kernel oops is triggered. See the following race:

blk_mq_flush_plug_list():
	blk_mq_sched_insert_requests()
		insert requests to sw queue or scheduler queue
		blk_mq_run_hw_queue

Because of concurrent run queues, all requests inserted above may be completed before calling the above blk_mq_run_hw_queue(). Then the queue can be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe because the queue is definitely alive before inserting the request.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index aa6bc5c02643..dfe83e7935d6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 {
 	struct elevator_queue *e;
 
+	/*
+	 * blk_mq_sched_insert_requests() is called from flush plug
+	 * context only, and hold one usage counter to prevent queue
+	 * from being released.
+	 */
+	percpu_ref_get(&hctx->queue->q_usage_counter);
+
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.insert_requests)
 		e->type->ops.insert_requests(hctx, list, false);
@@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 	}
 
 	blk_mq_run_hw_queue(hctx, run_queue_async);
+
+	percpu_ref_put(&hctx->queue->q_usage_counter);
 }
 
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
-- 
2.9.5
[PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
In the normal queue cleanup path, hctx is released after the request queue is freed, see blk_mq_release(). However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because the hw queues are shrinking. This can easily cause a use-after-free, because one implicit rule is that it is safe to call almost all block layer APIs if the request queue is alive: one hctx may be retrieved by such an API, then that hctx can be freed by blk_mq_update_nr_hw_queues(), and finally a use-after-free is triggered.

Fix this issue by always freeing hctx after the request queue is released. If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold them, then try to reuse these hctxs if the numa node matches.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq.c         | 46 +-
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 44ecca6b0cac..a37090038c96 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,6 +2268,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->unused_hctx_lock);
+	list_add(&hctx->hctx_list, &q->unused_hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2355,6 +2359,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2668,15 +2674,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
+	/* all hctx are in .unused_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
@@ -2743,9 +2751,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	/* reuse dead hctx first */
+	spin_lock(&q->unused_hctx_lock);
+	list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
+
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
@@ -2783,10 +2804,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2817,9 +2836,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2862,6 +2879,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->unused_hctx_list);
+	spin_lock_init(&q->unused_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d
[PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
Split blk_mq_alloc_and_init_hctx() into two parts: one is blk_mq_alloc_hctx() for allocating all hctx resources, the other is blk_mq_init_hctx() for initializing the hctx, which serves as the counterpart of blk_mq_exit_hctx().

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 138 -
 1 file changed, 77 insertions(+), 61 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d98cb9614dfa..44ecca6b0cac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
+{
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			__alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
-	int node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
+static struct blk_mq_hw_ctx *
+blk_mq_alloc_hctx(struct request_queue *q,
+		struct blk_mq_tag_set *set,
+		unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2300,10 +2355,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2311,47 +2362,38 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops-
[PATCH V8 3/7] blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2, we were allowed to run the queue during queue cleanup if the queue's kobject refcount was held. After that commit, the queue can't be run during queue cleanup; otherwise an oops can be triggered easily, because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases. Recently, James reported another such issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approach, given that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing hctx's resources in its release handler. This is safe because tags aren't needed for freeing such hctx resources. This approach follows the typical design pattern wrt. a kobject's release handler.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J.
Bottomley , Reported-by: James Smart Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free") Cc: sta...@vger.kernel.org Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 2 +- block/blk-mq-sysfs.c | 6 ++ block/blk-mq.c | 8 ++-- block/blk-mq.h | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 93dc588fabe2..2dd94b3e9ece 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q) blk_exit_queue(q); if (queue_is_mq(q)) - blk_mq_free_queue(q); + blk_mq_exit_queue(q); percpu_ref_exit(&q->q_usage_counter); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3f9c3f4ac44c..4040e62c3737 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -10,6 +10,7 @@ #include #include +#include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) { struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + + if (hctx->flags & BLK_MQ_F_BLOCKING) + cleanup_srcu_struct(hctx->srcu); + blk_free_flush_queue(hctx->fq); + sbitmap_free(&hctx->ctx_map); free_cpumask_var(hctx->cpumask); kfree(hctx->ctxs); kfree(hctx); diff --git a/block/blk-mq.c b/block/blk-mq.c index 89781309a108..d98cb9614dfa 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2267,12 +2267,7 @@ static void blk_mq_exit_hctx(struct request_queue *q, if (set->ops->exit_hctx) set->ops->exit_hctx(hctx, hctx_idx); - if (hctx->flags & BLK_MQ_F_BLOCKING) - cleanup_srcu_struct(hctx->srcu); - blk_mq_remove_cpuhp(hctx); - blk_free_flush_queue(hctx->fq); - sbitmap_free(&hctx->ctx_map); } static void blk_mq_exit_hw_queues(struct request_queue *q, @@ -2907,7 +2902,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, } EXPORT_SYMBOL(blk_mq_init_allocated_queue); -void blk_mq_free_queue(struct 
request_queue *q) +/* tags can _not_ be used after returning from blk_mq_exit_queue */ +void blk_mq_exit_queue(struct request_queue *q) { struct blk_mq_tag_set *set = q->tag_set; diff --git a/block/blk-mq.h b/block/blk-mq.h index 423ea88ab6fb..633a5a77ee8b 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -37,7 +37,7 @@ struct blk_mq_ctx { struct kobject kobj; } cacheline_aligned_in_smp; -void blk_mq_free_queue(struct request_queue *q); +void blk_mq_exit_queue(struct request_queue *q); int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool); -- 2.9.5
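The kobject release-handler pattern the patch relies on can be sketched outside the kernel as a refcounted object whose resources are freed only when the last reference is dropped. This is a hypothetical userspace mock (`mock_hctx`, `mock_hctx_put`, etc. are invented names), not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace mock of the kobject release pattern:
 * resources are freed in the release handler, which runs only when
 * the last reference is dropped -- so whoever still holds a ref can
 * safely touch the object even after "cleanup" has begun. */
struct mock_hctx {
	int refcount;
	int *ctx_map;		/* stands in for sbitmap, flush queue, ... */
	int released;		/* observable marker for testing */
};

static void mock_hctx_release(struct mock_hctx *h)
{
	free(h->ctx_map);	/* freeing here needs no tags */
	h->ctx_map = NULL;
	h->released = 1;
}

static void mock_hctx_get(struct mock_hctx *h)
{
	h->refcount++;
}

static void mock_hctx_put(struct mock_hctx *h)
{
	if (--h->refcount == 0)
		mock_hctx_release(h);
}
```

Because the final put may happen long after blk_cleanup_queue() returns, freeing in the release handler is only safe for resources that, like these, don't depend on the tag set.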
[PATCH V8 0/7] blk-mq: fix races related with freeing queue
Hi, Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running a queue isn't allowed during queue cleanup even though the queue refcount is held. This change has caused lots of kernel oopses triggered in the run-queue path, and it turns out it isn't easy to fix them all. So move the freeing of hw queue resources into hctx's release handler; then the above issue is fixed. Meanwhile, this approach is safe given that freeing hw queue resources doesn't require tags. V3 covers more races. V8: - merge the 4th and 5th of V7 into one patch, as suggested by Christoph - drop the 9th patch V7: - add reviewed-by and tested-by tag - rename "dead_hctx" as "unused_hctx" - check if there are live hctx in queue's release handler - only patch 6 is modified V6: - remove previous SCSI patch which will be routed via SCSI tree - add reviewed-by tag - fix one related NVMe scan vs reset race V5: - refactor blk_mq_alloc_and_init_hctx() - fix race related to updating nr_hw_queues by always freeing hctx after request queue is released V4: - add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues - fix comment in the last patch V3: - cancel q->requeue_work in queue's release handler - cancel hctx->run_work in hctx's release handler - add patch 1 for fixing race in plug code path - the last patch is added for avoiding grabbing SCSI's refcount in IO path V2: - moving freeing hw queue resources into hctx's release handler Ming Lei (7): blk-mq: grab .q_usage_counter when queuing request from plug code path blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: free hw queue's resource in hctx's release handler blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: always free hctx after request queue is freed blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release block: don't drain in-progress dispatch in blk_cleanup_queue() block/blk-core.c | 23 +- block/blk-mq-sched.c | 9 +++ block/blk-mq-sysfs.c | 8 +++ block/blk-mq.c | 188 + block/blk-mq.h | 2 +- include/linux/blk-mq.h | 2 + 
include/linux/blkdev.h | 7 ++ 7 files changed, 139 insertions(+), 100 deletions(-) Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , -- 2.9.5
[PATCH V8 7/7] block: don't drain in-progress dispatch in blk_cleanup_queue()
Now that freeing of hw queue resources has been moved to hctx's release handler, we don't need to worry about the race between blk_cleanup_queue() and run queue any more. So don't drain in-progress dispatch in blk_cleanup_queue(). This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before freeing queue"). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index f5b5f21ae4fd..e24cfcefdc19 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q) blk_queue_flag_set(QUEUE_FLAG_DEAD, q); - /* -* make sure all in-progress dispatch are completed because -* blk_freeze_queue() can only complete all requests, and -* dispatch may still be in-progress since we dispatch requests -* from more than one contexts. -* -* We rely on driver to deal with the race in case that queue -* initialization isn't done. -*/ - if (queue_is_mq(q) && blk_queue_init_done(q)) - blk_mq_quiesce_queue(q); - /* for synchronous bio-based driver finish in-flight integrity i/o */ blk_flush_integrity(); -- 2.9.5
[PATCH V8 6/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
hctx is always released after the request queue is freed. While the queue's kobject refcount is held, it is safe for a driver to run the queue, so a run-queue work item might be scheduled after blk_sync_queue() is done. So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to avoid running a released queue. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 8 -------- block/blk-mq-sysfs.c | 2 ++ 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2dd94b3e9ece..f5b5f21ae4fd 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q) { del_timer_sync(&q->timeout); cancel_work_sync(&q->timeout_work); - - if (queue_is_mq(q)) { - struct blk_mq_hw_ctx *hctx; - int i; - - queue_for_each_hw_ctx(q, hctx, i) - cancel_delayed_work_sync(&hctx->run_work); - } } EXPORT_SYMBOL(blk_sync_queue); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 4040e62c3737..25c0d0a6a556 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + cancel_delayed_work_sync(&hctx->run_work); + if (hctx->flags & BLK_MQ_F_BLOCKING) cleanup_srcu_struct(hctx->srcu); blk_free_flush_queue(hctx->fq); -- 2.9.5
Re: [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
> +static struct blk_mq_hw_ctx * > +blk_mq_alloc_hctx(struct request_queue *q, > + struct blk_mq_tag_set *set, Nit: The second parameter would easily fit on the first line. > + unsigned hctx_idx, int node) > +{ > + struct blk_mq_hw_ctx *hctx; > + > + hctx = kzalloc_node(blk_mq_hw_ctx_size(set), > + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, > + node); > + if (!hctx) > + goto fail_alloc_hctx; > + > + if (!zalloc_cpumask_var_node(&hctx->cpumask, > + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, > + node)) Nit: I still think a local variable for the gfp_t would be very useful here. > + atomic_set(&hctx->nr_active, 0); > + hctx->numa_node = node; > if (node == NUMA_NO_NODE) > - node = hctx->numa_node = set->numa_node; > + hctx->numa_node = set->numa_node; > + node = hctx->numa_node; Why not: if (node == NUMA_NO_NODE) node = set->numa_node; hctx->numa_node = node; ? Otherwise looks fine: Reviewed-by: Christoph Hellwig
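Christoph's suggested rewrite is the usual fall-back-once idiom. A hypothetical standalone sketch (`resolve_numa_node` is an invented helper; `NUMA_NO_NODE` mirrors the kernel's definition of -1):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)	/* mirrors the kernel's definition */

/* Invented helper illustrating the suggested rewrite: fall back to
 * the tag set's node once, then use the single result for both
 * hctx->numa_node and the subsequent node-local allocations. */
static int resolve_numa_node(int node, int set_node)
{
	if (node == NUMA_NO_NODE)
		node = set_node;
	return node;
}
```

The point is that `node` is assigned exactly once, so the stored `numa_node` and the allocation node can never disagree.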
Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote: > In normal queue cleanup path, hctx is released after request queue > is freed, see blk_mq_release(). > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because > of hw queues shrinking. This can easily cause a use-after-free, > because: one implicit rule is that it is safe to call almost all block > layer APIs if the request queue is alive; and one hctx may be retrieved > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); > finally use-after-free is triggered. > > Fix this issue by always freeing hctx after releasing request queue. > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce > a per-queue list to hold them, then try to reuse these hctxs if numa > node is matched. This seems a little odd. Wouldn't it be much simpler to just keep the hctx where it is, that is leave the queue_hw_ctx[] pointer intact, but have a flag marking it dead?
Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
On Sun, Apr 28, 2019 at 02:14:26PM +0200, Christoph Hellwig wrote: > On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote: > > In normal queue cleanup path, hctx is released after request queue > > is freed, see blk_mq_release(). > > > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because > > of hw queues shrinking. This can easily cause a use-after-free, > > because: one implicit rule is that it is safe to call almost all block > > layer APIs if the request queue is alive; and one hctx may be retrieved > > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); > > finally use-after-free is triggered. > > > > Fix this issue by always freeing hctx after releasing request queue. > > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce > > a per-queue list to hold them, then try to reuse these hctxs if numa > > node is matched. > > This seems a little odd. Wouldn't it be much simpler to just keep > the hctx where it is, that is leave the queue_hw_ctx[] pointer intact, > but have a flag marking it dead? There are several issues with that solution: 1) q->nr_hw_queues no longer matches the number of really active hw queues 2) if a hctx is only marked as dead and not freed, then after several rounds of updating nr_hw_queues, more and more hctx instances are wasted 3) q->queue_hw_ctx[] has to be re-allocated when nr_hw_queues is increased. So I think this patch is the simpler approach, both in concept and in implementation. Thanks, Ming
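The reuse scheme described above — keep removed hctxs on a per-queue list and hand one back when the NUMA node matches — can be sketched as follows (hypothetical userspace mock; `unused_hctx` and `reuse_hctx` are invented names):

```c
#include <assert.h>
#include <stddef.h>

struct unused_hctx {
	int numa_node;
	struct unused_hctx *next;
};

/* Pop a cached hctx whose node matches, or return NULL if none does;
 * a real caller would then allocate a fresh hctx on a miss. */
static struct unused_hctx *reuse_hctx(struct unused_hctx **list, int node)
{
	struct unused_hctx **pp;

	for (pp = list; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			struct unused_hctx *h = *pp;

			*pp = h->next;	/* unlink from the cache list */
			h->next = NULL;
			return h;
		}
	}
	return NULL;
}
```

On a miss the caller allocates a fresh hctx, so the list never grows beyond what a previous, larger nr_hw_queues left behind.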
Re: [RFC PATCH 1/4] fdomain: Resurrect driver (core)
On Wednesday 24 April 2019 08:02:12 Christoph Hellwig wrote: > > +static void fdomain_work(struct work_struct *work) > > +{ > > + struct fdomain *fd = container_of(work, struct fdomain, work); > > + struct Scsi_Host *sh = container_of((void *)fd, struct Scsi_Host, > > + hostdata); > > This looks odd. We should never need a void cast for container_of. This cast is present in all drivers involving container_of, struct Scsi_Host and hostdata. hostdata in struct Scsi_Host is defined as "unsigned long hostdata[0]"... -- Ondrej Zary
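A minimal userspace mock shows the layout that makes this idiom work: the driver's private struct lives in the trailing hostdata array, so recovering the host is container_of on the hostdata member, with the (void *) cast papering over the unsigned long[] vs. private-struct type mismatch. Here `mock_host`, `mock_fdomain`, and this simplified `container_of` are invented for illustration; the kernel's container_of additionally type-checks ptr against the member, which is exactly what forces the cast in the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified container_of -- the kernel version also type-checks
 * ptr against the member's declared type. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_host {			/* stands in for struct Scsi_Host */
	int host_no;
	unsigned long hostdata[];	/* driver-private area, as in the kernel */
};

struct mock_fdomain {			/* stands in for struct fdomain */
	int base;
};

static struct mock_host *host_from_priv(struct mock_fdomain *fd)
{
	/* fd points at hostdata but has a different type, hence (void *) */
	return container_of((void *)fd, struct mock_host, hostdata);
}
```

Since hostdata sits at a fixed offset inside the host structure, subtracting that offset from the private pointer always lands back on the enclosing host.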
[PATCH 0/4] fdomain: Resurrect driver (modular version)
Resurrect the previously removed fdomain driver, in modern style. Initialization is rewritten completely, with support for multiple cards and no more global state variables. Most of the code from the interrupt handler is moved to a workqueue. This is a modularized version with the core separated from the bus-specific drivers (PCI, ISA and PCMCIA). Only the PCI driver is tested for now. The other two could be dropped until they get tested. Changes since RFC: - multi-line comment style, some coding style - usage of scsi_k(un)map_atomic_sg for buffer access - static marking of fdomain_host_reset and fdomain_template - IRQ lookup moved to ISA bus driver - EXPORT_SYMBOL_GPL -- Ondrej Zary
[PATCH 3/4] fdomain: Resurrect driver (ISA support)
Future Domain 16xx ISA SCSI card support. Currently untested. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 14 +++ drivers/scsi/Makefile | 1 + drivers/scsi/fdomain_isa.c | 231 + 3 files changed, 246 insertions(+) create mode 100644 drivers/scsi/fdomain_isa.c diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index f9d058a07e2a..2ea77dafbc00 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -684,6 +684,20 @@ config SCSI_FDOMAIN_PCI To compile this driver as a module, choose M here: the module will be called fdomain_pci. +config SCSI_FDOMAIN_ISA + tristate "Future Domain 16xx ISA SCSI support" + depends on ISA && SCSI + select CHECK_SIGNATURE + select SCSI_FDOMAIN + help + This is support for Future Domain's 16-bit SCSI host adapters + (TMC-1660/1680, TMC-1650/1670, TMC-1610M/MER/MEX) and other adapters + with ISA bus based on the Future Domain chipsets (Quantum ISA-200S, + ISA-250MG; and at least one IBM board). + + To compile this driver as a module, choose M here: the + module will be called fdomain_isa. 
+ config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index f6cc4fbe6957..2e2c777d 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -78,6 +78,7 @@ obj-$(CONFIG_SCSI_ISCI) += isci/ obj-$(CONFIG_SCSI_IPS) += ips.o obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o obj-$(CONFIG_SCSI_FDOMAIN_PCI) += fdomain_pci.o +obj-$(CONFIG_SCSI_FDOMAIN_ISA) += fdomain_isa.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain_isa.c b/drivers/scsi/fdomain_isa.c new file mode 100644 index ..3008d5118d7a --- /dev/null +++ b/drivers/scsi/fdomain_isa.c @@ -0,0 +1,231 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include "fdomain.h" + +#define MAXBOARDS_PARAM 4 +static int io[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(io, int, ioport, NULL, 0); +MODULE_PARM_DESC(io, "base I/O address of controller (0x140, 0x150, 0x160, 0x170)"); + +static int irq[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(irq, int, irq, NULL, 0); +MODULE_PARM_DESC(irq, "IRQ of controller (0=auto [default])"); + +static int scsi_id[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(scsi_id, int, other, NULL, 0); +MODULE_PARM_DESC(scsi_id, "SCSI ID of controller (default = 6)"); + +static unsigned long addresses[] = { + 0xc8000, + 0xca000, + 0xce000, + 0xde000, +}; +#define ADDRESS_COUNT ARRAY_SIZE(addresses) + +static unsigned short ports[] = { 0x140, 0x150, 0x160, 0x170 }; +#define PORT_COUNT ARRAY_SIZE(ports) + +static unsigned short irqs[] = { 3, 5, 10, 11, 12, 14, 15, 0 }; + +/* This driver works *ONLY* for Future Domain cards using the TMC-1800, + * TMC-18C50, or TMC-18C30 chip. This includes models TMC-1650, 1660, 1670, + * and 1680. These are all 16-bit cards. 
+ * + * The following BIOS signatures are for boards which do *NOT* + * work with this driver (these TMC-8xx and TMC-9xx boards may work with the + * Seagate driver): + * + * FUTURE DOMAIN CORP. (C) 1986-1988 V4.0I 03/16/88 + * FUTURE DOMAIN CORP. (C) 1986-1989 V5.0C2/14/89 + * FUTURE DOMAIN CORP. (C) 1986-1989 V6.0A7/28/89 + * FUTURE DOMAIN CORP. (C) 1986-1990 V6.0105/31/90 + * FUTURE DOMAIN CORP. (C) 1986-1990 V6.0209/18/90 + * FUTURE DOMAIN CORP. (C) 1986-1990 V7.009/18/90 + * FUTURE DOMAIN CORP. (C) 1992 V8.00.004/02/92 + * + * (The cards which do *NOT* work are all 8-bit cards -- although some of + * them have a 16-bit form-factor, the upper 8-bits are used only for IRQs + * and are *NOT* used for data. You can tell the difference by following + * the tracings on the circuit board -- if only the IRQ lines are involved, + * you have an "8-bit" card, and should *NOT* use this driver.) + */ + +static struct signature signatures[] = { +/* 1 2 3 4 5 6 */ +/* 123456789012345678901234567890123456789012345678901234567890 */ +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89", 5, 50, 2, 0, 0 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V1.07/28/89", 5, 50, 2, 0, 0 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89", 72, 50, 2, 0, 2 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.0", 73, 43, 2, 0, 3 }, +{ "FUTURE DOMAIN CORP. (C) 1991 1800-V2.0.", 72, 39, 2, 0, 4 }, +{ "FUTURE DOMAIN CORP. (C) 1992 V3.00.004/02/92", 5, 44, 3, 0, 0 }, +{ "FUTURE DOMAIN TMC-18XX (C) 1993 V3.203/12/93", 5, 44, 3, 2, 0 }, +{ "IBM F1 P2 BIOS v1.0104/29/93",
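The signature table above drives a scan of candidate BIOS regions; the in-kernel driver uses check_signature() over ioremapped memory, but the comparison it performs can be sketched roughly like this (`match_bios` and this reduced `struct sig` are invented for illustration):

```c
#include <assert.h>
#include <string.h>

struct sig {
	const char *signature;
	int offset;	/* byte offset of the string within the BIOS image */
	int length;	/* number of bytes to compare */
};

/* Return the index of the first matching signature, or -1 if none
 * matches anywhere in the candidate BIOS image. */
static int match_bios(const unsigned char *bios, size_t bios_len,
		      const struct sig *sigs, int nsigs)
{
	int i;

	for (i = 0; i < nsigs; i++) {
		const struct sig *s = &sigs[i];

		if ((size_t)(s->offset + s->length) > bios_len)
			continue;	/* signature would run off the image */
		if (!memcmp(bios + s->offset, s->signature, s->length))
			return i;
	}
	return -1;
}
```

Each entry records where in the BIOS image the string lives and how many bytes to compare, which is what the offset/length columns in signatures[] encode.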
[PATCH 1/4] fdomain: Resurrect driver (core)
Future Domain TMC-16xx/TMC-3260 SCSI driver. This is the core driver, common for PCI, ISA and PCMCIA cards. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 4 + drivers/scsi/Makefile | 1 + drivers/scsi/fdomain.c | 600 + drivers/scsi/fdomain.h | 63 ++ 4 files changed, 668 insertions(+) create mode 100644 drivers/scsi/fdomain.c create mode 100644 drivers/scsi/fdomain.h diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index d528018e6fa8..3d6b1f47cbb5 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -663,6 +663,10 @@ config SCSI_DMX3191D To compile this driver as a module, choose M here: the module will be called dmx3191d. +config SCSI_FDOMAIN + tristate + depends on SCSI + config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index 8826111fdf4a..b8fbc6d2de54 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -76,6 +76,7 @@ obj-$(CONFIG_SCSI_AIC94XX)+= aic94xx/ obj-$(CONFIG_SCSI_PM8001) += pm8001/ obj-$(CONFIG_SCSI_ISCI)+= isci/ obj-$(CONFIG_SCSI_IPS) += ips.o +obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain.c b/drivers/scsi/fdomain.c new file mode 100644 index ..edf75c7f8e3f --- /dev/null +++ b/drivers/scsi/fdomain.c @@ -0,0 +1,600 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for Future Domain TMC-16x0 and TMC-3260 SCSI host adapters + * Copyright 2019 Ondrej Zary + * + * Original driver by + * Rickard E. 
Faith, fa...@cs.unc.edu + * + * Future Domain BIOS versions supported for autodetect: + *2.0, 3.0, 3.2, 3.4 (1.0), 3.5 (2.0), 3.6, 3.61 + * Chips supported: + *TMC-1800, TMC-18C50, TMC-18C30, TMC-36C70 + * Boards supported: + *Future Domain TMC-1650, TMC-1660, TMC-1670, TMC-1680, TMC-1610M/MER/MEX + *Future Domain TMC-3260 (PCI) + *Quantum ISA-200S, ISA-250MG + *Adaptec AHA-2920A (PCI) [BUT *NOT* AHA-2920C -- use aic7xxx instead] + *IBM ? + * + * NOTE: + * + * The Adaptec AHA-2920C has an Adaptec AIC-7850 chip on it. + * Use the aic7xxx driver for this board. + * + * The Adaptec AHA-2920A has a Future Domain chip on it, so this is the right + * driver for that card. Unfortunately, the boxes will probably just say + * "2920", so you'll have to look on the card for a Future Domain logo, or a + * letter after the 2920. + * + * If you have a TMC-8xx or TMC-9xx board, then this is not the driver for + * your board. + * + * DESCRIPTION: + * + * This is the Linux low-level SCSI driver for Future Domain TMC-1660/1680 + * TMC-1650/1670, and TMC-3260 SCSI host adapters. The 1650 and 1670 have a + * 25-pin external connector, whereas the 1660 and 1680 have a SCSI-2 50-pin + * high-density external connector. The 1670 and 1680 have floppy disk + * controllers built in. The TMC-3260 is a PCI bus card. + * + * Future Domain's older boards are based on the TMC-1800 chip, and this + * driver was originally written for a TMC-1680 board with the TMC-1800 chip. + * More recently, boards are being produced with the TMC-18C50 and TMC-18C30 + * chips. + * + * Please note that the drive ordering that Future Domain implemented in BIOS + * versions 3.4 and 3.5 is the opposite of the order (currently) used by the + * rest of the SCSI industry. + * + * + * REFERENCES USED: + * + * "TMC-1800 SCSI Chip Specification (FDC-1800T)", Future Domain Corporation, + * 1990. + * + * "Technical Reference Manual: 18C50 SCSI Host Adapter Chip", Future Domain + * Corporation, January 1992. 
+ * + * "LXT SCSI Products: Specifications and OEM Technical Manual (Revision + * B/September 1991)", Maxtor Corporation, 1991. + * + * "7213S product Manual (Revision P3)", Maxtor Corporation, 1992. + * + * "Draft Proposed American National Standard: Small Computer System + * Interface - 2 (SCSI-2)", Global Engineering Documents. (X3T9.2/86-109, + * revision 10h, October 17, 1991) + * + * Private communications, Drew Eckhardt (d...@cs.colorado.edu) and Eric + * Youngdale (er...@cais.com), 1992. + * + * Private communication, Tuong Le (Future Domain Engineering department), + * 1994. (Disk geometry computations for Future Domain BIOS version 3.4, and + * TMC-18C30 detection.) + * + * Hogan, Thom. The Programmer's PC Sourcebook. Microsoft Press, 1988. Page + * 60 (2.39: Disk Partition Table Layout). + * + * "18C30 Technical Reference Manual", Future Domain Corporation, 1993, page + * 6-1. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "fdomain.h" + +/* + * FIFO_COUNT: The host adapter has an 8K cache (host adapters based on the + * 18
[PATCH 4/4] fdomain: Resurrect driver (PCMCIA support)
Future Domain PCMCIA SCSI card support. Currently untested. Signed-off-by: Ondrej Zary --- drivers/scsi/pcmcia/Kconfig | 10 + drivers/scsi/pcmcia/Makefile | 1 + drivers/scsi/pcmcia/fdomain_cs.c | 89 3 files changed, 100 insertions(+) create mode 100644 drivers/scsi/pcmcia/fdomain_cs.c diff --git a/drivers/scsi/pcmcia/Kconfig b/drivers/scsi/pcmcia/Kconfig index 2d435f105b16..169d93f90a30 100644 --- a/drivers/scsi/pcmcia/Kconfig +++ b/drivers/scsi/pcmcia/Kconfig @@ -19,6 +19,16 @@ config PCMCIA_AHA152X To compile this driver as a module, choose M here: the module will be called aha152x_cs. +config PCMCIA_FDOMAIN + tristate "Future Domain PCMCIA support" + select SCSI_FDOMAIN + help + Say Y here if you intend to attach this type of PCMCIA SCSI host + adapter to your computer. + + To compile this driver as a module, choose M here: the + module will be called fdomain_cs. + config PCMCIA_NINJA_SCSI tristate "NinjaSCSI-3 / NinjaSCSI-32Bi (16bit) PCMCIA support" depends on !64BIT diff --git a/drivers/scsi/pcmcia/Makefile b/drivers/scsi/pcmcia/Makefile index a5a24dd44e7e..02f5b44a2685 100644 --- a/drivers/scsi/pcmcia/Makefile +++ b/drivers/scsi/pcmcia/Makefile @@ -4,6 +4,7 @@ ccflags-y := -I $(srctree)/drivers/scsi # 16-bit client drivers obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogic_cs.o +obj-$(CONFIG_PCMCIA_FDOMAIN) += fdomain_cs.o obj-$(CONFIG_PCMCIA_AHA152X) += aha152x_cs.o obj-$(CONFIG_PCMCIA_NINJA_SCSI)+= nsp_cs.o obj-$(CONFIG_PCMCIA_SYM53C500) += sym53c500_cs.o diff --git a/drivers/scsi/pcmcia/fdomain_cs.c b/drivers/scsi/pcmcia/fdomain_cs.c new file mode 100644 index ..eee1379ad67b --- /dev/null +++ b/drivers/scsi/pcmcia/fdomain_cs.c @@ -0,0 +1,89 @@ +// SPDX-License-Identifier: (GPL-2.0 OR MPL-1.1) +/* + * Driver for Future Domain-compatible PCMCIA SCSI cards + * Copyright 2019 Ondrej Zary + * + * The initial developer of the original code is David A. Hinds + * . Portions created by David A. Hinds + * are Copyright (C) 1999 David A. Hinds. All Rights Reserved. 
+ */ + +#include +#include +#include +#include +#include +#include "fdomain.h" + +MODULE_AUTHOR("Ondrej Zary, David Hinds"); +MODULE_DESCRIPTION("Future Domain PCMCIA SCSI driver"); +MODULE_LICENSE("Dual MPL/GPL"); + +static int fdomain_config_check(struct pcmcia_device *p_dev, void *priv_data) +{ + p_dev->io_lines = 10; + p_dev->resource[0]->end = 0x10; + p_dev->resource[0]->flags &= ~IO_DATA_PATH_WIDTH; + p_dev->resource[0]->flags |= IO_DATA_PATH_WIDTH_AUTO; + return pcmcia_request_io(p_dev); +} + +static int fdomain_probe(struct pcmcia_device *link) +{ + int ret; + struct Scsi_Host *sh; + + link->config_flags |= CONF_ENABLE_IRQ | CONF_AUTO_SET_IO; + link->config_regs = PRESENT_OPTION; + + ret = pcmcia_loop_config(link, fdomain_config_check, NULL); + if (ret) + return ret; + + ret = pcmcia_enable_device(link); + if (ret) + goto fail; + + sh = fdomain_create(link->resource[0]->start, link->irq, 7, 0, NULL, + &link->dev); + if (!sh) { + dev_err(&link->dev, "Controller initialization failed"); + ret = -ENODEV; + goto fail; + } + + link->priv = sh; + + return 0; + +fail: + pcmcia_disable_device(link); + return ret; +} + +static void fdomain_remove(struct pcmcia_device *link) +{ + fdomain_destroy(link->priv); + pcmcia_disable_device(link); +} + +static const struct pcmcia_device_id fdomain_ids[] = { + PCMCIA_DEVICE_PROD_ID12("IBM Corp.", "SCSI PCMCIA Card", 0xe3736c88, + 0x859cad20), + PCMCIA_DEVICE_PROD_ID1("SCSI PCMCIA Adapter Card", 0x8dacb57e), + PCMCIA_DEVICE_PROD_ID12(" SIMPLE TECHNOLOGY Corporation", + "SCSI PCMCIA Credit Card Controller", + 0x182bdafe, 0xc80d106f), + PCMCIA_DEVICE_NULL, +}; +MODULE_DEVICE_TABLE(pcmcia, fdomain_ids); + +static struct pcmcia_driver fdomain_cs_driver = { + .owner = THIS_MODULE, + .name = "fdomain_cs", + .probe = fdomain_probe, + .remove = fdomain_remove, + .id_table = fdomain_ids, +}; + +module_pcmcia_driver(fdomain_cs_driver); -- Ondrej Zary
[PATCH 2/4] fdomain: Resurrect driver (PCI support)
Future Domain TMC-3260/AHA-2920A PCI card support. Tested on Adaptec AHA-2920A PCI card. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 17 drivers/scsi/Makefile | 1 + drivers/scsi/fdomain_pci.c | 68 ++ 3 files changed, 86 insertions(+) create mode 100644 drivers/scsi/fdomain_pci.c diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 3d6b1f47cbb5..f9d058a07e2a 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -667,6 +667,23 @@ config SCSI_FDOMAIN tristate depends on SCSI +config SCSI_FDOMAIN_PCI + tristate "Future Domain TMC-3260/AHA-2920A PCI SCSI support" + depends on PCI && SCSI + select SCSI_FDOMAIN + help + This is support for Future Domain's PCI SCSI host adapters (TMC-3260) + and other adapters with PCI bus based on the Future Domain chipsets + (Adaptec AHA-2920A). + + NOTE: Newer Adaptec AHA-2920C boards use the Adaptec AIC-7850 chip + and should use the aic7xxx driver ("Adaptec AIC7xxx chipset SCSI + controller support"). This Future Domain driver works with the older + Adaptec AHA-2920A boards with a Future Domain chip on them. + + To compile this driver as a module, choose M here: the + module will be called fdomain_pci. 
+ config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index b8fbc6d2de54..f6cc4fbe6957 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -77,6 +77,7 @@ obj-$(CONFIG_SCSI_PM8001) += pm8001/ obj-$(CONFIG_SCSI_ISCI)+= isci/ obj-$(CONFIG_SCSI_IPS) += ips.o obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o +obj-$(CONFIG_SCSI_FDOMAIN_PCI) += fdomain_pci.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain_pci.c b/drivers/scsi/fdomain_pci.c new file mode 100644 index ..381a7157c078 --- /dev/null +++ b/drivers/scsi/fdomain_pci.c @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include "fdomain.h" + +static int fdomain_pci_probe(struct pci_dev *pdev, +const struct pci_device_id *d) +{ + int err; + struct Scsi_Host *sh; + + err = pci_enable_device(pdev); + if (err) + goto fail; + + err = pci_request_regions(pdev, "fdomain_pci"); + if (err) + goto disable_device; + + err = -ENODEV; + if (pci_resource_len(pdev, 0) == 0) + goto release_region; + + sh = fdomain_create(pci_resource_start(pdev, 0), pdev->irq, 7, 0, NULL, + &pdev->dev); + if (!sh) + goto release_region; + + pci_set_drvdata(pdev, sh); + return 0; + +release_region: + pci_release_regions(pdev); +disable_device: + pci_disable_device(pdev); +fail: + return err; +} + +static void fdomain_pci_remove(struct pci_dev *pdev) +{ + struct Scsi_Host *sh = pci_get_drvdata(pdev); + + fdomain_destroy(sh); + pci_release_regions(pdev); + pci_disable_device(pdev); +} + +static struct pci_device_id fdomain_pci_table[] = { + { PCI_DEVICE(PCI_VENDOR_ID_FD, PCI_DEVICE_ID_FD_36C70) }, + {} +}; +MODULE_DEVICE_TABLE(pci, fdomain_pci_table); + +static struct pci_driver fdomain_pci_driver = { + .name = "fdomain_pci", + .id_table = fdomain_pci_table, + 
.probe = fdomain_pci_probe, + .remove = fdomain_pci_remove, + .driver.pm = FDOMAIN_PM_OPS, +}; + +module_pci_driver(fdomain_pci_driver); + +MODULE_AUTHOR("Ondrej Zary, Rickard E. Faith"); +MODULE_DESCRIPTION("Future Domain TMC-3260 PCI SCSI driver"); +MODULE_LICENSE("GPL"); -- Ondrej Zary
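fdomain_pci_probe() above follows the kernel's goto-unwind idiom: each failure jumps to a label that undoes only the steps already completed, in reverse order (enable → request regions → create, unwound as release → disable). A hypothetical userspace sketch of the same control flow (`do_step`, `undo_step`, and `probe` are invented):

```c
#include <assert.h>

static int step_done[3];

static int do_step(int i, int should_fail)
{
	if (should_fail)
		return -1;
	step_done[i] = 1;
	return 0;
}

static void undo_step(int i)
{
	step_done[i] = 0;
}

/* Mirrors the probe's shape: on failure at step N, unwind
 * steps N-1 .. 0 in reverse order and return the error. */
static int probe(int fail_at)
{
	int err;

	err = do_step(0, fail_at == 0);	/* like pci_enable_device() */
	if (err)
		goto fail;
	err = do_step(1, fail_at == 1);	/* like pci_request_regions() */
	if (err)
		goto undo0;
	err = do_step(2, fail_at == 2);	/* like fdomain_create() */
	if (err)
		goto undo1;
	return 0;

undo1:
	undo_step(1);
undo0:
	undo_step(0);
fail:
	return err;
}
```

Reading the tail of such a function bottom-up gives the teardown order, which is why the labels are conventionally named after what gets undone next.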
Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
On 4/28/19 10:14 AM, Ming Lei wrote: Split blk_mq_alloc_and_init_hctx into two parts, and one is blk_mq_alloc_hctx() for allocating all hctx resources, another is blk_mq_init_hctx() for initializing hctx, which serves as counter-part of blk_mq_exit_hctx(). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-mq.c | 138 - 1 file changed, 77 insertions(+), 61 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index d98cb9614dfa..44ecca6b0cac 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q, } } +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set) +{ + int hw_ctx_size = sizeof(struct blk_mq_hw_ctx); + + BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu), + __alignof__(struct blk_mq_hw_ctx)) != +sizeof(struct blk_mq_hw_ctx)); + + if (tag_set->flags & BLK_MQ_F_BLOCKING) + hw_ctx_size += sizeof(struct srcu_struct); + + return hw_ctx_size; +} + static int blk_mq_init_hctx(struct request_queue *q, struct blk_mq_tag_set *set, struct blk_mq_hw_ctx *hctx, unsigned hctx_idx) { - int node; + hctx->queue_num = hctx_idx; - node = hctx->numa_node; + cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead); + + hctx->tags = set->tags[hctx_idx]; + + if (set->ops->init_hctx && + set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) + goto unregister_cpu_notifier; + + if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, + hctx->numa_node)) + goto exit_hctx; + return 0; + + exit_hctx: + if (set->ops->exit_hctx) + set->ops->exit_hctx(hctx, hctx_idx); + unregister_cpu_notifier: + blk_mq_remove_cpuhp(hctx); + return -1; +} + +static struct blk_mq_hw_ctx * +blk_mq_alloc_hctx(struct request_queue *q, + struct blk_mq_tag_set *set, + unsigned hctx_idx, int node) +{ + 
struct blk_mq_hw_ctx *hctx; + + hctx = kzalloc_node(blk_mq_hw_ctx_size(set), + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, + node); + if (!hctx) + goto fail_alloc_hctx; + + if (!zalloc_cpumask_var_node(&hctx->cpumask, + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, + node)) + goto free_hctx; + + atomic_set(&hctx->nr_active, 0); + hctx->numa_node = node; if (node == NUMA_NO_NODE) - node = hctx->numa_node = set->numa_node; + hctx->numa_node = set->numa_node; + node = hctx->numa_node; INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn); spin_lock_init(&hctx->lock); The 'hctx_idx' argument is now unused, and should be removed from the function definition. Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)