Re: Kernel panic w/ message request_threaded_irq -> qla2x00_request_irqs -> qla2x00_probe_one -> mod_timer
On 4/15/2019 10:26 PM, TomK wrote: On 4/15/2019 3:35 PM, Laurence Oberman wrote: On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote: On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote: On Sun, 2019-04-14 at 23:25 -0400, TomK wrote:

Hey All,

I'm getting a kernel panic on a Gigabyte GA-890XA-UD3 motherboard that has a QLE2464 card in it, configured as an FC target. The kernel has been crashing / panicking about once a week for the last 1-2 months. Before that, it was rock solid for 4-5 years. I've upgraded to kernel 4.18.19, but that hasn't made much of a difference. Since the message includes qla2x00_request_irqs, I thought I would try here first. I tried to get more info on this, but:

1) The keyboard doesn't work and locks up when the panic occurs. No USB ports work. Tried the PS/2 port, but nothing.
2) Unable to capture a kdump. Can't get to the kdump vmcore because of 1).

The two screenshots are pretty much all I can capture. I tried things like clocksource=rtc in the kernel parameters and disabling hpet1, but apparently I haven't disabled it everywhere since it still shows up. Does anyone recognize these messages, or have any idea what could be the issue here? Even a hint would be appreciated.

Hello Tom,

I have had similar issues and reported them to Himanshu@Cavium. I have kept all my target servers at kernel 4.5, as it has been the only version that has always been stable. If your motherboard has an NMI (virtual or physical), set all of these in /etc/sysctl.conf, then run sysctl -p; dracut -f and reboot:

kernel.nmi_watchdog = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.unknown_nmi_panic = 1

When the issue shows up, press the virtual/physical NMI. This is with the assumption that generic kdump is properly set up, that dmesg | grep crash shows memory reserved by the crashkernel, and that you have tested kdump manually. Another option is to use a USB serial port to capture the full log if you cannot get kdump to work.
That approach may provide further evidence about kernel bugs, but it is not guaranteed to lead to a solution. It would help if either or both of you could do the following on a test system:

* Check out the qla2xxx-for-next branch of my kernel repo on GitHub (https://github.com/bvanassche/linux/tree/qla2xxx-for-next).
* Enable lockdep and KASAN in the kernel config (CONFIG_PROVE_LOCKING and CONFIG_KASAN).
* Build and install that kernel.
* Run your favorite workload.

Please note that the qla2xxx-for-next branch is based on the v5.1-rc1 kernel and hence should not be installed on any production system.

Thanks, Bart.

Hello Bart,

OK, I will get to this by Thursday; I won't be able to change the target server kernel until then. Regards, Laurence

Same. I'll try this out closer to the weekend. Not an NMI motherboard; this is a 9-10 year old AMD board meant as a desktop or home server. I'll have to read more about the USB serial port to capture further info. That's interesting. For the time being, I've disabled HPET in the BIOS (it appears the kernel boot parameter method wasn't enough).

Hey Guys,

Did some of what you suggested, including the USB serial setup:

1) One DB9 RS232 serial null modem cable (F/F)
2) Two USB to RS232 serial port (DB9 9-pin) adapters

However, when the kernel came down it took the USB support with it, and so minicom went offline:

CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2 | VT102 | Offline

But I did enable full logging for the qla2xxx module:

echo 0x7fff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

Did all that, minus the kernel v5.1-rc1 installation, and this is what was picked up from the minicom USB-to-serial capture before things went south:

1235905 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[9c9ea758] qp 0
1235906 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00096ae11b7] qp 0
1235907 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00001738f793] qp 0
1235908 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[0000e8160a90] qp 0
1235909 Detected MISCOMPARE for addr: 33045258 buf: f9849912
1235910 Target/fileio: Send MISCOMPARE check condition and sense
1235911 qla2xxx [:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[363ae214] qp 0
1235912 qla2xxx [:04:00.0]-e817: Skipping EXPLICIT_CONFORM and CTIO7_FLAGS_CONFORM_REQ for FCP READ w/ non GOOD status
1235913 qla2xxx [:04:00.0]-e874:2: qlt_free_cmd: se_cmd[1db805fd] ox_id 00c8
1235914 qla2xxx [:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00db
1235915 qla2xxx [:04:00.0]
[PATCH V4 0/3] scsi: core: avoid big pre-allocation for sg list
Hi,

Since the conversion to blk-mq, a big pre-allocation for the sg list has been introduced, which is very unfriendly wrt. memory consumption. There were Red Hat internal reports that some scsi_debug based tests can't be run any more because of the too-big pre-allocation. Also, lpfc users complained that 1GB+ of RAM is pre-allocated for a single HBA.

sg_alloc_table_chained() is improved in the 1st patch to support a variable size for the 1st pre-allocated SGL, as suggested by Christoph. The other two patches try to address this issue by allocating the sg list at runtime, meanwhile pre-allocating one or two inline sg entries for small IO. This approach follows NVMe's approach wrt. sg list allocation.

V4:
	- add parameter to sg_alloc_table_chained()/sg_free_table_chained() directly, and update current callers
V3:
	- improve sg_alloc_table_chained() to accept a variable size for the 1st pre-allocated SGL
	- apply the improved sg API to address the big pre-allocation issue
V2:
	- move inline sg table initialization into one helper
	- introduce new helper for getting inline sg
	- commit log fix

Ming Lei (3):
  lib/sg_pool.c: improve APIs for allocating sg pool
  scsi: core: avoid to pre-allocate big chunk for protection meta data
  scsi: core: avoid to pre-allocate big chunk for sg list

 drivers/nvme/host/fc.c            |  7 ---
 drivers/nvme/host/rdma.c          |  7 ---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 31 ++-
 include/linux/scatterlist.h       | 11 +++
 lib/scatterlist.c                 | 36 +++-
 lib/sg_pool.c                     | 37 +++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 92 insertions(+), 46 deletions(-)

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D. Milne
Cc: Hannes Reinecke
Cc: Sagi Grimberg
Cc: Chuck Lever
Cc: net...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
-- 
2.9.5
[PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
sg_alloc_table_chained() now allows the user to provide one pre-allocated SGL, and returns early if the requested number of entries isn't bigger than the size of that SGL. This is nice for inlining a small SGL into a small IO request. However, the scatter-gather code only allows the size of the 1st pre-allocated SGL to be SG_CHUNK_SIZE (128), and this isn't flexible or useful, because it may take too much memory (4KB) to pre-allocate one SGL of that size for each IO request, especially since the block layer always pre-allocates the IO request structure. Instead, it is friendlier to pre-allocate one small inline SGL just for small IO.

Introduce one extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying the size of the pre-allocated SGL, so that the 'first_chunk' SGL can include any number of entries. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one; change the code to allow both to accept a variable size for the 1st pre-allocated SGL.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D.
Milne
Cc: Hannes Reinecke
Cc: Sagi Grimberg
Cc: Chuck Lever
Cc: net...@vger.kernel.org
Cc: linux-n...@lists.infradead.org
Suggested-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/nvme/host/fc.c            |  7 ---
 drivers/nvme/host/rdma.c          |  7 ---
 drivers/nvme/target/loop.c        |  4 ++--
 drivers/scsi/scsi_lib.c           | 10 ++
 include/linux/scatterlist.h       | 11 +++
 lib/scatterlist.c                 | 36 +++-
 lib/sg_pool.c                     | 37 +++--
 net/sunrpc/xprtrdma/svc_rdma_rw.c |  5 +++--
 8 files changed, 76 insertions(+), 41 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 31637f8ef22e..9b42ef82a273 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2112,7 +2112,8 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	freq->sg_table.sgl = freq->first_sgl;
 	ret = sg_alloc_table_chained(&freq->sg_table,
-			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), freq->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -2122,7 +2123,7 @@ nvme_fc_map_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 	freq->sg_cnt = fc_dma_map_sg(ctrl->lport->dev, freq->sg_table.sgl,
 				op->nents, dir);
 	if (unlikely(freq->sg_cnt <= 0)) {
-		sg_free_table_chained(&freq->sg_table, true);
+		sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 		freq->sg_cnt = 0;
 		return -EFAULT;
 	}
@@ -2148,7 +2149,7 @@ nvme_fc_unmap_data(struct nvme_fc_ctrl *ctrl, struct request *rq,
 
 	nvme_cleanup_cmd(rq);
 
-	sg_free_table_chained(&freq->sg_table, true);
+	sg_free_table_chained(&freq->sg_table, SG_CHUNK_SIZE);
 
 	freq->sg_cnt = 0;
 }
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 11a5ecae78c8..2022a196d2a7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1155,7 +1155,7 @@ static void nvme_rdma_unmap_data(struct nvme_rdma_queue *queue,
 			WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);
 
 	nvme_cleanup_cmd(rq);
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 }
 
 static int nvme_rdma_set_sg_null(struct nvme_command *c)
@@ -1270,7 +1270,8 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 
 	req->sg_table.sgl = req->first_sgl;
 	ret = sg_alloc_table_chained(&req->sg_table,
-			blk_rq_nr_phys_segments(rq), req->sg_table.sgl);
+			blk_rq_nr_phys_segments(rq), req->sg_table.sgl,
+			SG_CHUNK_SIZE);
 	if (ret)
 		return -ENOMEM;
 
@@ -1310,7 +1311,7 @@ static int nvme_rdma_map_data(struct nvme_rdma_queue *queue,
 			req->nents, rq_data_dir(rq) == WRITE ?
 			DMA_TO_DEVICE : DMA_FROM_DEVICE);
 out_free_table:
-	sg_free_table_chained(&req->sg_table, true);
+	sg_free_table_chained(&req->sg_table, SG_CHUNK_SIZE);
 	return ret;
 }
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index b9f623ab01f3..b3e2f42a3644 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -77,7 +77,7 @@ static void nvme_loop_complete_rq(struct request *req)
 	struct nvme_loop_iod *iod = blk_mq_rq_to_pdu(req);
 
 	nvme_cleanup_cmd(req);
-	sg_free_table_chained(&iod->sg_table, true);
+	sg_free_table_chained(&iod->sg_table, SG_CHUNK_SIZE);
 	nvme_complete_rq(req);
 }
 
@@ -171,7 +171,7 @@ static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 	iod->sg_table.sgl = iod->fir
Re: [PATCH V4 1/3] lib/sg_pool.c: improve APIs for allocating sg pool
Looks good: Reviewed-by: Christoph Hellwig
[PATCH V4 2/3] scsi: core: avoid to pre-allocate big chunk for protection meta data
Now scsi_mq_setup_tags() pre-allocates a big buffer for protection sg entries, and the buffer size is scsi_mq_sgl_size(). This isn't correct: scsi_mq_sgl_size() is used to pre-allocate sg entries for IO data, and the protection data buffer is much smaller. For example, one 512-byte sector needs 8 bytes of protection data, and the max sector number for one request is 2560 (BLK_DEF_MAX_SECTORS), so the max protection data size is just 20KB.

The usual case is that one bio builds one single bip segment. Thanks to bio splitting, bio merging is seldom done for big IO, and is only done for small bios. The protection data segment number is usually the same as the bio count in the request, so the number won't be very big, and allocating from the slab is fast enough.

Reduce the pre-allocation to one sg entry for protection data, and switch to runtime allocation in case the protection data segment number is bigger than 1. Then we can save a huge pre-allocation; for example, 500+MB is saved on a single lpfc HBA.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D. Milne
Cc: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c37263c123eb..2eaba41655de 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -39,6 +39,12 @@
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
+/*
+ * Size of integrity metadata is usually small, 1 inline sg should
+ * cover normal cases.
+ */
+#define SCSI_INLINE_PROT_SG_CNT	1
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -543,7 +549,8 @@ static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 	if (cmd->sdb.table.nents)
 		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
 	if (scsi_prot_sg_count(cmd))
-		sg_free_table_chained(&cmd->prot_sdb->table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->prot_sdb->table,
+				SCSI_INLINE_PROT_SG_CNT);
 }
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
@@ -1032,7 +1039,7 @@ blk_status_t scsi_init_io(struct scsi_cmnd *cmd)
 		if (sg_alloc_table_chained(&prot_sdb->table, ivecs,
 				prot_sdb->table.sgl,
-				SG_CHUNK_SIZE)) {
+				SCSI_INLINE_PROT_SG_CNT)) {
 			ret = BLK_STS_RESOURCE;
 			goto out_free_sgtables;
 		}
@@ -1820,7 +1827,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	sgl_size = scsi_mq_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
-		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+		cmd_size += sizeof(struct scsi_data_buffer) +
+			sizeof(struct scatterlist) * SCSI_INLINE_PROT_SG_CNT;
 
 	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
 	shost->tag_set.ops = &scsi_mq_ops;
-- 
2.9.5
[PATCH V4 3/3] scsi: core: avoid to pre-allocate big chunk for sg list
Now scsi_mq_setup_tags() pre-allocates a big buffer for the IO sg list, and the buffer size is scsi_mq_sgl_size(), which depends on the smaller value of shost->sg_tablesize and SG_CHUNK_SIZE. Modern HBA DMA is often capable of dealing with a very big segment number, so scsi_mq_sgl_size() is often big. If the max sg number of SG_CHUNK_SIZE is taken, scsi_mq_sgl_size() will be 4KB.

Then if one HBA has lots of queues, and each hw queue's depth is high, the pre-allocation for the sg list can consume huge memory. For example, on lpfc, nr_hw_queues can be 70 and each queue's depth can be 3781, so the pre-allocation for the data sg list is 70 * 3781 * 2KB = 517MB for a single HBA. There is a Red Hat internal report that scsi_debug based tests can't be run any more, since the legacy io path has been killed, because of the too-big pre-allocation.

So switch to runtime allocation for the sg list, meanwhile pre-allocating 2 inline sg entries. This approach has been applied to NVMe PCI for a while, so it should be fine for SCSI too. Also, runtime sg entry allocation has been verified before and always ran in the original legacy io path; I see no performance effect in my big-BS test on scsi_debug.

Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Ewan D.
Milne
Cc: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Signed-off-by: Ming Lei
---
 drivers/scsi/scsi_lib.c | 15 +-
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2eaba41655de..472d848f1778 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -45,6 +45,8 @@
  */
 #define SCSI_INLINE_PROT_SG_CNT	1
 
+#define SCSI_INLINE_SG_CNT	2
+
 static struct kmem_cache *scsi_sdb_cache;
 static struct kmem_cache *scsi_sense_cache;
 static struct kmem_cache *scsi_sense_isadma_cache;
@@ -547,7 +549,8 @@ static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
 static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		sg_free_table_chained(&cmd->sdb.table, SG_CHUNK_SIZE);
+		sg_free_table_chained(&cmd->sdb.table,
+				SCSI_INLINE_SG_CNT);
 	if (scsi_prot_sg_count(cmd))
 		sg_free_table_chained(&cmd->prot_sdb->table,
 				SCSI_INLINE_PROT_SG_CNT);
@@ -984,7 +987,7 @@ static blk_status_t scsi_init_sgtable(struct request *req,
 	 */
 	if (unlikely(sg_alloc_table_chained(&sdb->table,
 			blk_rq_nr_phys_segments(req), sdb->table.sgl,
-			SG_CHUNK_SIZE)))
+			SCSI_INLINE_SG_CNT)))
 		return BLK_STS_RESOURCE;
 
 	/*
@@ -1550,9 +1553,9 @@ static int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 }
 
 /* Size in bytes of the sg-list stored in the scsi-mq command-private data. */
-static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
+static unsigned int scsi_mq_inline_sgl_size(struct Scsi_Host *shost)
 {
-	return min_t(unsigned int, shost->sg_tablesize, SG_CHUNK_SIZE) *
+	return min_t(unsigned int, shost->sg_tablesize, SCSI_INLINE_SG_CNT) *
 		sizeof(struct scatterlist);
 }
 
@@ -1730,7 +1733,7 @@ static int scsi_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 	if (scsi_host_get_prot(shost)) {
 		sg = (void *)cmd + sizeof(struct scsi_cmnd) +
 			shost->hostt->cmd_size;
-		cmd->prot_sdb = (void *)sg + scsi_mq_sgl_size(shost);
+		cmd->prot_sdb = (void *)sg + scsi_mq_inline_sgl_size(shost);
 	}
 
 	return 0;
@@ -1824,7 +1827,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
 
-	sgl_size = scsi_mq_sgl_size(shost);
+	sgl_size = scsi_mq_inline_sgl_size(shost);
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
 	if (scsi_host_get_prot(shost))
 		cmd_size += sizeof(struct scsi_data_buffer) +
-- 
2.9.5
[PATCH V8 2/7] blk-mq: move cancel of requeue_work into blk_mq_release
While the queue's kobject refcount is held, it is safe for the driver to schedule a requeue. However, blk_mq_kick_requeue_list() may be called after blk_sync_queue() is done, because of concurrent requeue activity, so the requeue work may not be completed when the queue is freed, and a kernel oops is triggered.

So move the cancel of requeue_work into blk_mq_release() to avoid the race between requeue and freeing the queue.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-core.c | 1 -
 block/blk-mq.c   | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..93dc588fabe2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -237,7 +237,6 @@ void blk_sync_queue(struct request_queue *q)
 		struct blk_mq_hw_ctx *hctx;
 		int i;
 
-		cancel_delayed_work_sync(&q->requeue_work);
 		queue_for_each_hw_ctx(q, hctx, i)
 			cancel_delayed_work_sync(&hctx->run_work);
 	}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fc60ed7e940e..89781309a108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2634,6 +2634,8 @@ void blk_mq_release(struct request_queue *q)
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
+	cancel_delayed_work_sync(&q->requeue_work);
+
 	/* hctx kobj stays in hctx */
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx)
-- 
2.9.5
[PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
Just like aio/io_uring, we need to grab 2 refcounts for queuing one request: one for submission, another for completion. If the request isn't queued from the plug code path, the refcount grabbed in generic_make_request() serves for submission. In theory, this refcount should be released after the submission (async run queue) is done. blk_freeze_queue() works together with blk_sync_queue() to avoid a race between cleaning up the queue and IO submission: given that async run-queue activity is canceled, and hctx->run_work is scheduled with the refcount held, it is fine not to hold the refcount when running the run-queue work function for dispatching IO.

However, if the request is staged into the plug list and finally queued from the plug code path, the refcount on the submission side is actually missing. And we may start to run the queue after the queue is removed, because the queue's kobject refcount isn't guaranteed to be grabbed in the flush-plug-list context, so a kernel oops is triggered. See the following race:

blk_mq_flush_plug_list():
	blk_mq_sched_insert_requests()
		insert requests to sw queue or scheduler queue
		blk_mq_run_hw_queue

Because of concurrent run queues, all requests inserted above may be completed before calling the above blk_mq_run_hw_queue(). Then the queue can be freed during the above blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe because the queue is definitely alive before inserting the request.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Bart Van Assche
Reviewed-by: Johannes Thumshirn
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq-sched.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index aa6bc5c02643..dfe83e7935d6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 {
 	struct elevator_queue *e;
 
+	/*
+	 * blk_mq_sched_insert_requests() is called from flush plug
+	 * context only, and hold one usage counter to prevent queue
+	 * from being released.
+	 */
+	percpu_ref_get(&hctx->queue->q_usage_counter);
+
 	e = hctx->queue->elevator;
 	if (e && e->type->ops.insert_requests)
 		e->type->ops.insert_requests(hctx, list, false);
@@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
 	}
 
 	blk_mq_run_hw_queue(hctx, run_queue_async);
+
+	percpu_ref_put(&hctx->queue->q_usage_counter);
 }
 
 static void blk_mq_sched_free_tags(struct blk_mq_tag_set *set,
-- 
2.9.5
[PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
In the normal queue cleanup path, hctx is released after the request queue is freed, see blk_mq_release(). However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because the hw queues are shrinking. This can easily cause a use-after-free, because one implicit rule is that it is safe to call almost all block layer APIs if the request queue is alive: one hctx may be retrieved by such an API, then that hctx can be freed by blk_mq_update_nr_hw_queues(), and finally a use-after-free is triggered.

Fix this issue by always freeing hctx after the request queue is released. If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold them, then try to reuse these hctxs if the numa node matches.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq.c         | 46 +-
 include/linux/blk-mq.h |  2 ++
 include/linux/blkdev.h |  7 +++
 3 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 44ecca6b0cac..a37090038c96 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,6 +2268,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		set->ops->exit_hctx(hctx, hctx_idx);
 
 	blk_mq_remove_cpuhp(hctx);
+
+	spin_lock(&q->unused_hctx_lock);
+	list_add(&hctx->hctx_list, &q->unused_hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -2355,6 +2359,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	INIT_LIST_HEAD(&hctx->hctx_list);
+
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2668,15 +2674,17 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
  */
 void blk_mq_release(struct request_queue *q)
 {
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
+	struct blk_mq_hw_ctx *hctx, *next;
+	int i;
 
 	cancel_delayed_work_sync(&q->requeue_work);
 
-	/* hctx kobj stays in hctx */
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx)
-			continue;
+	queue_for_each_hw_ctx(q, hctx, i)
+		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
+
+	/* all hctx are in .unused_hctx_list now */
+	list_for_each_entry_safe(hctx, next, &q->unused_hctx_list, hctx_list) {
+		list_del_init(&hctx->hctx_list);
 		kobject_put(&hctx->kobj);
 	}
@@ -2743,9 +2751,22 @@ static struct blk_mq_hw_ctx *blk_mq_alloc_and_init_hctx(
 		struct blk_mq_tag_set *set, struct request_queue *q,
 		int hctx_idx, int node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = NULL, *tmp;
 
-	hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
+	/* reuse dead hctx first */
+	spin_lock(&q->unused_hctx_lock);
+	list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
+		if (tmp->numa_node == node) {
+			hctx = tmp;
+			break;
+		}
+	}
+	if (hctx)
+		list_del_init(&hctx->hctx_list);
+	spin_unlock(&q->unused_hctx_lock);
+
+	if (!hctx)
+		hctx = blk_mq_alloc_hctx(q, set, hctx_idx, node);
 	if (!hctx)
 		goto fail;
@@ -2783,10 +2804,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		hctx = blk_mq_alloc_and_init_hctx(set, q, i, node);
 		if (hctx) {
-			if (hctxs[i]) {
+			if (hctxs[i])
 				blk_mq_exit_hctx(q, set, hctxs[i], i);
-				kobject_put(&hctxs[i]->kobj);
-			}
 			hctxs[i] = hctx;
 		} else {
 			if (hctxs[i])
@@ -2817,9 +2836,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			if (hctx->tags)
 				blk_mq_free_map_and_requests(set, j);
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
-		}
 	}
 	mutex_unlock(&q->sysfs_lock);
@@ -2862,6 +2879,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_sys_init;
 
+	INIT_LIST_HEAD(&q->unused_hctx_list);
+	spin_lock_init(&q->unused_hctx_lock);
+
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d
[PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
Split blk_mq_alloc_and_init_hctx() into two parts: one is blk_mq_alloc_hctx() for allocating all hctx resources, the other is blk_mq_init_hctx() for initializing the hctx, which serves as the counterpart of blk_mq_exit_hctx().

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J. Bottomley
Reviewed-by: Hannes Reinecke
Tested-by: James Smart
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 138 -
 1 file changed, 77 insertions(+), 61 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d98cb9614dfa..44ecca6b0cac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	}
 }
 
+static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
+{
+	int hw_ctx_size = sizeof(struct blk_mq_hw_ctx);
+
+	BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu),
+			__alignof__(struct blk_mq_hw_ctx)) !=
+		     sizeof(struct blk_mq_hw_ctx));
+
+	if (tag_set->flags & BLK_MQ_F_BLOCKING)
+		hw_ctx_size += sizeof(struct srcu_struct);
+
+	return hw_ctx_size;
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
-	int node;
+	hctx->queue_num = hctx_idx;
 
-	node = hctx->numa_node;
+	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+
+	hctx->tags = set->tags[hctx_idx];
+
+	if (set->ops->init_hctx &&
+	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
+		goto unregister_cpu_notifier;
+
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx,
+				hctx->numa_node))
+		goto exit_hctx;
+	return 0;
+
+ exit_hctx:
+	if (set->ops->exit_hctx)
+		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_remove_cpuhp(hctx);
+	return -1;
+}
+
+static struct blk_mq_hw_ctx *
+blk_mq_alloc_hctx(struct request_queue *q,
+		struct blk_mq_tag_set *set,
+		unsigned hctx_idx, int node)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	hctx = kzalloc_node(blk_mq_hw_ctx_size(set),
+			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+			node);
+	if (!hctx)
+		goto fail_alloc_hctx;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask,
+				GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
+				node))
+		goto free_hctx;
+
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
 	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
+		hctx->numa_node = set->numa_node;
+	node = hctx->numa_node;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -2300,10 +2355,6 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
-	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
-
-	hctx->tags = set->tags[hctx_idx];
-
 	/*
 	 * Allocate space for all possible cpus to avoid allocation at
 	 * runtime
@@ -2311,47 +2362,38 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->ctxs = kmalloc_array_node(nr_cpu_ids, sizeof(void *),
 			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node);
 	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
+		goto free_cpumask;
 
 	if (sbitmap_init_node(&hctx->ctx_map, nr_cpu_ids, ilog2(8),
			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, node))
 		goto free_ctxs;
-
 	hctx->nr_ctx = 0;
 
 	spin_lock_init(&hctx->dispatch_wait_lock);
 	init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
 	INIT_LIST_HEAD(&hctx->dispatch_wait.entry);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
 	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size,
			GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY);
 	if (!hctx->fq)
-		goto exit_hctx;
-
-	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
-		goto free_fq;
+		goto free_bitmap;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
 		init_srcu_struct(hctx->srcu);
+	blk_mq_hctx_kobj_init(hctx);
 
-	return 0;
+	return hctx;
 
- free_fq:
-	blk_free_flush_queue(hctx->fq);
- exit_hctx:
-	if (set->ops->exit_hctx)
-		set->ops-
[PATCH V8 3/7] blk-mq: free hw queue's resource in hctx's release handler
Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2 ("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2, we were allowed to run the queue during queue cleanup if the queue's kobject refcount was held. After that commit, the queue can't be run during queue cleanup; otherwise an oops can be triggered easily, because some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways to address this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But these still can't cover all cases. Recently, James reported another such issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue is quite hard to address with the previous approach, given that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing hctx's resources in its release handler. This is safe because tags aren't needed for freeing such hctx resources. This approach follows the typical design pattern wrt. a kobject's release handler.

Cc: Dongli Zhang
Cc: James Smart
Cc: Bart Van Assche
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen
Cc: Christoph Hellwig
Cc: James E. J.
Bottomley , Reported-by: James Smart Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free") Cc: sta...@vger.kernel.org Reviewed-by: Hannes Reinecke Reviewed-by: Christoph Hellwig Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 2 +- block/blk-mq-sysfs.c | 6 ++ block/blk-mq.c | 8 ++-- block/blk-mq.h | 2 +- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 93dc588fabe2..2dd94b3e9ece 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -374,7 +374,7 @@ void blk_cleanup_queue(struct request_queue *q) blk_exit_queue(q); if (queue_is_mq(q)) - blk_mq_free_queue(q); + blk_mq_exit_queue(q); percpu_ref_exit(&q->q_usage_counter); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3f9c3f4ac44c..4040e62c3737 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -10,6 +10,7 @@ #include #include +#include "blk.h" #include "blk-mq.h" #include "blk-mq-tag.h" @@ -33,6 +34,11 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) { struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + + if (hctx->flags & BLK_MQ_F_BLOCKING) + cleanup_srcu_struct(hctx->srcu); + blk_free_flush_queue(hctx->fq); + sbitmap_free(&hctx->ctx_map); free_cpumask_var(hctx->cpumask); kfree(hctx->ctxs); kfree(hctx); diff --git a/block/blk-mq.c b/block/blk-mq.c index 89781309a108..d98cb9614dfa 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2267,12 +2267,7 @@ static void blk_mq_exit_hctx(struct request_queue *q, if (set->ops->exit_hctx) set->ops->exit_hctx(hctx, hctx_idx); - if (hctx->flags & BLK_MQ_F_BLOCKING) - cleanup_srcu_struct(hctx->srcu); - blk_mq_remove_cpuhp(hctx); - blk_free_flush_queue(hctx->fq); - sbitmap_free(&hctx->ctx_map); } static void blk_mq_exit_hw_queues(struct request_queue *q, @@ -2907,7 +2902,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, } EXPORT_SYMBOL(blk_mq_init_allocated_queue); -void blk_mq_free_queue(struct 
request_queue *q) +/* tags can _not_ be used after returning from blk_mq_exit_queue */ +void blk_mq_exit_queue(struct request_queue *q) { struct blk_mq_tag_set *set = q->tag_set; diff --git a/block/blk-mq.h b/block/blk-mq.h index 423ea88ab6fb..633a5a77ee8b 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -37,7 +37,7 @@ struct blk_mq_ctx { struct kobject kobj; } cacheline_aligned_in_smp; -void blk_mq_free_queue(struct request_queue *q); +void blk_mq_exit_queue(struct request_queue *q); int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool); -- 2.9.5
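The kobject release-handler pattern the patch relies on can be sketched outside the kernel as a refcounted object whose resources are freed only when the last reference is dropped. This is a hypothetical userspace mock (`mock_hctx`, `mock_hctx_put`, etc. are invented names), not kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace mock of the kobject release pattern:
 * resources are freed in the release handler, which runs only when
 * the last reference is dropped -- so whoever still holds a ref can
 * safely touch the object even after "cleanup" has begun. */
struct mock_hctx {
	int refcount;
	int *ctx_map;		/* stands in for sbitmap, flush queue, ... */
	int released;		/* observable marker for testing */
};

static void mock_hctx_release(struct mock_hctx *h)
{
	free(h->ctx_map);	/* freeing here needs no tags */
	h->ctx_map = NULL;
	h->released = 1;
}

static void mock_hctx_get(struct mock_hctx *h)
{
	h->refcount++;
}

static void mock_hctx_put(struct mock_hctx *h)
{
	if (--h->refcount == 0)
		mock_hctx_release(h);
}
```

Because the final put may happen long after blk_cleanup_queue() returns, freeing in the release handler is only safe for resources that, like these, don't depend on the tag set.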
[PATCH V8 0/7] blk-mq: fix races related with freeing queue
Hi, Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running a queue isn't allowed during queue cleanup even though the queue refcount is held. This change has caused lots of kernel oopses triggered in the run-queue path, and it turns out it isn't easy to fix them all. So move the freeing of hw queue resources into hctx's release handler; then the above issue is fixed. Meanwhile, this approach is safe given that freeing hw queue resources doesn't require tags. V3 covers more races. V8: - merge the 4th and 5th of V7 into one patch, as suggested by Christoph - drop the 9th patch V7: - add reviewed-by and tested-by tag - rename "dead_hctx" as "unused_hctx" - check if there are live hctx in queue's release handler - only patch 6 is modified V6: - remove previous SCSI patch which will be routed via SCSI tree - add reviewed-by tag - fix one related NVMe scan vs reset race V5: - refactor blk_mq_alloc_and_init_hctx() - fix race related to updating nr_hw_queues by always freeing hctx after request queue is released V4: - add patch for fixing potential use-after-free in blk_mq_update_nr_hw_queues - fix comment in the last patch V3: - cancel q->requeue_work in queue's release handler - cancel hctx->run_work in hctx's release handler - add patch 1 for fixing race in plug code path - the last patch is added for avoiding grabbing SCSI's refcount in IO path V2: - moving freeing hw queue resources into hctx's release handler Ming Lei (7): blk-mq: grab .q_usage_counter when queuing request from plug code path blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: free hw queue's resource in hctx's release handler blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: always free hctx after request queue is freed blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release block: don't drain in-progress dispatch in blk_cleanup_queue() block/blk-core.c | 23 +- block/blk-mq-sched.c | 9 +++ block/blk-mq-sysfs.c | 8 +++ block/blk-mq.c | 188 + block/blk-mq.h | 2 +- include/linux/blk-mq.h | 2 + 
include/linux/blkdev.h | 7 ++ 7 files changed, 139 insertions(+), 100 deletions(-) Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , -- 2.9.5
[PATCH V8 7/7] block: don't drain in-progress dispatch in blk_cleanup_queue()
Now that freeing of hw queue resources has been moved to hctx's release handler, we don't need to worry about the race between blk_cleanup_queue() and run queue any more. So don't drain in-progress dispatch in blk_cleanup_queue(). This is basically a revert of c2856ae2f315 ("blk-mq: quiesce queue before freeing queue"). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 12 ------------ 1 file changed, 12 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index f5b5f21ae4fd..e24cfcefdc19 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -338,18 +338,6 @@ void blk_cleanup_queue(struct request_queue *q) blk_queue_flag_set(QUEUE_FLAG_DEAD, q); - /* -* make sure all in-progress dispatch are completed because -* blk_freeze_queue() can only complete all requests, and -* dispatch may still be in-progress since we dispatch requests -* from more than one contexts. -* -* We rely on driver to deal with the race in case that queue -* initialization isn't done. -*/ - if (queue_is_mq(q) && blk_queue_init_done(q)) - blk_mq_quiesce_queue(q); - /* for synchronous bio-based driver finish in-flight integrity i/o */ blk_flush_integrity(); -- 2.9.5
[PATCH V8 6/7] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
hctx is always released after the request queue is freed. While the queue's kobject refcount is held, it is safe for a driver to run the queue, so a run-queue work item might be scheduled after blk_sync_queue() is done. So move the cancel of hctx->run_work into blk_mq_hw_sysfs_release() to avoid running a released queue. Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Bart Van Assche Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-core.c | 8 -------- block/blk-mq-sysfs.c | 2 ++ 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2dd94b3e9ece..f5b5f21ae4fd 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -232,14 +232,6 @@ void blk_sync_queue(struct request_queue *q) { del_timer_sync(&q->timeout); cancel_work_sync(&q->timeout_work); - - if (queue_is_mq(q)) { - struct blk_mq_hw_ctx *hctx; - int i; - - queue_for_each_hw_ctx(q, hctx, i) - cancel_delayed_work_sync(&hctx->run_work); - } } EXPORT_SYMBOL(blk_sync_queue); diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 4040e62c3737..25c0d0a6a556 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -35,6 +35,8 @@ static void blk_mq_hw_sysfs_release(struct kobject *kobj) struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx, kobj); + cancel_delayed_work_sync(&hctx->run_work); + if (hctx->flags & BLK_MQ_F_BLOCKING) cleanup_srcu_struct(hctx->srcu); blk_free_flush_queue(hctx->fq); -- 2.9.5
Re: [PATCH V8 1/7] blk-mq: grab .q_usage_counter when queuing request from plug code path
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
> +static struct blk_mq_hw_ctx * > +blk_mq_alloc_hctx(struct request_queue *q, > + struct blk_mq_tag_set *set, Nit: The second parameter would easily fit on the first line. > + unsigned hctx_idx, int node) > +{ > + struct blk_mq_hw_ctx *hctx; > + > + hctx = kzalloc_node(blk_mq_hw_ctx_size(set), > + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, > + node); > + if (!hctx) > + goto fail_alloc_hctx; > + > + if (!zalloc_cpumask_var_node(&hctx->cpumask, > + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, > + node)) Nit: I still think a local variable for the gfp_t would be very useful here. > + atomic_set(&hctx->nr_active, 0); > + hctx->numa_node = node; > if (node == NUMA_NO_NODE) > - node = hctx->numa_node = set->numa_node; > + hctx->numa_node = set->numa_node; > + node = hctx->numa_node; Why not: if (node == NUMA_NO_NODE) node = set->numa_node; hctx->numa_node = node; ? Otherwise looks fine: Reviewed-by: Christoph Hellwig
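Christoph's suggested rewrite is the usual fall-back-once idiom. A hypothetical standalone sketch (`resolve_numa_node` is an invented helper; `NUMA_NO_NODE` mirrors the kernel's definition of -1):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)	/* mirrors the kernel's definition */

/* Invented helper illustrating the suggested rewrite: fall back to
 * the tag set's node once, then use the single result for both
 * hctx->numa_node and the subsequent node-local allocations. */
static int resolve_numa_node(int node, int set_node)
{
	if (node == NUMA_NO_NODE)
		node = set_node;
	return node;
}
```

The point is that `node` is assigned exactly once, so the stored `numa_node` and the allocation node can never disagree.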
Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote: > In normal queue cleanup path, hctx is released after request queue > is freed, see blk_mq_release(). > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because > of hw queues shrinking. This can easily cause a use-after-free, > because: one implicit rule is that it is safe to call almost all block > layer APIs if the request queue is alive; and one hctx may be retrieved > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); > finally use-after-free is triggered. > > Fix this issue by always freeing hctx after releasing request queue. > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce > a per-queue list to hold them, then try to reuse these hctxs if numa > node is matched. This seems a little odd. Wouldn't it be much simpler to just keep the hctx where it is, that is leave the queue_hw_ctx[] pointer intact, but have a flag marking it dead?
Re: [PATCH V8 5/7] blk-mq: always free hctx after request queue is freed
On Sun, Apr 28, 2019 at 02:14:26PM +0200, Christoph Hellwig wrote: > On Sun, Apr 28, 2019 at 04:14:06PM +0800, Ming Lei wrote: > > In normal queue cleanup path, hctx is released after request queue > > is freed, see blk_mq_release(). > > > > However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because > > of hw queues shrinking. This can easily cause a use-after-free, > > because: one implicit rule is that it is safe to call almost all block > > layer APIs if the request queue is alive; and one hctx may be retrieved > > by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); > > finally use-after-free is triggered. > > > > Fix this issue by always freeing hctx after releasing request queue. > > If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce > > a per-queue list to hold them, then try to reuse these hctxs if numa > > node is matched. > > This seems a little odd. Wouldn't it be much simpler to just keep > the hctx where it is, that is leave the queue_hw_ctx[] pointer intact, > but have a flag marking it dead? There are several issues with that solution: 1) q->nr_hw_queues no longer matches the number of really active hw queues 2) if a hctx is only marked as dead and not freed, then after several rounds of updating nr_hw_queues, more and more hctx instances are wasted 3) q->queue_hw_ctx[] has to be re-allocated when nr_hw_queues is increased. So I think this patch is the simpler approach, both in concept and in implementation. Thanks, Ming
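The reuse scheme described above — keep removed hctxs on a per-queue list and hand one back when the NUMA node matches — can be sketched as follows (hypothetical userspace mock; `unused_hctx` and `reuse_hctx` are invented names):

```c
#include <assert.h>
#include <stddef.h>

struct unused_hctx {
	int numa_node;
	struct unused_hctx *next;
};

/* Pop a cached hctx whose node matches, or return NULL if none does;
 * a real caller would then allocate a fresh hctx on a miss. */
static struct unused_hctx *reuse_hctx(struct unused_hctx **list, int node)
{
	struct unused_hctx **pp;

	for (pp = list; *pp; pp = &(*pp)->next) {
		if ((*pp)->numa_node == node) {
			struct unused_hctx *h = *pp;

			*pp = h->next;	/* unlink from the cache list */
			h->next = NULL;
			return h;
		}
	}
	return NULL;
}
```

On a miss the caller allocates a fresh hctx, so the list never grows beyond what a previous, larger nr_hw_queues left behind.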
Re: [RFC PATCH 1/4] fdomain: Resurrect driver (core)
On Wednesday 24 April 2019 08:02:12 Christoph Hellwig wrote: > > +static void fdomain_work(struct work_struct *work) > > +{ > > + struct fdomain *fd = container_of(work, struct fdomain, work); > > + struct Scsi_Host *sh = container_of((void *)fd, struct Scsi_Host, > > + hostdata); > > This looks odd. We should never need a void cast for container_of. This cast is present in all drivers involving container_of, struct Scsi_Host and hostdata. hostdata in struct Scsi_Host is defined as "unsigned long hostdata[0]"... -- Ondrej Zary
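A minimal userspace mock shows the layout that makes this idiom work: the driver's private struct lives in the trailing hostdata array, so recovering the host is container_of on the hostdata member, with the (void *) cast papering over the unsigned long[] vs. private-struct type mismatch. Here `mock_host`, `mock_fdomain`, and this simplified `container_of` are invented for illustration; the kernel's container_of additionally type-checks ptr against the member, which is exactly what forces the cast in the driver:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified container_of -- the kernel version also type-checks
 * ptr against the member's declared type. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_host {			/* stands in for struct Scsi_Host */
	int host_no;
	unsigned long hostdata[];	/* driver-private area, as in the kernel */
};

struct mock_fdomain {			/* stands in for struct fdomain */
	int base;
};

static struct mock_host *host_from_priv(struct mock_fdomain *fd)
{
	/* fd points at hostdata but has a different type, hence (void *) */
	return container_of((void *)fd, struct mock_host, hostdata);
}
```

Since hostdata sits at a fixed offset inside the host structure, subtracting that offset from the private pointer always lands back on the enclosing host.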
[PATCH 0/4] fdomain: Resurrect driver (modular version)
Resurrect the previously removed fdomain driver, in modern style. Initialization is rewritten completely, with support for multiple cards and no more global state variables. Most of the code from the interrupt handler is moved to a workqueue. This is a modularized version with the core separated from the bus-specific drivers (PCI, ISA and PCMCIA). Only the PCI driver is tested for now. The other two could be dropped until they get tested. Changes since RFC: - multi-line comment style, some coding style - usage of scsi_k(un)map_atomic_sg for buffer access - static marking of fdomain_host_reset and fdomain_template - IRQ lookup moved to ISA bus driver - EXPORT_SYMBOL_GPL -- Ondrej Zary
[PATCH 3/4] fdomain: Resurrect driver (ISA support)
Future Domain 16xx ISA SCSI card support. Currently untested. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 14 +++ drivers/scsi/Makefile | 1 + drivers/scsi/fdomain_isa.c | 231 + 3 files changed, 246 insertions(+) create mode 100644 drivers/scsi/fdomain_isa.c diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index f9d058a07e2a..2ea77dafbc00 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -684,6 +684,20 @@ config SCSI_FDOMAIN_PCI To compile this driver as a module, choose M here: the module will be called fdomain_pci. +config SCSI_FDOMAIN_ISA + tristate "Future Domain 16xx ISA SCSI support" + depends on ISA && SCSI + select CHECK_SIGNATURE + select SCSI_FDOMAIN + help + This is support for Future Domain's 16-bit SCSI host adapters + (TMC-1660/1680, TMC-1650/1670, TMC-1610M/MER/MEX) and other adapters + with ISA bus based on the Future Domain chipsets (Quantum ISA-200S, + ISA-250MG; and at least one IBM board). + + To compile this driver as a module, choose M here: the + module will be called fdomain_isa. 
+ config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index f6cc4fbe6957..2e2c777d 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -78,6 +78,7 @@ obj-$(CONFIG_SCSI_ISCI) += isci/ obj-$(CONFIG_SCSI_IPS) += ips.o obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o obj-$(CONFIG_SCSI_FDOMAIN_PCI) += fdomain_pci.o +obj-$(CONFIG_SCSI_FDOMAIN_ISA) += fdomain_isa.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain_isa.c b/drivers/scsi/fdomain_isa.c new file mode 100644 index ..3008d5118d7a --- /dev/null +++ b/drivers/scsi/fdomain_isa.c @@ -0,0 +1,231 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include "fdomain.h" + +#define MAXBOARDS_PARAM 4 +static int io[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(io, int, ioport, NULL, 0); +MODULE_PARM_DESC(io, "base I/O address of controller (0x140, 0x150, 0x160, 0x170)"); + +static int irq[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(irq, int, irq, NULL, 0); +MODULE_PARM_DESC(irq, "IRQ of controller (0=auto [default])"); + +static int scsi_id[MAXBOARDS_PARAM] = { 0, 0, 0, 0 }; +module_param_hw_array(scsi_id, int, other, NULL, 0); +MODULE_PARM_DESC(scsi_id, "SCSI ID of controller (default = 6)"); + +static unsigned long addresses[] = { + 0xc8000, + 0xca000, + 0xce000, + 0xde000, +}; +#define ADDRESS_COUNT ARRAY_SIZE(addresses) + +static unsigned short ports[] = { 0x140, 0x150, 0x160, 0x170 }; +#define PORT_COUNT ARRAY_SIZE(ports) + +static unsigned short irqs[] = { 3, 5, 10, 11, 12, 14, 15, 0 }; + +/* This driver works *ONLY* for Future Domain cards using the TMC-1800, + * TMC-18C50, or TMC-18C30 chip. This includes models TMC-1650, 1660, 1670, + * and 1680. These are all 16-bit cards. 
+ * + * The following BIOS signatures are for boards which do *NOT* + * work with this driver (these TMC-8xx and TMC-9xx boards may work with the + * Seagate driver): + * + * FUTURE DOMAIN CORP. (C) 1986-1988 V4.0I 03/16/88 + * FUTURE DOMAIN CORP. (C) 1986-1989 V5.0C2/14/89 + * FUTURE DOMAIN CORP. (C) 1986-1989 V6.0A7/28/89 + * FUTURE DOMAIN CORP. (C) 1986-1990 V6.0105/31/90 + * FUTURE DOMAIN CORP. (C) 1986-1990 V6.0209/18/90 + * FUTURE DOMAIN CORP. (C) 1986-1990 V7.009/18/90 + * FUTURE DOMAIN CORP. (C) 1992 V8.00.004/02/92 + * + * (The cards which do *NOT* work are all 8-bit cards -- although some of + * them have a 16-bit form-factor, the upper 8-bits are used only for IRQs + * and are *NOT* used for data. You can tell the difference by following + * the tracings on the circuit board -- if only the IRQ lines are involved, + * you have an "8-bit" card, and should *NOT* use this driver.) + */ + +static struct signature signatures[] = { +/* 1 2 3 4 5 6 */ +/* 123456789012345678901234567890123456789012345678901234567890 */ +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89", 5, 50, 2, 0, 0 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V1.07/28/89", 5, 50, 2, 0, 0 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.07/28/89", 72, 50, 2, 0, 2 }, +{ "FUTURE DOMAIN CORP. (C) 1986-1990 1800-V2.0", 73, 43, 2, 0, 3 }, +{ "FUTURE DOMAIN CORP. (C) 1991 1800-V2.0.", 72, 39, 2, 0, 4 }, +{ "FUTURE DOMAIN CORP. (C) 1992 V3.00.004/02/92", 5, 44, 3, 0, 0 }, +{ "FUTURE DOMAIN TMC-18XX (C) 1993 V3.203/12/93", 5, 44, 3, 2, 0 }, +{ "IBM F1 P2 BIOS v1.0104/29/93",
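The signature table above drives a scan of candidate BIOS regions; the in-kernel driver uses check_signature() over ioremapped memory, but the comparison it performs can be sketched roughly like this (`match_bios` and this reduced `struct sig` are invented for illustration):

```c
#include <assert.h>
#include <string.h>

struct sig {
	const char *signature;
	int offset;	/* byte offset of the string within the BIOS image */
	int length;	/* number of bytes to compare */
};

/* Return the index of the first matching signature, or -1 if none
 * matches anywhere in the candidate BIOS image. */
static int match_bios(const unsigned char *bios, size_t bios_len,
		      const struct sig *sigs, int nsigs)
{
	int i;

	for (i = 0; i < nsigs; i++) {
		const struct sig *s = &sigs[i];

		if ((size_t)(s->offset + s->length) > bios_len)
			continue;	/* signature would run off the image */
		if (!memcmp(bios + s->offset, s->signature, s->length))
			return i;
	}
	return -1;
}
```

Each entry records where in the BIOS image the string lives and how many bytes to compare, which is what the offset/length columns in signatures[] encode.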
[PATCH 1/4] fdomain: Resurrect driver (core)
Future Domain TMC-16xx/TMC-3260 SCSI driver. This is the core driver, common for PCI, ISA and PCMCIA cards. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 4 + drivers/scsi/Makefile | 1 + drivers/scsi/fdomain.c | 600 + drivers/scsi/fdomain.h | 63 ++ 4 files changed, 668 insertions(+) create mode 100644 drivers/scsi/fdomain.c create mode 100644 drivers/scsi/fdomain.h diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index d528018e6fa8..3d6b1f47cbb5 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -663,6 +663,10 @@ config SCSI_DMX3191D To compile this driver as a module, choose M here: the module will be called dmx3191d. +config SCSI_FDOMAIN + tristate + depends on SCSI + config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index 8826111fdf4a..b8fbc6d2de54 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -76,6 +76,7 @@ obj-$(CONFIG_SCSI_AIC94XX)+= aic94xx/ obj-$(CONFIG_SCSI_PM8001) += pm8001/ obj-$(CONFIG_SCSI_ISCI)+= isci/ obj-$(CONFIG_SCSI_IPS) += ips.o +obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain.c b/drivers/scsi/fdomain.c new file mode 100644 index ..edf75c7f8e3f --- /dev/null +++ b/drivers/scsi/fdomain.c @@ -0,0 +1,600 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for Future Domain TMC-16x0 and TMC-3260 SCSI host adapters + * Copyright 2019 Ondrej Zary + * + * Original driver by + * Rickard E. 
Faith, fa...@cs.unc.edu + * + * Future Domain BIOS versions supported for autodetect: + *2.0, 3.0, 3.2, 3.4 (1.0), 3.5 (2.0), 3.6, 3.61 + * Chips supported: + *TMC-1800, TMC-18C50, TMC-18C30, TMC-36C70 + * Boards supported: + *Future Domain TMC-1650, TMC-1660, TMC-1670, TMC-1680, TMC-1610M/MER/MEX + *Future Domain TMC-3260 (PCI) + *Quantum ISA-200S, ISA-250MG + *Adaptec AHA-2920A (PCI) [BUT *NOT* AHA-2920C -- use aic7xxx instead] + *IBM ? + * + * NOTE: + * + * The Adaptec AHA-2920C has an Adaptec AIC-7850 chip on it. + * Use the aic7xxx driver for this board. + * + * The Adaptec AHA-2920A has a Future Domain chip on it, so this is the right + * driver for that card. Unfortunately, the boxes will probably just say + * "2920", so you'll have to look on the card for a Future Domain logo, or a + * letter after the 2920. + * + * If you have a TMC-8xx or TMC-9xx board, then this is not the driver for + * your board. + * + * DESCRIPTION: + * + * This is the Linux low-level SCSI driver for Future Domain TMC-1660/1680 + * TMC-1650/1670, and TMC-3260 SCSI host adapters. The 1650 and 1670 have a + * 25-pin external connector, whereas the 1660 and 1680 have a SCSI-2 50-pin + * high-density external connector. The 1670 and 1680 have floppy disk + * controllers built in. The TMC-3260 is a PCI bus card. + * + * Future Domain's older boards are based on the TMC-1800 chip, and this + * driver was originally written for a TMC-1680 board with the TMC-1800 chip. + * More recently, boards are being produced with the TMC-18C50 and TMC-18C30 + * chips. + * + * Please note that the drive ordering that Future Domain implemented in BIOS + * versions 3.4 and 3.5 is the opposite of the order (currently) used by the + * rest of the SCSI industry. + * + * + * REFERENCES USED: + * + * "TMC-1800 SCSI Chip Specification (FDC-1800T)", Future Domain Corporation, + * 1990. + * + * "Technical Reference Manual: 18C50 SCSI Host Adapter Chip", Future Domain + * Corporation, January 1992. 
+ * + * "LXT SCSI Products: Specifications and OEM Technical Manual (Revision + * B/September 1991)", Maxtor Corporation, 1991. + * + * "7213S product Manual (Revision P3)", Maxtor Corporation, 1992. + * + * "Draft Proposed American National Standard: Small Computer System + * Interface - 2 (SCSI-2)", Global Engineering Documents. (X3T9.2/86-109, + * revision 10h, October 17, 1991) + * + * Private communications, Drew Eckhardt (d...@cs.colorado.edu) and Eric + * Youngdale (er...@cais.com), 1992. + * + * Private communication, Tuong Le (Future Domain Engineering department), + * 1994. (Disk geometry computations for Future Domain BIOS version 3.4, and + * TMC-18C30 detection.) + * + * Hogan, Thom. The Programmer's PC Sourcebook. Microsoft Press, 1988. Page + * 60 (2.39: Disk Partition Table Layout). + * + * "18C30 Technical Reference Manual", Future Domain Corporation, 1993, page + * 6-1. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "fdomain.h" + +/* + * FIFO_COUNT: The host adapter has an 8K cache (host adapters based on the + * 18
[PATCH 4/4] fdomain: Resurrect driver (PCMCIA support)
Future Domain PCMCIA SCSI card support. Currently untested. Signed-off-by: Ondrej Zary --- drivers/scsi/pcmcia/Kconfig | 10 + drivers/scsi/pcmcia/Makefile | 1 + drivers/scsi/pcmcia/fdomain_cs.c | 89 3 files changed, 100 insertions(+) create mode 100644 drivers/scsi/pcmcia/fdomain_cs.c diff --git a/drivers/scsi/pcmcia/Kconfig b/drivers/scsi/pcmcia/Kconfig index 2d435f105b16..169d93f90a30 100644 --- a/drivers/scsi/pcmcia/Kconfig +++ b/drivers/scsi/pcmcia/Kconfig @@ -19,6 +19,16 @@ config PCMCIA_AHA152X To compile this driver as a module, choose M here: the module will be called aha152x_cs. +config PCMCIA_FDOMAIN + tristate "Future Domain PCMCIA support" + select SCSI_FDOMAIN + help + Say Y here if you intend to attach this type of PCMCIA SCSI host + adapter to your computer. + + To compile this driver as a module, choose M here: the + module will be called fdomain_cs. + config PCMCIA_NINJA_SCSI tristate "NinjaSCSI-3 / NinjaSCSI-32Bi (16bit) PCMCIA support" depends on !64BIT diff --git a/drivers/scsi/pcmcia/Makefile b/drivers/scsi/pcmcia/Makefile index a5a24dd44e7e..02f5b44a2685 100644 --- a/drivers/scsi/pcmcia/Makefile +++ b/drivers/scsi/pcmcia/Makefile @@ -4,6 +4,7 @@ ccflags-y := -I $(srctree)/drivers/scsi # 16-bit client drivers obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogic_cs.o +obj-$(CONFIG_PCMCIA_FDOMAIN) += fdomain_cs.o obj-$(CONFIG_PCMCIA_AHA152X) += aha152x_cs.o obj-$(CONFIG_PCMCIA_NINJA_SCSI)+= nsp_cs.o obj-$(CONFIG_PCMCIA_SYM53C500) += sym53c500_cs.o diff --git a/drivers/scsi/pcmcia/fdomain_cs.c b/drivers/scsi/pcmcia/fdomain_cs.c new file mode 100644 index ..eee1379ad67b --- /dev/null +++ b/drivers/scsi/pcmcia/fdomain_cs.c @@ -0,0 +1,89 @@ +// SPDX-License-Identifier: (GPL-2.0 OR MPL-1.1) +/* + * Driver for Future Domain-compatible PCMCIA SCSI cards + * Copyright 2019 Ondrej Zary + * + * The initial developer of the original code is David A. Hinds + * . Portions created by David A. Hinds + * are Copyright (C) 1999 David A. Hinds. All Rights Reserved. 
+ */ + +#include +#include +#include +#include +#include +#include "fdomain.h" + +MODULE_AUTHOR("Ondrej Zary, David Hinds"); +MODULE_DESCRIPTION("Future Domain PCMCIA SCSI driver"); +MODULE_LICENSE("Dual MPL/GPL"); + +static int fdomain_config_check(struct pcmcia_device *p_dev, void *priv_data) +{ + p_dev->io_lines = 10; + p_dev->resource[0]->end = 0x10; + p_dev->resource[0]->flags &= ~IO_DATA_PATH_WIDTH; + p_dev->resource[0]->flags |= IO_DATA_PATH_WIDTH_AUTO; + return pcmcia_request_io(p_dev); +} + +static int fdomain_probe(struct pcmcia_device *link) +{ + int ret; + struct Scsi_Host *sh; + + link->config_flags |= CONF_ENABLE_IRQ | CONF_AUTO_SET_IO; + link->config_regs = PRESENT_OPTION; + + ret = pcmcia_loop_config(link, fdomain_config_check, NULL); + if (ret) + return ret; + + ret = pcmcia_enable_device(link); + if (ret) + goto fail; + + sh = fdomain_create(link->resource[0]->start, link->irq, 7, 0, NULL, + &link->dev); + if (!sh) { + dev_err(&link->dev, "Controller initialization failed"); + ret = -ENODEV; + goto fail; + } + + link->priv = sh; + + return 0; + +fail: + pcmcia_disable_device(link); + return ret; +} + +static void fdomain_remove(struct pcmcia_device *link) +{ + fdomain_destroy(link->priv); + pcmcia_disable_device(link); +} + +static const struct pcmcia_device_id fdomain_ids[] = { + PCMCIA_DEVICE_PROD_ID12("IBM Corp.", "SCSI PCMCIA Card", 0xe3736c88, + 0x859cad20), + PCMCIA_DEVICE_PROD_ID1("SCSI PCMCIA Adapter Card", 0x8dacb57e), + PCMCIA_DEVICE_PROD_ID12(" SIMPLE TECHNOLOGY Corporation", + "SCSI PCMCIA Credit Card Controller", + 0x182bdafe, 0xc80d106f), + PCMCIA_DEVICE_NULL, +}; +MODULE_DEVICE_TABLE(pcmcia, fdomain_ids); + +static struct pcmcia_driver fdomain_cs_driver = { + .owner = THIS_MODULE, + .name = "fdomain_cs", + .probe = fdomain_probe, + .remove = fdomain_remove, + .id_table = fdomain_ids, +}; + +module_pcmcia_driver(fdomain_cs_driver); -- Ondrej Zary
[PATCH 2/4] fdomain: Resurrect driver (PCI support)
Future Domain TMC-3260/AHA-2920A PCI card support. Tested on Adaptec AHA-2920A PCI card. Signed-off-by: Ondrej Zary --- drivers/scsi/Kconfig | 17 drivers/scsi/Makefile | 1 + drivers/scsi/fdomain_pci.c | 68 ++ 3 files changed, 86 insertions(+) create mode 100644 drivers/scsi/fdomain_pci.c diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig index 3d6b1f47cbb5..f9d058a07e2a 100644 --- a/drivers/scsi/Kconfig +++ b/drivers/scsi/Kconfig @@ -667,6 +667,23 @@ config SCSI_FDOMAIN tristate depends on SCSI +config SCSI_FDOMAIN_PCI + tristate "Future Domain TMC-3260/AHA-2920A PCI SCSI support" + depends on PCI && SCSI + select SCSI_FDOMAIN + help + This is support for Future Domain's PCI SCSI host adapters (TMC-3260) + and other adapters with PCI bus based on the Future Domain chipsets + (Adaptec AHA-2920A). + + NOTE: Newer Adaptec AHA-2920C boards use the Adaptec AIC-7850 chip + and should use the aic7xxx driver ("Adaptec AIC7xxx chipset SCSI + controller support"). This Future Domain driver works with the older + Adaptec AHA-2920A boards with a Future Domain chip on them. + + To compile this driver as a module, choose M here: the + module will be called fdomain_pci. 
+ config SCSI_GDTH tristate "Intel/ICP (former GDT SCSI Disk Array) RAID Controller support" depends on PCI && SCSI diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile index b8fbc6d2de54..f6cc4fbe6957 100644 --- a/drivers/scsi/Makefile +++ b/drivers/scsi/Makefile @@ -77,6 +77,7 @@ obj-$(CONFIG_SCSI_PM8001) += pm8001/ obj-$(CONFIG_SCSI_ISCI)+= isci/ obj-$(CONFIG_SCSI_IPS) += ips.o obj-$(CONFIG_SCSI_FDOMAIN) += fdomain.o +obj-$(CONFIG_SCSI_FDOMAIN_PCI) += fdomain_pci.o obj-$(CONFIG_SCSI_GENERIC_NCR5380) += g_NCR5380.o obj-$(CONFIG_SCSI_QLOGIC_FAS) += qlogicfas408.o qlogicfas.o obj-$(CONFIG_PCMCIA_QLOGIC)+= qlogicfas408.o diff --git a/drivers/scsi/fdomain_pci.c b/drivers/scsi/fdomain_pci.c new file mode 100644 index ..381a7157c078 --- /dev/null +++ b/drivers/scsi/fdomain_pci.c @@ -0,0 +1,68 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include "fdomain.h" + +static int fdomain_pci_probe(struct pci_dev *pdev, +const struct pci_device_id *d) +{ + int err; + struct Scsi_Host *sh; + + err = pci_enable_device(pdev); + if (err) + goto fail; + + err = pci_request_regions(pdev, "fdomain_pci"); + if (err) + goto disable_device; + + err = -ENODEV; + if (pci_resource_len(pdev, 0) == 0) + goto release_region; + + sh = fdomain_create(pci_resource_start(pdev, 0), pdev->irq, 7, 0, NULL, + &pdev->dev); + if (!sh) + goto release_region; + + pci_set_drvdata(pdev, sh); + return 0; + +release_region: + pci_release_regions(pdev); +disable_device: + pci_disable_device(pdev); +fail: + return err; +} + +static void fdomain_pci_remove(struct pci_dev *pdev) +{ + struct Scsi_Host *sh = pci_get_drvdata(pdev); + + fdomain_destroy(sh); + pci_release_regions(pdev); + pci_disable_device(pdev); +} + +static struct pci_device_id fdomain_pci_table[] = { + { PCI_DEVICE(PCI_VENDOR_ID_FD, PCI_DEVICE_ID_FD_36C70) }, + {} +}; +MODULE_DEVICE_TABLE(pci, fdomain_pci_table); + +static struct pci_driver fdomain_pci_driver = { + .name = "fdomain_pci", + .id_table = fdomain_pci_table, + 
.probe = fdomain_pci_probe, + .remove = fdomain_pci_remove, + .driver.pm = FDOMAIN_PM_OPS, +}; + +module_pci_driver(fdomain_pci_driver); + +MODULE_AUTHOR("Ondrej Zary, Rickard E. Faith"); +MODULE_DESCRIPTION("Future Domain TMC-3260 PCI SCSI driver"); +MODULE_LICENSE("GPL"); -- Ondrej Zary
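fdomain_pci_probe() above follows the kernel's goto-unwind idiom: each failure jumps to a label that undoes only the steps already completed, in reverse order (enable → request regions → create, unwound as release → disable). A hypothetical userspace sketch of the same control flow (`do_step`, `undo_step`, and `probe` are invented):

```c
#include <assert.h>

static int step_done[3];

static int do_step(int i, int should_fail)
{
	if (should_fail)
		return -1;
	step_done[i] = 1;
	return 0;
}

static void undo_step(int i)
{
	step_done[i] = 0;
}

/* Mirrors the probe's shape: on failure at step N, unwind
 * steps N-1 .. 0 in reverse order and return the error. */
static int probe(int fail_at)
{
	int err;

	err = do_step(0, fail_at == 0);	/* like pci_enable_device() */
	if (err)
		goto fail;
	err = do_step(1, fail_at == 1);	/* like pci_request_regions() */
	if (err)
		goto undo0;
	err = do_step(2, fail_at == 2);	/* like fdomain_create() */
	if (err)
		goto undo1;
	return 0;

undo1:
	undo_step(1);
undo0:
	undo_step(0);
fail:
	return err;
}
```

Reading the tail of such a function bottom-up gives the teardown order, which is why the labels are conventionally named after what gets undone next.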
Re: [PATCH V8 4/7] blk-mq: split blk_mq_alloc_and_init_hctx into two parts
On 4/28/19 10:14 AM, Ming Lei wrote: Split blk_mq_alloc_and_init_hctx into two parts, and one is blk_mq_alloc_hctx() for allocating all hctx resources, another is blk_mq_init_hctx() for initializing hctx, which serves as counter-part of blk_mq_exit_hctx(). Cc: Dongli Zhang Cc: James Smart Cc: Bart Van Assche Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen , Cc: Christoph Hellwig , Cc: James E . J . Bottomley , Reviewed-by: Hannes Reinecke Tested-by: James Smart Signed-off-by: Ming Lei --- block/blk-mq.c | 138 - 1 file changed, 77 insertions(+), 61 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index d98cb9614dfa..44ecca6b0cac 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2284,15 +2284,70 @@ static void blk_mq_exit_hw_queues(struct request_queue *q, } } +static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set) +{ + int hw_ctx_size = sizeof(struct blk_mq_hw_ctx); + + BUILD_BUG_ON(ALIGN(offsetof(struct blk_mq_hw_ctx, srcu), + __alignof__(struct blk_mq_hw_ctx)) != +sizeof(struct blk_mq_hw_ctx)); + + if (tag_set->flags & BLK_MQ_F_BLOCKING) + hw_ctx_size += sizeof(struct srcu_struct); + + return hw_ctx_size; +} + static int blk_mq_init_hctx(struct request_queue *q, struct blk_mq_tag_set *set, struct blk_mq_hw_ctx *hctx, unsigned hctx_idx) { - int node; + hctx->queue_num = hctx_idx; - node = hctx->numa_node; + cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead); + + hctx->tags = set->tags[hctx_idx]; + + if (set->ops->init_hctx && + set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) + goto unregister_cpu_notifier; + + if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, + hctx->numa_node)) + goto exit_hctx; + return 0; + + exit_hctx: + if (set->ops->exit_hctx) + set->ops->exit_hctx(hctx, hctx_idx); + unregister_cpu_notifier: + blk_mq_remove_cpuhp(hctx); + return -1; +} + +static struct blk_mq_hw_ctx * +blk_mq_alloc_hctx(struct request_queue *q, + struct blk_mq_tag_set *set, + unsigned hctx_idx, int node) +{ + 
struct blk_mq_hw_ctx *hctx; + + hctx = kzalloc_node(blk_mq_hw_ctx_size(set), + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, + node); + if (!hctx) + goto fail_alloc_hctx; + + if (!zalloc_cpumask_var_node(&hctx->cpumask, + GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY, + node)) + goto free_hctx; + + atomic_set(&hctx->nr_active, 0); + hctx->numa_node = node; if (node == NUMA_NO_NODE) - node = hctx->numa_node = set->numa_node; + hctx->numa_node = set->numa_node; + node = hctx->numa_node; INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn); spin_lock_init(&hctx->lock); The 'hctx_idx' argument is now unused, and should be removed from the function definition. Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Mary Higgins, Sri Rasiah HRB 21284 (AG Nürnberg)