With nr_hw_queues > 1, onlining or offlining a certain number of cpus changes the request_queue map in the blk-mq layer, and we see the kernel oops like this:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: [<ffffffff8128e2f2>] cpumask_set_cpu+0x6/0xd
PGD 6d957067 PUD 7604c067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: null_blk
CPU: 2 PID: 1926 Comm: bash Not tainted 4.3.0-rc2+ #24
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800724cd1c0 ti: ffff880070a2c000 task.ti: ffff880070a2c000
RIP: 0010:[<ffffffff8128e2f2>] [<ffffffff8128e2f2>] cpumask_set_cpu+0x6/0xd
RSP: 0018:ffff880070a2fbc8 EFLAGS: 00010203
RAX: ffff880073eedc00 RBX: ffff88006cc88000 RCX: ffff88006c06b000
RDX: 0000000000000007 RSI: 0000000000000080 RDI: 0000000000000008
RBP: ffff880070a2fbc8 R08: ffff88006c06ac00 R09: ffff88006c06ad48
R10: ffff880000004ea8 R11: ffff88006c069650 R12: ffff88007378fe28
R13: 0000000000000008 R14: ffffe8ffff500200 R15: ffffffff81d2a630
FS: 00007fa34803b700(0000) GS:ffff88007cc40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000080 CR3: 00000000761d2000 CR4: 00000000000006e0
Stack:
 ffff880070a2fc18 ffffffff8128edec 0000000000000000 ffff880073eedc00
 0000000000000039 ffff88006cc88000 0000000000000007 00000000ffffffe3
 ffffffff81cef2c0 0000000000000000 ffff880070a2fc38 ffffffff8129049a
Call Trace:
 [<ffffffff8128edec>] blk_mq_map_swqueue+0x9d/0x206
 [<ffffffff8129049a>] blk_mq_queue_reinit_notify+0xe3/0x144
 [<ffffffff8108b403>] notifier_call_chain+0x37/0x63
 [<ffffffff8108b48b>] __raw_notifier_call_chain+0xe/0x10
 [<ffffffff810729ea>] __cpu_notify+0x20/0x32
 [<ffffffff81072c24>] cpu_notify_nofail+0x13/0x1b
 [<ffffffff81073111>] _cpu_down+0x18a/0x264
 [<ffffffff811884ce>] ? path_put+0x1f/0x23
 [<ffffffff81073218>] cpu_down+0x2d/0x3a
 [<ffffffff813a9ad8>] cpu_subsys_offline+0x14/0x16
 [<ffffffff813a55c6>] device_offline+0x65/0x94
 [<ffffffff813a56b3>] online_store+0x48/0x68
 [<ffffffff811e0880>] ? kernfs_fop_write+0x6f/0x143
 [<ffffffff813a3046>] dev_attr_store+0x20/0x22
 [<ffffffff811e1037>] sysfs_kf_write+0x3c/0x3e
 [<ffffffff811e08fe>] kernfs_fop_write+0xed/0x143
 [<ffffffff8117fe0c>] __vfs_write+0x28/0xa6
 [<ffffffff8124b998>] ? security_file_permission+0x3c/0x44
 [<ffffffff810a5a1e>] ? percpu_down_read+0x21/0x42
 [<ffffffff81181ee5>] ? __sb_start_write+0x24/0x41
 [<ffffffff81180956>] vfs_write+0x8d/0xd1
 [<ffffffff81180b37>] SyS_write+0x59/0x83
 [<ffffffff816df46e>] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 03 75 06 65 48 ff 0a eb 1a f0 48 83 af 68 07 00 00 01 74 02 eb 0d 48 8d bf 68 07 00 00 ff 90 78 07 00 00 5d c3 55 89 ff 48 89 e5 <f0> 48 0f ab 3e 5d c3 0f 1f 44 00 00 55 8b 4e 44 31 d2 8b b7 94
RIP [<ffffffff8128e2f2>] cpumask_set_cpu+0x6/0xd
 RSP <ffff880070a2fbc8>
CR2: 0000000000000080

How to reproduce:
1. create an 80 vcpu guest with 10 cores of 8 threads each
2. modprobe null_blk submit_queues=64
3. for i in 72 73 74 75 76 77 78 79 ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done

Reason:
We try to set the already freed hctx->tags->cpumask in blk_mq_map_swqueue(). This was introduced by commit f26cdc8536ad ("blk-mq: Shared tag enhancements").

What is happening:
When a certain number of cpus are onlined/offlined, which results in blk_mq_update_queue_map(), we can end up with a new mapping to hctx. The subsequent blk_mq_map_swqueue() of the request_queue tries to set hctx->tags->cpumask, which was already freed by blk_mq_free_rq_map() in an earlier iteration when the hctx was not mapped.

Fix:
Set hctx->tags->cpumask only after blk_mq_init_rq_map() is done.

hctx->tags->cpumask also does not follow hctx->cpumask after a new mapping, even in cases where the new mapping does not cause a problem. That is fixed by this change as well.

This problem was originally found on powervm, which had 160 cpus (SMT8) and 128 nr_hw_queues. The dump was easily reproduced by offlining the last core, and it has been a blocker issue because cpu hotplug is a common case for DLPAR.
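
For illustration only, the ordering described above can be modeled in userspace. This is a rough sketch, not kernel code: toy_tags, toy_hctx, toy_set, toy_map_swqueue() and toy_hotplug_cycle() are made-up stand-ins, a cpumask is reduced to a 64-bit word, and calloc()/free() stand in for blk_mq_init_rq_map()/blk_mq_free_rq_map(). It assumes the remap has two passes: one that rebuilds hctx->cpumask and one that frees or (re)allocates tags per hw queue.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sizes and types; none of these exist in the kernel. */
#define TOY_NR_CPUS 8
#define TOY_NR_HW   4

struct toy_tags { uint64_t cpumask; };
struct toy_hctx { uint64_t cpumask; struct toy_tags *tags; };
struct toy_set  { struct toy_tags *tags[TOY_NR_HW]; };

/* Model of the remap with the fixed ordering; cpu_to_hw[cpu] is the hw queue. */
static int toy_map_swqueue(struct toy_set *set, struct toy_hctx *hctx,
			   const int *cpu_to_hw)
{
	int cpu, i;

	for (i = 0; i < TOY_NR_HW; i++)
		hctx[i].cpumask = 0;

	/*
	 * Pass 1: only the hctx's own mask is touched. Writing to
	 * hctx[..].tags->cpumask here would be unsafe, because tags may
	 * have been released by pass 2 of an earlier remap.
	 */
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		hctx[cpu_to_hw[cpu]].cpumask |= 1ULL << cpu;

	/* Pass 2: release tags of unmapped queues, allocate for mapped ones. */
	for (i = 0; i < TOY_NR_HW; i++) {
		if (!hctx[i].cpumask) {
			free(set->tags[i]);
			set->tags[i] = NULL;
			hctx[i].tags = NULL;
			continue;
		}
		if (!set->tags[i]) {
			set->tags[i] = calloc(1, sizeof(*set->tags[i]));
			if (!set->tags[i])
				return -1;
		}
		hctx[i].tags = set->tags[i];
		/* tags is valid here; a copy also drops stale bits. */
		hctx[i].tags->cpumask = hctx[i].cpumask;
	}
	return 0;
}

/* Map, unmap the last hw queue, map again; returns 1 if masks agree. */
int toy_hotplug_cycle(void)
{
	struct toy_set set = { { NULL } };
	struct toy_hctx hctx[TOY_NR_HW] = { { 0, NULL } };
	int all[TOY_NR_CPUS], fewer[TOY_NR_CPUS];
	int last = TOY_NR_HW - 1;
	int i, ok = 0;

	for (i = 0; i < TOY_NR_CPUS; i++) {
		all[i] = i % TOY_NR_HW;		/* every hw queue gets cpus */
		fewer[i] = i % (TOY_NR_HW - 1);	/* last hw queue gets none */
	}

	if (toy_map_swqueue(&set, hctx, all) ||
	    toy_map_swqueue(&set, hctx, fewer))
		goto out;
	assert(hctx[last].tags == NULL);	/* tags were released */

	if (toy_map_swqueue(&set, hctx, all))	/* queue is mapped again */
		goto out;

	ok = hctx[last].tags != NULL &&
	     hctx[last].tags->cpumask == hctx[last].cpumask &&
	     hctx[0].tags->cpumask == hctx[0].cpumask;
out:
	for (i = 0; i < TOY_NR_HW; i++)
		free(set.tags[i]);
	return ok;
}
```

Calling toy_hotplug_cycle() from a small main() exercises the sequence. Moving the tags->cpumask update back into pass 1 of this model would dereference a NULL tags pointer on the third remap, which is the kind of failure shown in the dump.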

Signed-off-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4..39a7834 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1811,7 +1811,6 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 
 		hctx = q->mq_ops->map_queue(q, i);
 		cpumask_set_cpu(i, hctx->cpumask);
-		cpumask_set_cpu(i, hctx->tags->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
 		hctx->ctxs[hctx->nr_ctx++] = ctx;
 	}
@@ -1836,6 +1835,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		if (!set->tags[i])
 			set->tags[i] = blk_mq_init_rq_map(set, i);
 		hctx->tags = set->tags[i];
+		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 		WARN_ON(!hctx->tags);
 
 		/*
-- 
1.7.11.7