On 04/03/2019 20:59, Laurent Vivier wrote:
> When we hotplug a CPU in a memoryless/cpuless node,
> the kernel crashes when it rebuilds the sched_domains data.
>
> I reproduce this problem on POWER and with a pseries VM, with the following
> QEMU parameters:
>
>   -machine pseries -enable-kvm -m 8192 \
>   -smp 2,maxcpus=8,sockets=4,cores=2,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,mem=0 \
>   -numa node,nodeid=1,cpus=2-3,mem=8192 \
>   -numa node,nodeid=2,cpus=4-5,mem=0 \
>   -numa node,nodeid=3,cpus=6-7,mem=0
>
> Then I can trigger the crash by hotplugging a CPU on node-id 3:
>
>   (qemu) device_add host-spapr-cpu-core,core-id=7,node-id=3
>
> Built 2 zonelists, mobility grouping on.  Total pages: 130162
> Policy zone: Normal
> WARNING: workqueue cpumask: online intersect > possible intersect
> BUG: Kernel NULL pointer dereference at 0x00000400
> Faulting instruction address: 0xc000000000170edc
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT
> nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute
> bridge stp llc ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 iptable_mangle iptable_security iptable_raw ebtable_filter
> ebtables ip6table_filter ip6_tables iptable_filter xts vmx_crypto ip_tables
> xfs libcrc32c virtio_net net_failover failover virtio_blk virtio_pci
> virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
> CPU: 2 PID: 5661 Comm: kworker/2:0 Not tainted 5.0.0-rc6+ #20
> Workqueue: events cpuset_hotplug_workfn
> NIP:  c000000000170edc LR: c000000000170f98 CTR: 0000000000000000
> REGS: c000000003e931a0 TRAP: 0380   Not tainted  (5.0.0-rc6+)
> MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22284028  XER: 00000000
> CFAR: c000000000170f20 IRQMASK: 0
> GPR00: c000000000170f98 c000000003e93430 c0000000011ac500 c0000001efe22000
> GPR04: 0000000000000001 0000000000000000 0000000000000000 0000000000000010
> GPR08: 0000000000000001 0000000000000400 ffffffffffffffff 0000000000000000
> GPR12: 0000000000008800 c00000003fffd680 c0000001f14b0000 c0000000011e1bf0
> GPR16: c0000000011e61f4 c0000001efe22200 c0000001efe22020 c0000001fba80000
> GPR20: c0000001ff567a80 0000000000000001 c000000000e27a80 ffffffffffffe830
> GPR24: ffffffffffffec30 000000000000102f 000000000000102f c0000001efca1000
> GPR28: c0000001efca0400 c0000001efe22000 c0000001efe23bff c0000001efe22a00
> NIP [c000000000170edc] free_sched_groups+0x5c/0xf0
> LR [c000000000170f98] destroy_sched_domain+0x28/0x90
> Call Trace:
> [c000000003e93430] [000000000000102f] 0x102f (unreliable)
> [c000000003e93470] [c000000000170f98] destroy_sched_domain+0x28/0x90
> [c000000003e934a0] [c0000000001716e0] cpu_attach_domain+0x100/0x920
> [c000000003e93600] [c000000000173128] build_sched_domains+0x1228/0x1370
> [c000000003e93740] [c00000000017429c] partition_sched_domains+0x23c/0x400
> [c000000003e937e0] [c0000000001f5ec8] rebuild_sched_domains_locked+0x78/0xe0
> [c000000003e93820] [c0000000001f9ff0] rebuild_sched_domains+0x30/0x50
> [c000000003e93850] [c0000000001fa1c0] cpuset_hotplug_workfn+0x1b0/0xb70
> [c000000003e93c80] [c00000000012e5a0] process_one_work+0x1b0/0x480
> [c000000003e93d20] [c00000000012e8f8] worker_thread+0x88/0x540
> [c000000003e93db0] [c00000000013714c] kthread+0x15c/0x1a0
> [c000000003e93e20] [c00000000000b55c] ret_from_kernel_thread+0x5c/0x80
> Instruction dump:
> 2e240000 f8010010 f821ffc1 409e0014 48000080 7fbdf040 7fdff378 419e0074
> ebdf0000 4192002c e93f0010 7c0004ac <7d404828> 314affff 7d40492d 40c2fff4
> ---[ end trace f992c4a7d47d602a ]---
>
> Kernel panic - not syncing: Fatal exception
>
> This happens in free_sched_groups() because the linked list of the
> sched_groups is corrupted. Here is what happens when we hotplug the CPU:
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D1, with only one entry A, refcount=1
>
>   D1: A(ref=1)
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D2, with the same entry A
>
>   D2: A(ref=2)
>
> - build_sched_groups() builds a sched_groups linked list for
>   sched_domain D3, with the same entry A and a new entry B:
>
>   D3: A(ref=3) -> B(ref=1)
>
> - destroy_sched_domain() is called for D1:
>
>   D1: A(ref=3) -> B(ref=1)
>
>   Since B's refcount is 1 it drops to zero and B's memory is released,
>   but A->next still points to B.
>
> - destroy_sched_domain() is called for D3:
>
>   D3: A(ref=2) -> B(ref=0)
>
> The kernel crashes when it tries to use the data inside B: the memory has
> been freed, so its contents (including the next pointer of the linked list)
> are no longer valid.
>
> This problem appears with commit 051f3ca02e46
> ("sched/topology: Introduce NUMA identity node sched domain").
>
> If I compare the sequence of function calls before and after this commit,
> I can see that in the working case build_overlap_sched_groups() is called
> instead of build_sched_groups(); there, all the reference counters have the
> same value and the linked list can be freed correctly.
> The commit in question introduced the node domain, and on powerpc the node
> domains can overlap, which should never happen.
>
> This happens because the powerpc code initially computes the
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (the firmware doesn't provide distance information for
> memoryless/cpuless nodes):
>
>   node   0   1   2   3
>     0:  10  40  10  10
>     1:  40  10  40  40
>     2:  10  40  10  10
>     3:  10  40  10  10
>
> We should have:
>
>   node   0   1   2   3
>     0:  10  40  40  40
>     1:  40  10  40  40
>     2:  40  40  10  40
>     3:  40  40  40  10
>
> Once a new CPU is added and its node is onlined, the NUMA masks are
> updated, but the bits set initially are not cleared. This explains why
> nodes can overlap.
>
> This patch changes the initialization code so that it does not set up the
> distances of offline nodes. The distances are updated when a node becomes
> online (on CPU hotplug), as is already done today.
>
> This patch has been tested on powerpc but not on the other architectures.
> They are affected because the modified part is in common code.
> All comments are welcome (how to move the change into powerpc-specific
> code, or whether the other architectures can work with this change).
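
To make the list corruption described above easier to follow, here is a
minimal user-space sketch (plain C, simplified structures and hypothetical
names, not the kernel code) of how the shared group list ends up with a
dangling next pointer:

  /*
   * Three "domains" D1, D2 and D3 share a circular list of groups.  A group
   * is freed as soon as its own refcount reaches zero, even though another
   * list still reaches it through a 'next' pointer.
   */
  #include <stdio.h>
  #include <stdlib.h>

  struct group {
          int ref;
          struct group *next;
  };

  static void put_groups(struct group *first)
  {
          struct group *g = first, *tmp;

          do {
                  tmp = g->next;          /* read 'next' before a possible free */
                  if (--g->ref == 0)
                          free(g);
                  g = tmp;
          } while (g != first);
  }

  int main(void)
  {
          struct group *a = calloc(1, sizeof(*a));   /* entry A */
          struct group *b = calloc(1, sizeof(*b));   /* entry B */

          a->next = b;
          b->next = a;                   /* circular list A -> B -> A */

          a->ref = 3;                    /* referenced by D1, D2 and D3 */
          b->ref = 1;                    /* referenced by D3 only */

          put_groups(a);                 /* destroy D1: B drops to 0, is freed */

          /* destroying D3 would now follow A->next into freed memory */
          printf("A: ref=%d, next=%p (dangling, B has been freed)\n",
                 a->ref, (void *)a->next);
          return 0;
  }
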
>
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Cc: Suravee Suthikulpanit <suravee.suthikulpa...@amd.com>
> Cc: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> Cc: Borislav Petkov <b...@suse.de>
> Cc: David Gibson <da...@gibson.dropbear.id.au>
> Cc: Michael Ellerman <m...@ellerman.id.au>
> Cc: Nathan Fontenot <nf...@linux.vnet.ibm.com>
> Cc: Michael Bringmann <m...@linux.vnet.ibm.com>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Ingo Molnar <mi...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> ---
>
> Notes:
>     v3: fix the root cause of the problem (sched numa mask initialization)
>     v2: add scheduler maintainers in the CC: list
>
>  kernel/sched/topology.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
>  				return;
>  
>  			sched_domains_numa_masks[i][j] = mask;
> +			if (!node_state(j, N_ONLINE))
> +				continue;
>  
> -			for_each_node(k) {
> +			for_each_online_node(k) {
>  				if (node_distance(j, k) > sched_domains_numa_distance[i])
>  					continue;
>  
>
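
To illustrate the effect of the hunk above, here is a small stand-alone
sketch (plain C, values taken from the two distance tables quoted earlier,
nothing kernel-specific) that mimics the mask-building loop for the first
distance level only: for each node j it collects every node k whose distance
is within the local distance (10). With the broken table, nodes 0, 2 and 3
end up in each other's masks; with the expected table, every mask contains
only its own node, so nothing overlaps.

  #include <stdio.h>

  #define NODES          4
  #define LOCAL_DISTANCE 10

  /* distances as reported when nodes 2 and 3 are offline (broken) */
  static const int broken[NODES][NODES] = {
          { 10, 40, 10, 10 },
          { 40, 10, 40, 40 },
          { 10, 40, 10, 10 },
          { 10, 40, 10, 10 },
  };

  /* what the table should look like (expected) */
  static const int expected[NODES][NODES] = {
          { 10, 40, 40, 40 },
          { 40, 10, 40, 40 },
          { 40, 40, 10, 40 },
          { 40, 40, 40, 10 },
  };

  static void show(const char *name, const int dist[NODES][NODES])
  {
          printf("%s:\n", name);
          for (int j = 0; j < NODES; j++) {
                  unsigned int mask = 0;   /* bit k set: node k is local to j */

                  for (int k = 0; k < NODES; k++)
                          if (dist[j][k] <= LOCAL_DISTANCE)
                                  mask |= 1u << k;
                  printf("  node %d mask: 0x%x\n", j, mask);
          }
  }

  int main(void)
  {
          show("broken table (masks overlap)", broken);
          show("expected table (masks disjoint)", expected);
          return 0;
  }
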
Another way to avoid the overlap of the offline nodes at startup is to
ensure the default values don't define a distance that merges all offline
nodes into node 0 (the distance_lookup_table entries of memoryless/cpuless
nodes are never filled from the device tree, so they keep their default
value and, as far as I can tell, end up looking local to node 0).

A powerpc-specific patch can work around the kernel crash by doing this:

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 87f0dd0..3ba29bb 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -623,6 +623,7 @@ static int __init parse_numa_properties(void)
 	struct device_node *memory;
 	int default_nid = 0;
 	unsigned long i;
+	int nid, dist;
 
 	if (numa_enabled == 0) {
 		printk(KERN_WARNING "NUMA disabled by user\n");
@@ -636,6 +637,10 @@ static int __init parse_numa_properties(void)
 
 	dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
 
+	for (nid = 0; nid < MAX_NUMNODES; nid++)
+		for (dist = 0; dist < MAX_DISTANCE_REF_POINTS; dist++)
+			distance_lookup_table[nid][dist] = nid;
+
 	/*
 	 * Even though we connect cpus to numa domains later in SMP
 	 * init, we need to know the node ids now. This is because

Any comment? If this is not the right way to do it, does someone have a
better idea of how to fix the kernel crash?

Thanks,
Laurent