https://bugzilla.kernel.org/show_bug.cgi?id=202511
--- Comment #22 from Barret Rhoden (b...@google.com) ---

As far as the bisection goes: in the original bisect report, both of the commits that were merged were good, i.e. 3a3869f1 merged two good commits, 3036bc45364f and 488ad6d3678b. The system was bad only when the two were combined. Given that it looks like a failure to do a percpu alloc, that makes sense: both branches could have had some change that, when combined, exhausted a resource.

From the error messages, these are 'reserved' percpu allocs:

[ 4.176816] percpu: allocation failed, size=8192 align=4096 atomic=0, alloc from reserved chunk failed

The reserved space is rather small. From the early output, we can see it's only 8KB (the r8192):

[ 0.000000] percpu: Embedded 54 pages/cpu @(____ptrval____) s181144 r8192 d31848 u262144

The alloc that failed was 8192 bytes, which is the entire reserved space, so my guess is that another alloc had already been made, leaving too little room for the 8192 alloc.

It seems a little odd that there isn't enough reserved percpu space - or rather that someone is grabbing more space than they should. These reserved allocs are only made by modules, and then only by modules that use percpu data (if I'm reading kernel/module.c right). The default amount of 8192 is PERCPU_MODULE_RESERVE, which hasn't changed in years.

I'd be curious who else is making reserved per_cpu allocations, regardless of failure. In kernel/module.c L652, we only print on failure. If you print on every allocation, particularly the mod name, then we might know who the other one is. Maybe there are a bunch of benign small allocs, or maybe there's another 8192 out there.

Regardless, I'd guess the main culprits are the amdkfd and drm modules, each asking for 8192 out of a total 8192.

I built with your 4.18 config from the merge commit.
Both amdkfd and drm have percpu sections, e.g.:

$ objdump -h drivers/gpu/drm/amd/amdkfd/amdkfd.ko
.data..percpu 00002000 0000000000000000 0000000000000000 00022000 2**12

That matches the size (0x2000 = 8192) and alignment (2^12 = 4096) that we saw in the allocation message. drm.ko has something similar for its section.

Looking at amdkfd.ko (objdump -D), it has:

Disassembly of section .data..percpu:
0000000000000000 <kfd_processes_srcu_srcu_data>:

and drm.ko has:

Disassembly of section .data..percpu:
0000000000000000 <drm_unplug_srcu_srcu_data>:

Those look like these two:

drivers/gpu/drm/amd/amdkfd/kfd_process.c:DEFINE_SRCU(kfd_processes_srcu);
drivers/gpu/drm/drm_drv.c:DEFINE_STATIC_SRCU(drm_unplug_srcu);

Those SRCU macros expand to a DEFINE_PER_CPU(struct srcu_data), though the struct doesn't look huge. I'm not sure why its percpu section blows up to 8192.

The amdkfd one was added in 64d1c3a43a6f ("drm/amdkfd: Centralize IOMMUv2 code and make it conditional"), and the DRM one was added in bee330f3d672 ("drm: Use srcu to protect drm_device.unplugged"). Both are relatively recent.

It doesn't look like there are a lot of drivers that use SRCU with those macros, so maybe that's something drm and amdkfd shouldn't be doing? Either that, or maybe something is wrong that causes the SRCU percpu structure to get so large?

Also, if these are the culprits, it's not clear why they would have been working before the bisection point. If one of them succeeded, then the other should have failed (given they both try to alloc 8192 out of a total 8192 reservation). So maybe I'm missing something.