On 16.09.20 16:39, Scott Cheloha wrote:
> On Wed, Sep 16, 2020 at 09:39:53AM +0200, David Hildenbrand wrote:
>> On 15.09.20 21:46, Scott Cheloha wrote:
>>> During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
>>> to determine which node id (nid) to use when later calling __add_memory().
>>>
>>> This is wasteful. On pseries, memory_add_physaddr_to_nid() finds an
>>> appropriate nid for a given address by looking up the LMB containing the
>>> address and then passing that LMB to of_drconf_to_nid_single() to get the
>>> nid. In dlpar_add_lmb() we get this address from the LMB itself.
>>>
>>> In short, we have a pointer to an LMB and then we are searching for
>>> that LMB *again* in order to find its nid.
>>>
>>> If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
>>> can skip the redundant lookup. The only error handling we need to
>>> duplicate from memory_add_physaddr_to_nid() is the fallback to the
>>> default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
>>> an invalid nid.
>>>
>>> Skipping the extra lookup makes hot-add operations faster, especially
>>> on machines with many LMBs.
>>>
>>> Consider an LPAR with 126976 LMBs. In one test, hot-adding 126000
>>> LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
>>> completed the same operation in ~2 hours:
>>>
>>> Unpatched (12450 seconds):
>>> Sep 9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
>>> Sep 9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
>>> [...]
>>> Sep 9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
>>>
>>> Patched (7065 seconds):
>>> Sep 8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
>>> Sep 8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
>>> [...]
>>> Sep 8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
>>>
>>> It should be noted that the speedup grows more substantial when
>>> hot-adding LMBs at the end of the drconf range. This is because we
>>> are skipping a linear LMB search.
>>>
>>> To see the distinction, consider a smaller hot-add test on the same
>>> LPAR.
>>> A perf-stat run with 10 iterations showed that hot-adding 4096 LMBs
>>> completed less than 1 second faster on a patched kernel:
>>>
>>> Unpatched:
>>>  Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
>>>
>>>         104,753.42 msec task-clock              #    0.992 CPUs utilized            ( +-  0.55% )
>>>              4,708      context-switches        #    0.045 K/sec                    ( +-  0.69% )
>>>              2,444      cpu-migrations          #    0.023 K/sec                    ( +-  1.25% )
>>>                394      page-faults             #    0.004 K/sec                    ( +-  0.22% )
>>>    445,902,503,057      cycles                  #    4.257 GHz                      ( +-  0.55% )  (66.67%)
>>>      8,558,376,740      stalled-cycles-frontend #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
>>>    300,346,181,651      stalled-cycles-backend  #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
>>>    258,091,488,691      instructions            #    0.58  insn per cycle
>>>                                                 #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
>>>     70,568,169,256      branches                #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
>>>      3,100,725,426      branch-misses           #    4.39% of all branches          ( +-  0.20% )  (49.99%)
>>>
>>>            105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
>>>
>>> Patched:
>>>  Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
>>>
>>>         104,055.69 msec task-clock              #    0.993 CPUs utilized            ( +-  0.32% )
>>>              4,606      context-switches        #    0.044 K/sec                    ( +-  0.20% )
>>>              2,463      cpu-migrations          #    0.024 K/sec                    ( +-  0.93% )
>>>                394      page-faults             #    0.004 K/sec                    ( +-  0.25% )
>>>    442,951,129,921      cycles                  #    4.257 GHz                      ( +-  0.32% )  (66.66%)
>>>      8,710,413,329      stalled-cycles-frontend #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
>>>    299,656,905,836      stalled-cycles-backend  #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
>>>    252,731,168,193      instructions            #    0.57  insn per cycle
>>>                                                 #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
>>>     68,902,851,121      branches                #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
>>>      3,100,242,882      branch-misses           #    4.50% of all branches          ( +-  0.15% )  (49.98%)
>>>
>>>            104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
>>>
>>> This is consistent. An add-by-count hot-add operation adds LMBs
>>> greedily, so LMBs near the start of the drconf range are considered
>>> first. On an otherwise idle LPAR with so many LMBs we would expect to
>>> find the LMBs we need near the start of the drconf range, hence the
>>> smaller speedup.
>>>
>>> Signed-off-by: Scott Cheloha <chel...@linux.ibm.com>
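
Not part of the patch, but just to check my reading of the description:
the hot-add path would end up looking roughly like the sketch below. The
fallback is my guess based on what memory_add_physaddr_to_nid() does on
pseries - something like first_online_node when the lookup yields
NUMA_NO_NODE or an impossible node - so the surrounding names
(node_possible(), first_online_node, block_sz) are assumptions on my
side, not quotes from the patch:

    static int dlpar_add_lmb(struct drmem_lmb *lmb)
    {
            unsigned long block_sz = memory_block_size_bytes();
            int nid, rc;
            ...
            /* Ask the LMB we already hold for its nid directly. */
            nid = of_drconf_to_nid_single(lmb);

            /* Fall back to a default nid, as memory_add_physaddr_to_nid() would. */
            if (nid < 0 || !node_possible(nid))
                    nid = first_online_node;

            /* Add the memory block as before. */
            rc = __add_memory(nid, lmb->base_addr, block_sz);
            ...
    }

The win being that of_drconf_to_nid_single() is handed the LMB we already
have, instead of memory_add_physaddr_to_nid() doing a linear search over
all LMBs just to find that same LMB again.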
>>
>> Hi Scott,
>>
>> IIRC, ppc DLPAR does a single add_memory() [...]
>
> Yes.
>
>> [...] for each LMB (16 MB).
>
> The block size is set by the hypervisor. The default is 256MB. In
> this test I had a block size of 256MB.

Oh, I wasn't aware that it's configurable, thanks for pointing that out
(I missed the custom memory_block_size_bytes() implementation).

I wonder how that interacts with pseries_remove_memblock(), which uses
MIN_MEMORY_BLOCK_SIZE with __remove_memory(): that will always trip
BUG_ON(check_hotplug_memory_range(start, size)) in try_remove_memory()
whenever the size is smaller than memory_block_size_bytes(). Maybe that
path is simply never taken on such machines ...

> On multi-terabyte machines I would effectively always expect a block
> size of 256MB. 16MB blocks are supported, but they are not the default
> setting, so they are increasingly rare.
>
>> With tons of LMBs, this will also make /proc/iomem explode in size
>> (it is backed by a list-based tree), making traversal significantly
>> slower, e.g., on insertions and System RAM walks.
>>
>> I was wondering if you would get another performance boost under ppc
>> when using MEMHP_MERGE_RESOURCE [1]. AFAIK, the resource boundaries
>> are not of interest. No guarantees, but it might be worth a try.
>
> I'll give it a shot.
>
>> Did you investigate what else makes memory hotplug that slow? (126000
>> LMBs correspond to roughly 2TB, that shouldn't take 2 hours ...)
>
> It was about ~31TB in 256MB blocks. It's a worst-case test (add all
> the memory), but I'm pretty happy with a 1.5 hour improvement :)

Yeah, definitely :)

-- 
Thanks,

David / dhildenb