On 05/24/2017 06:19 AM, Michael Ellerman wrote:
> Michael Bringmann <m...@linux.vnet.ibm.com> writes:
>
>> On 05/23/2017 04:49 PM, Reza Arbab wrote:
>>> On Tue, May 23, 2017 at 03:05:08PM -0500, Michael Bringmann wrote:
>>>> On 05/23/2017 10:52 AM, Reza Arbab wrote:
>>>>> On Tue, May 23, 2017 at 10:15:44AM -0500, Michael Bringmann wrote:
>>>>>> +static void setup_nodes(void)
>>>>>> +{
>>>>>> +    int i, l = 32 /* MAX_NUMNODES */;
>>>>>> +
>>>>>> +    for (i = 0; i < l; i++) {
>>>>>> +        if (!node_possible(i)) {
>>>>>> +            setup_node_data(i, 0, 0);
>>>>>> +            node_set(i, node_possible_map);
>>>>>> +        }
>>>>>> +    }
>>>>>> +}
>>>>>
>>>>> This seems to be a workaround for 3af229f2071f ("powerpc/numa: Reset
>>>>> node_possible_map to only node_online_map").
>>>>
>>>> They may be related, but that commit is not a replacement.  The above
>>>> patch ensures that there are enough of the nodes initialized at startup
>>>> to allow for memory hot-add into a node that was not used at boot.
>>>> (See 'setup_node_data' function in 'numa.c'.)  That and recording that
>>>> the node was initialized.
>>>
>>> Is it really necessary to preinitialize these empty nodes using
>>> setup_node_data()? When you do memory hotadd into a node that was not used
>>> at boot, the node data already gets set up by
>>>
>>> add_memory
>>>   add_memory_resource
>>>     hotadd_new_pgdat
>>>       arch_alloc_nodedata <-- allocs the pg_data_t
>>>       ...
>>>       free_area_init_node <-- sets NODE_DATA(nid)->node_id, etc.
>>>
>>> Removing setup_node_data() from that loop leaves only the call to
>>> node_set(). If 3af229f2071f (which reduces node_possible_map) was reverted,
>>> you wouldn't need to do that either.
>>
>> With or without 3af229f2071f, we would still need to add something,
>> somewhere to add new bits to the 'node_possible_map'.  That is not
>> being done.
>
> You mustn't add bits to the possible map after boot.
>
> That's its purpose, to tell you what nodes could ever *possibly* exist.

The problem that I have been encountering is that the 'possible map' did
*not* show all of the possible nodes.  Rather, it showed only the nodes
that were assigned memory at boot.  If more memory was later hot-added,
it could be assigned to one of the nodes that were skipped at boot.
However, nothing in the kernel memory code was updating the
'node_possible_map' to reflect that.

Reza pointed out a change relative to commit 3af229f2071f that has not
made it into the 4.12 checkout, i.e. removing the instruction that
reduces the node_possible_map.  This may well be a suitable replacement
for the code that I have here, and I will test it next.

>
> cheers
>

Later.

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:     (512) 466-0650
m...@linux.vnet.ibm.com