Srikar Dronamraju <sri...@linux.vnet.ibm.com> writes:

> * Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> [2020-08-02 19:51:41]:
>
>> Srikar Dronamraju <sri...@linux.vnet.ibm.com> writes:
>> > * Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com> [2020-07-31 16:49:14]:
>> >
>> > If it's just to eliminate node 0, then we have two other, probably
>> > better, solutions:
>> > 1. Don't mark node 0 as special (currently still in mm-tree and, as a
>> >    result, in linux-next)
>> > 2. powerpc specific: explicitly clear node 0 during NUMA bringup.
>>
>> I am not sure I consider them better. But yes, those patches are good
>> and also resolve the node 0 initialization when the firmware didn't
>> indicate the presence of such a node.
>>
>> In addition, this patch makes sure that we get the same topology report
>> across reboots on a virtualized partition as long as the cpu/memory
>> ratio per PowerVM domain remains the same. This should also help to
>> avoid confusion after an LPM migration once we start applying topology
>> updates.
>
> What do we mean by cpu/memory ratio? The topology across reboots would
> have changed only if PowerVM had allocated resources differently by
> scrambling/unscrambling. We no longer process topology updates at
> runtime. As far as I know, after LPM, the source topology is maintained.
An LPAR running with one NUMA node and 10GB of memory on PowerVM domain 10
will report node 10 and 10GB of memory in the current scheme. After an LPM
migration or a CEC shutdown/reboot, if the domain from which the resources
are allocated becomes 11, then the LPAR will report node 11 and 10GB of
memory. Having a logical node number means that in both the above cases we
report node 0 and 10GB of memory.

>
>> >> This can be resolved by mapping the firmware provided group id to a logical Linux
>> >> NUMA id. In this patch, we do this only for pseries platforms considering the
>> >
>> > On PowerVM, as you would know, the nid is already a logical or a
>> > flattened chip-id and not the actual hardware chip-id.
>>
>> Yes. But then they are derived based on PowerVM resources, aka domains.
>> Now, based on the available resources on a system, we could end up with
>> different node numbers for the same topology across reboots. Making them
>> logical at the OS level prevents that.
>
> The above statement kind of gives the impression that the topology changes
> across every reboot. We only end up with different node numbers if and
> only if the underlying topology has changed, and that case is very rare.
> Or am I missing something?

IIUC it also depends on the availability of resources within the domain at
the time of LPAR start.

>
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index e437a9ac4956..6c659aada55b 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>> >>  	}
>> >>  }
>> >>  
>> >> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>> >> +
>> >> +int firmware_group_id_to_nid(int firmware_gid)
>> >> +{
>> >> +	static int last_nid = 0;
>> >> +
>> >> +	/*
>> >> +	 * For PowerNV we don't change the node id. This helps to avoid
>> >> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
>> >> +	 * are virtualized. Hence do logical node id for pseries.
>> >> +	 */
>> >> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
>> >> +		return firmware_gid;
>> >> +
>> >> +	if (firmware_gid == -1)
>> >> +		return NUMA_NO_NODE;
>> >> +
>> >> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
>> >> +		nid_map[firmware_gid] = last_nid++;
>> >
>> > How do we ensure two simultaneous firmware_group_id_to_nid() calls
>> > don't end up at this place in parallel?
>>
>> Do we have a code path where we do that? All the node id init should
>> happen early and there should not be two cpus doing node init at the
>> same time. I might be mistaken. Can you point to the code path where you
>> expect this to be called in parallel?
>
> associativity_to_nid gets called the first time a cpu is being made
> present from offline. So it need not be in the boot path. We may need to
> verify whether cpu hotplug and dlpar operations are synchronized. For
> example, are a memory hot-add and a cpu hotplug synchronized? I am not
> sure if they are synchronized at this time.

But you don't online a cpu or memory to a nonexistent node post boot,
right? If the node exists, we have already initialized the nid_map.
However, I am not sure whether we do a parallel initialization of devices,
i.e., of_device_add getting called in parallel. If that can happen, then we
need the below:

@@ -226,6 +226,7 @@ static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
 int firmware_group_id_to_nid(int firmware_gid)
 {
 	static int last_nid = 0;
+	static DEFINE_SPINLOCK(node_id_lock);
 
 	/*
 	 * For PowerNV we don't change the node id. This helps to avoid
@@ -238,8 +239,13 @@ int firmware_group_id_to_nid(int firmware_gid)
 	if (firmware_gid == -1)
 		return NUMA_NO_NODE;
 
-	if (nid_map[firmware_gid] == NUMA_NO_NODE)
-		nid_map[firmware_gid] = last_nid++;
+	if (nid_map[firmware_gid] == NUMA_NO_NODE) {
+		spin_lock(&node_id_lock);
+		/* recheck with lock held */
+		if (nid_map[firmware_gid] == NUMA_NO_NODE)
+			nid_map[firmware_gid] = last_nid++;
+		spin_unlock(&node_id_lock);
+	}
 
 	return nid_map[firmware_gid];
 }

>
>> >
>> >> +
>> >> +	return nid_map[firmware_gid];
>> >> +}
>> >> +
>> >>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>> >>   * info is found.
>> >>   */
>> >>  static int associativity_to_nid(const __be32 *associativity)
>> >>  {
>> >>  	int nid = NUMA_NO_NODE;
>> >> +	int firmware_gid = -1;
>> >>  
>> >>  	if (!numa_enabled)
>> >>  		goto out;
>> >>  
>> >>  	if (of_read_number(associativity, 1) >= min_common_depth)
>> >> -		nid = of_read_number(&associativity[min_common_depth], 1);
>> >> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
>> >>  
>> >>  	/* POWER4 LPAR uses 0xffff as invalid node */
>> >> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
>> >> -		nid = NUMA_NO_NODE;
>> >> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
>> >> +		firmware_gid = -1;
>> >
>> > Let's assume two or more invocations of associativity_to_nid() for the
>> > same associativity end up with -1. In each case aren't we giving
>> > different nids?
>>
>> I didn't quite get the comment here. But I assume you are indicating the
>> same one you mentioned above?
>
> No, it's not related to the above comment.
> We are incrementing the nid_map table for every unique firmware_gid or for
> every -1 (aka invalid associativities). If a sufficiently large number of
> associativities end up being invalid, then don't we quickly overflow the
> nid_map table? It is not only about the overflow: an 8-node machine may
> soon look like an 80-node machine.

Not sure I follow. What does a large number of associativities imply? Are
you looking at ibm,associativity-lookup-arrays that have entries which are
invalid? Even there we are not parsing the full array; we look up only a
specific firmware_gid (in the case of lookup-arrays we use the aa_index
value from drmem_lmb).

I will also add a last_nid > MAX_NUMNODES check in
firmware_group_id_to_nid() to handle the case where we find more NUMA nodes
than MAX_NUMNODES in the device tree.

-aneesh
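For illustration, one possible (untested) shape of that check, reusing
last_nid, nid_map and node_id_lock from the locking diff above; exactly
where the bound is tested is an assumption, not something from the posted
patch:

	if (nid_map[firmware_gid] == NUMA_NO_NODE) {
		spin_lock(&node_id_lock);
		/* recheck with lock held */
		if (nid_map[firmware_gid] == NUMA_NO_NODE) {
			/* refuse to create more logical nodes than the kernel supports */
			if (last_nid >= MAX_NUMNODES) {
				spin_unlock(&node_id_lock);
				return NUMA_NO_NODE;
			}
			nid_map[firmware_gid] = last_nid++;
		}
		spin_unlock(&node_id_lock);
	}

	return nid_map[firmware_gid];

Returning NUMA_NO_NODE once last_nid reaches MAX_NUMNODES would keep the
function from handing out node numbers the kernel cannot represent and let
the caller fall back to its existing "no useful numa info" handling.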