Hi Aneesh,

"Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
> "Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
>> On 8/8/20 2:15 AM, Nathan Lynch wrote:
>>> "Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
>>>> On 8/7/20 9:54 AM, Nathan Lynch wrote:
>>>>> "Aneesh Kumar K.V" <aneesh.ku...@linux.ibm.com> writes:
>>>>>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>>>>>> index e437a9ac4956..6c659aada55b 100644
>>>>>> --- a/arch/powerpc/mm/numa.c
>>>>>> +++ b/arch/powerpc/mm/numa.c
>>>>>> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>>>>>>  	}
>>>>>>  }
>>>>>>
>>>>>> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
>>>>>
>>>>> It's odd to me to use MAX_NUMNODES for this array when it's going to be
>>>>> indexed not by Linux's logical node IDs but by the platform-provided
>>>>> domain number, which has no relation to MAX_NUMNODES.
>>>>
>>>> I didn't want to dynamically allocate this. We could fetch
>>>> "ibm,max-associativity-domains" to find the size for that. The current
>>>> code does assume the firmware group id will not exceed MAX_NUMNODES,
>>>> hence the array size of MAX_NUMNODES. I do agree that it is confusing.
>>>> Maybe we can do #define MAX_AFFINITY_DOMAIN MAX_NUMNODES?
>>>
>>> Well, consider:
>>>
>>> - ibm,max-associativity-domains can change at runtime with LPM. This
>>>   doesn't happen in practice yet, but we should probably start thinking
>>>   about how to support that.
>>> - The domain numbering isn't clearly specified to have any particular
>>>   properties such as beginning at zero or a contiguous range.
>>>
>>> While the current code likely contains assumptions contrary to these
>>> points, a change such as this is an opportunity to think about whether
>>> those assumptions can be reduced or removed. In particular I think it
>>> would be good to gracefully degrade when the number of NUMA affinity
>>> domains can exceed MAX_NUMNODES. Using the platform-supplied domain
>>> numbers to directly index Linux data structures will make that
>>> impossible.
>>>
>>> So, maybe genradix or even xarray wouldn't actually be overengineering
>>> here.
>>
>> One of the challenges with such a data structure is that we initialize
>> the nid_map before the slab is available. This means a memblock based
>> allocation, and we would end up implementing such a sparse data
>> structure ourselves here.
Yes, good point.

>> As you mentioned above, since we know that the hypervisor as of now
>> limits the max affinity domain id below ibm,max-associativity-domains,
>> we are good with the array-like nid_map we have here. This keeps the
>> code simpler.
>>
>> This will also allow us to switch in the future to the more sparse
>> data structure you requested here, because the main change pushed in
>> this series is the usage of firmware_group_id_to_nid(). The details of
>> the data structure we use to keep track of that mapping are pretty
>> much internal to that function.
>
> How about this? This makes it not a direct index, but it does limit
> the search to the max NUMA node on the system.
>
> static int domain_id_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = -1 };
>
> static int __affinity_domain_to_nid(int domain_id, int max_nid)
> {
> 	int i;
>
> 	for (i = 0; i < max_nid; i++) {
> 		if (domain_id_map[i] == domain_id)
> 			return i;
> 	}
> 	return NUMA_NO_NODE;
> }

OK, this indexes the array by Linux's node id, good. I was wondering if
I could persuade you to flip it around like this :-)

Walking through the code below:

> int affinity_domain_to_nid(struct affinity_domain *domain)
> {
> 	int nid, domain_id;
> 	static int last_nid = 0;
> 	static DEFINE_SPINLOCK(node_id_lock);
>
> 	domain_id = domain->id;
> 	/*
> 	 * For PowerNV we don't change the node id. This helps to avoid
> 	 * confusion w.r.t the expected node ids. On pseries, node numbers
> 	 * are virtualized. Hence do logical node id for pseries.
> 	 */
> 	if (!firmware_has_feature(FW_FEATURE_LPAR))
> 		return domain_id;
>
> 	if (domain_id == -1 || last_nid == MAX_NUMNODES)
> 		return NUMA_NO_NODE;
>
> 	nid = __affinity_domain_to_nid(domain_id, last_nid);

So this is the pseries fast path: attempt to look up the Linux node for
the given domain, where last_nid is the highest-numbered node in use so
far. If the result is in [0..last_nid] we're done.

>
> 	if (nid == NUMA_NO_NODE) {
> 		spin_lock(&node_id_lock);

If the lookup fails, enter the critical section. As we discussed
offline, this is a precaution for potentially parallel device probing.

> 		/* recheck with lock held */
> 		nid = __affinity_domain_to_nid(domain_id, last_nid);

Attempt the same lookup again. If the result is in [0..last_nid],
another thread has just initialized the mapping for this domain and
we're done.

> 		if (nid == NUMA_NO_NODE) {
> 			nid = last_nid++;
> 			domain_id_map[nid] = domain_id;
> 		}

If the lookup fails again, "allocate" the next unused Linux node
number. Otherwise, use the result returned by the second call to
__affinity_domain_to_nid().

> 		spin_unlock(&node_id_lock);
> 	}
>
> 	return nid;
> }

Generally I agree with this approach, but I don't quite get the
locking. I understand there could be a need for a lockless fast path,
but as written I don't think last_nid is appropriately protected - two
slow-path threads could cause an increment to be "lost", since last_nid
is loaded before taking the lock.
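To make that concrete, here is an untested sketch of what I have in
mind, reusing your struct affinity_domain, domain_id_map[] and
__affinity_domain_to_nid() from above unchanged: the lockless read of
last_nid is made explicit, and every update of last_nid and
domain_id_map[] happens inside the critical section. The
acquire/release pairing is only there so the fast path never sees a
last_nid that has run ahead of the corresponding domain_id_map[] entry.

int affinity_domain_to_nid(struct affinity_domain *domain)
{
	static int last_nid;
	static DEFINE_SPINLOCK(node_id_lock);
	int domain_id = domain->id;
	int nid;

	/* PowerNV: use the domain id as the Linux node id directly. */
	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return domain_id;

	if (domain_id == -1)
		return NUMA_NO_NODE;

	/*
	 * Lockless fast path. A stale last_nid only means we might miss
	 * a freshly added entry and fall through to the slow path.
	 */
	nid = __affinity_domain_to_nid(domain_id, smp_load_acquire(&last_nid));
	if (nid != NUMA_NO_NODE)
		return nid;

	spin_lock(&node_id_lock);
	/* Recheck with the lock held; last_nid cannot change under us now. */
	nid = __affinity_domain_to_nid(domain_id, last_nid);
	if (nid == NUMA_NO_NODE && last_nid < MAX_NUMNODES) {
		nid = last_nid;
		domain_id_map[nid] = domain_id;
		/* Publish the new entry before raising the upper bound. */
		smp_store_release(&last_nid, last_nid + 1);
	}
	spin_unlock(&node_id_lock);

	return nid;
}

The fast path can still race with a concurrent insertion and miss it,
but in that case it simply falls through to the locked lookup, which is
where correctness is decided.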