Hi Dave, 在 2020/11/14 0:02, Dave Hansen 写道: > On 11/12/20 6:06 PM, Shaokun Zhang wrote: >>>> On Huawei Kunpeng 920 server, there are 4 NUMA node(0 - 3) in the 2-cpu >>>> system(0 - 1). The topology of this server is followed: >>> >>> This is with a feature enabled that Intel calls sub-NUMA-clustering >>> (SNC), right? Explaining *that* feature would also be great context for >> >> Correct, >> >>> why this gets triggered on your system and not normally on others and >>> why nobody noticed this until now. >> >> This is on intel 6248 platform: > > I have no idea what a "6248 platform" is. >
My apologies that it's Cascade Lake, [1] >>>> +static void calc_node_distance(int *node_dist, int node) >>>> +{ >>>> + int i; >>>> + >>>> + for (i = 0; i < nr_node_ids; i++) >>>> + node_dist[i] = node_distance(node, i); >>>> +} >>> >>> This appears to be the only place node_dist[] is written. That means it >>> always contains a one-dimensional slice of the two-dimensional data >>> represented by node_distance(). >>> >>> Why is a copy of this data needed? >> >> It is used to store the distance with the @node for later, apologies that I >> can't follow your question correctly. > > Right, the data that you store is useful. *But*, it's also a verbatim > copy of the data from node_distance(). Why not just use node_distance() > directly in your code rather than creating a partial copy of it in the Ok, I will remove this redundant function in next version. > local node_dist[] array? > > >>>> unsigned int cpumask_local_spread(unsigned int i, int node) >>>> { >>>> - int cpu, hk_flags; >>>> + static DEFINE_SPINLOCK(spread_lock); >>>> + static int node_dist[MAX_NUMNODES]; >>>> + static bool used[MAX_NUMNODES]; >>> >>> Not to be *too* picky, but there is a reason we declare nodemask_t as a >>> bitmap and not an array of bools. Isn't this just wasteful? >>> >>>> + unsigned long flags; >>>> + int cpu, hk_flags, j, id; >>>> const struct cpumask *mask; >>>> >>>> hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ; >>>> @@ -220,20 +256,28 @@ unsigned int cpumask_local_spread(unsigned int i, >>>> int node) >>>> return cpu; >>>> } >>>> } else { >>>> - /* NUMA first. */ >>>> - for_each_cpu_and(cpu, cpumask_of_node(node), mask) { >>>> - if (i-- == 0) >>>> - return cpu; >>>> - } >>>> + spin_lock_irqsave(&spread_lock, flags); >>>> + memset(used, 0, nr_node_ids * sizeof(bool)); >>>> + calc_node_distance(node_dist, node); >>>> + /* Local node first then the nearest node is used */ >>> >>> Is this comment really correct? This makes it sound like there is only >> >> I think it is correct, that's what we want to choose the nearest node. >> >>> fallback to a single node. Doesn't the _code_ fall back basically >>> without limit? >> >> If I follow your question correctly, without this patch, if the local >> node is used up, one random node will be choosed, right? Now we firstly >> choose the nearest node by the distance, if all nodes has been choosen, >> it will return the initial solution. > > The comment makes it sound like the code does: > 1. Do the local node > 2. Do the next nearest node > 3. Stop > That's more clear, I will udpate the comments as the new patch. > In reality, I *think* it's more of a loop where it search > ever-increasing distances away from the local node. > > I just think the comment needs to be made more precise. Got it. > >>>> + for (j = 0; j < nr_node_ids; j++) { >>>> + id = find_nearest_node(node_dist, used); >>>> + if (id < 0) >>>> + break; >>>> >>>> - for_each_cpu(cpu, mask) { >>>> - /* Skip NUMA nodes, done above. */ >>>> - if (cpumask_test_cpu(cpu, cpumask_of_node(node))) >>>> - continue; >>>> + for_each_cpu_and(cpu, cpumask_of_node(id), mask) >>>> + if (i-- == 0) { >>>> + spin_unlock_irqrestore(&spread_lock, >>>> + flags); >>>> + return cpu; >>>> + } >>>> + used[id] = 1; >>>> + } >>>> + spin_unlock_irqrestore(&spread_lock, flags); >>> >>> The existing code was pretty sparsely commented. This looks to me to >>> make it more complicated and *less* commented. Not the best combo. >> >> Apologies for the bad comments, hopefully I describe it clearly by the above >> explantion. > > Do you want to take another pass at submitting this patch? 'Another pass'? Sorry for my bad understading, I don't follow it correctly. Thanks, Shaokun [1]https://en.wikichip.org/wiki/intel/xeon_gold/6248 > . >