Hi Dave,

Apologies for the late reply.
On 2020/12/1 1:08, Dave Hansen wrote:
>>>>  {
>>>> -	int cpu, hk_flags;
>>>> +	static DEFINE_SPINLOCK(spread_lock);
>>>> +	static bool used[MAX_NUMNODES];
>>>
>>> I thought I mentioned this last time.  How large is this array?  How
>>> large would it be if it were a nodemask_t?  Would this be less code if
>>
>> Apologies that I forgot to do it.
>>
>>> you just dynamically allocated and freed the node mask instead of having
>>> a spinlock and a memset?
>>
>> Ok, but I think the spinlock is also needed, do I miss something?
>
> There was no spinlock there before your patch.  You just need it to
> protect the structures you declared static.  If you didn't have static
> structures, you wouldn't need a lock.

Got it, I will allocate it dynamically.

>
>>>> +	unsigned long flags;
>>>> +	int cpu, hk_flags, j, id;
>>>>  	const struct cpumask *mask;
>>>>
>>>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>>>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>>>  			return cpu;
>>>>  		}
>>>>  	} else {
>>>> -		/* NUMA first. */
>>>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>>>> -			if (i-- == 0)
>>>> -				return cpu;
>>>> +		spin_lock_irqsave(&spread_lock, flags);
>>>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>>>> +		/* select node according to the distance from local node */
>>>> +		for (j = 0; j < nr_node_ids; j++) {
>>>> +			id = find_nearest_node(node, used);
>>>> +			if (id < 0)
>>>> +				break;
>>>
>>> There's presumably an outer loop in a driver which is trying to bind a
>>> bunch of interrupts to a bunch of CPUs.  We know there are on the order
>>> of dozens of these interrupts.
>>>
>>> 	for_each_interrupt()				// in the driver
>>> 		for (j = 0; j < nr_node_ids; j++)	// cpumask_local_spread()
>>> 			// find_nearest_node():
>>> 			for (i = 0; i < nr_node_ids; i++) {
>>> 				for (i = 0; i < nr_node_ids; i++) {
>>>
>>> Does this worry anybody else?  I thought our upper limit on the number
>>> of NUMA nodes was 1024.
>>> Doesn't that make our loop O(N^3), where the
>>> worst case is hundreds of millions of loops?
>>
>> If there are 1024 NUMA nodes in a real system, it is more worthwhile to
>> find the nearest node rather than choose a random one, and this is only
>> called during I/O device initialization. Comments are also given on this
>> interface.
>
> This doesn't really make me feel better.  An end user booting this on a

My bad, I only wanted to explain the issue.

> big system with a bunch of cards could see a minutes-long delay.  I can

Indeed.

> also see funky stuff happening like if we have a ton of NUMA nodes and
> few CPUs.
>
>>> I don't want to prematurely optimize this, but that seems like something
>>> that might just fall over on bigger systems.
>>>
>>> This also seems really wasteful if we have a bunch of memory-only nodes.
>>> Each of those will be found via find_nearest_node(), but then this loop:
>>
>> Got it, all this effort goes into choosing the nearest node for
>> performance. If we don't do it, I think someone will end up debugging
>> this in the future.
>
> If we're going to kick the can down the road for some poor sod to debug,
> can we at least help them out with a warning?
>
> Maybe we WARN_ONCE() after we fall back for more than 2 or 3 nodes.

Ok, will do.

> But, I still don't think you've addressed my main concern: This is
> horrifically inefficient searching for CPUs inside nodes that are known
> to have no CPUs.
How about optimizing as follows:

+	for (j = 0; j < nr_node_ids; j++) {
+		id = find_nearest_node(node, nodes);
+		if (id < 0)
+			break;
+
+		nmask = cpumask_of_node(id);
+		cpumask_and(&node_possible_mask, mask, nmask);
+		cpu_of_node = cpumask_weight(&node_possible_mask);
+		if (cpu_index >= cpu_of_node) {
+			/* not in this node, skip all of its CPUs at once */
+			cpu_index -= cpu_of_node;
+			node_set(id, nodes);
+			continue;
+		}
+
+		for_each_cpu(cpu, &node_possible_mask)
+			if (cpu_index-- == 0)
+				return cpu;
+
+		node_set(id, nodes);
+	}

>
>>>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>>>> +				if (i-- == 0) {
>>>> +					spin_unlock_irqrestore(&spread_lock,
>>>> +							       flags);
>>>> +					return cpu;
>>>> +				}
>>>> +			used[id] = true;
>>>>  		}
>>>
>>> Will just exit immediately because cpumask_of_node() is empty.
>>
>> Yes, and this node's used[id] became true.
>>
>>>
>>> 'used', for instance, should start by setting 'true' for all nodes which
>>> are not in N_CPUS.
>>
>> No, because I used 'nr_node_ids', which is the number of possible node
>> ids, to check.
>
> I'm saying that it's wasteful to loop over and search in all the nodes.

If you are happy with the code above, it should also solve the issue.

Thanks,
Shaokun