Hi Dave,

Apologies for the late reply.
On 2020/11/21 1:48, Dave Hansen wrote:
> On 11/17/20 6:54 PM, Shaokun Zhang wrote:
>> From: Yuqi Jin <jiny...@huawei.com>
>>
>> In multi-processor and NUMA system, I/O driver will find cpu cores that
>> which shall be bound IRQ. When cpu cores in the local numa have been
>> used up, it is better to find the node closest to the local numa node
>> for performance, instead of choosing any online cpu immediately.
>>
>> On arm64 or x86 platform that has 2-sockets and 4-NUMA nodes, if the
>> network card is located in node2 of socket1, while the number queues
>> of network card is greater than the number of cores of node2, when all
>> cores of node2 has been bound to the queues, the remaining queues will
>> be bound to the cores of node0 which is further than NUMA node3.
>
> That's quite the run-on sentence. :)
>
>> It is
>> not friendly for performance or Intel's DDIO (Data Direct I/O Technology)
>
> Could you explain *why* it is not friendly to DDIO specifically?  This
> patch affects where the interrupt handler runs.  But, DDIO is based on
> memory locations rather than the location of the interrupt handler.
>
> It would be ideal to make that connection: How does the location of the
> interrupt handler impact the memory allocation location?
>

When the interrupt handler runs on a CPU in the other socket, that handler
still needs the BD (buffer descriptor), the packet header, and sometimes even
the payload of the RX packet. DDIO can only place that data in the cache of
the socket the device is attached to; it cannot push the data to the remote
socket where the handler runs.

>> when if the user enables SNC (sub-NUMA-clustering).
>
> Again, the role that SNC plays here isn't spelled out.  I *believe* it's
> because SNC ends up reducing the number of CPUs in each NUMA node.  That
> makes the existing code run out of CPUs to which to bind to the "local"
> node sooner.

Yes.

>
>> +static int find_nearest_node(int node, bool *used)
>> +{
>> +	int i, min_dist, node_id = -1;
>> +
>> +	/* Choose the first unused node to compare */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +			break;
>> +		}
>> +	}
>> +
>> +	/* Compare and return the nearest node */
>> +	for (i = 0; i < nr_node_ids; i++) {
>> +		if (node_distance(node, i) < min_dist && used[i] == false) {
>> +			min_dist = node_distance(node, i);
>> +			node_id = i;
>> +		}
>> +	}
>> +
>> +	return node_id;
>> +}
>> +
>>  /**
>>   * cpumask_local_spread - select the i'th cpu with local numa cpu's first
>>   * @i: index number
>>   * @node: local numa_node
>>   *
>>   * This function selects an online CPU according to a numa aware policy;
>> - * local cpus are returned first, followed by non-local ones, then it
>> - * wraps around.
>> + * local cpus are returned first, followed by the next one which is the
>> + * nearest unused NUMA node based on NUMA distance, then it wraps around.
>>   *
>>   * It's not very efficient, but useful for setup.
>>   */
>>  unsigned int cpumask_local_spread(unsigned int i, int node)
>
> FWIW, I think 'i' is criminally bad naming.  It should be called
> nr_cpus_to_skip or something similar.
>

OK, I really didn't consider this parameter naming before.

> I also detest the comments that are there today.
>
>	Loop through all the online CPUs on the system.  Start with the
>	CPUs on 'node', then fall back to CPUs on NUMA nodes which are
>	increasingly far away.
>
>	Skip the first 'nr_cpus_to_skip' CPUs which are found.
>
>	This function is not very efficient, especially for large
>	'nr_cpus_to_skip' because it loops over the same CPUs on each
>	call and does not remember its state from previous calls.
>

Sorry for my poor comment; I will follow your wording.

>>  {
>> -	int cpu, hk_flags;
>> +	static DEFINE_SPINLOCK(spread_lock);
>> +	static bool used[MAX_NUMNODES];
>
> I thought I mentioned this last time.  How large is this array?  How
> large would it be if it were a nodemask_t?  Would this be less code if
> you just dynamically allocated and freed the node mask instead of having
> a spinlock and a memset?
>

Apologies that I forgot to address this last time. OK, but I think the
spinlock is still needed; am I missing something?
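To make sure I understand the suggestion, here is a rough, untested sketch of
what the nodemask_t variant might look like (the names are mine, not from the
patch, and it relies on the nodemask.h/topology.h helpers already available
in lib/cpumask.c). A nodemask_t is MAX_NUMNODES bits rather than MAX_NUMNODES
bytes, and the two loops in find_nearest_node() collapse into one:

	static int find_nearest_unused_node(int node, const nodemask_t *used)
	{
		int i, min_dist = INT_MAX, node_id = NUMA_NO_NODE;

		/* Pick the unused node with the smallest NUMA distance. */
		for (i = 0; i < nr_node_ids; i++) {
			if (node_isset(i, *used))
				continue;
			if (node_distance(node, i) < min_dist) {
				min_dist = node_distance(node, i);
				node_id = i;
			}
		}

		return node_id;
	}

The caller would nodes_clear() a local (or dynamically allocated) mask and
node_set() each chosen node, which removes the static array and the memset;
whether the spinlock can then go away as well is exactly the part I am asking
about above.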
>> +	unsigned long flags;
>> +	int cpu, hk_flags, j, id;
>>  	const struct cpumask *mask;
>>
>>  	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
>> @@ -352,20 +379,27 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
>>  			return cpu;
>>  		}
>>  	} else {
>> -		/* NUMA first. */
>> -		for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
>> -			if (i-- == 0)
>> -				return cpu;
>> +		spin_lock_irqsave(&spread_lock, flags);
>> +		memset(used, 0, nr_node_ids * sizeof(bool));
>> +		/* select node according to the distance from local node */
>> +		for (j = 0; j < nr_node_ids; j++) {
>> +			id = find_nearest_node(node, used);
>> +			if (id < 0)
>> +				break;
>
> There's presumably an outer loop in a driver which is trying to bind a
> bunch of interrupts to a bunch of CPUs.  We know there are on the order
> of dozens of these interrupts.
>
>	for_each_interrupt()			// in the driver
>		for (j = 0; j < nr_node_ids; j++)	// cpumask_local_spread()
>			// find_nearest_node():
>			for (i = 0; i < nr_node_ids; i++) {
>			for (i = 0; i < nr_node_ids; i++) {
>
> Does this worry anybody else?  I thought our upper limit on the number
> of NUMA nodes was 1024.  Doesn't that make our loop O(N^3) where the
> worst case is hundreds of millions of loops?
>

If a real system has 1024 NUMA nodes, it is all the more worthwhile to find
the nearest node instead of choosing a random one. This function is also only
called during I/O device initialization, and its comment already notes that
it is not efficient.

> I don't want to prematurely optimize this, but that seems like something
> that might just fall over on bigger systems.
>
> This also seems really wasteful if we have a bunch of memory-only nodes.
> Each of those will be found via find_nearest_node(), but then this loop:
>

Got it; all of this effort is about choosing the nearest node for
performance. If we don't do it, I think someone will end up having to debug
this again in the future.

>> +			for_each_cpu_and(cpu, cpumask_of_node(id), mask)
>> +				if (i-- == 0) {
>> +					spin_unlock_irqrestore(&spread_lock,
>> +							       flags);
>> +					return cpu;
>> +				}
>> +			used[id] = true;
>>  		}
>
> Will just exit immediately because cpumask_of_node() is empty.

Yes, and that node's used[id] becomes true, so it will not be picked again.

>
> 'used', for instance, should start by setting 'true' for all nodes which
> are not in N_CPUS.

No, because I use 'nr_node_ids', which covers the possible node IDs, for the
check. (For reference, a sketch of the N_CPU pre-marking you describe is at
the end of this mail.)

Thanks,
Shaokun

>
>> +		spin_unlock_irqrestore(&spread_lock, flags);
>>
>> -		for_each_cpu(cpu, mask) {
>> -			/* Skip NUMA nodes, done above. */
>> -			if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
>> -				continue;
>> -
>> +		for_each_cpu(cpu, mask)
>>  			if (i-- == 0)
>>  				return cpu;
>> -		}
>>  	}
>>  	BUG();
>>  }
> .
>
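P.S. Here is my rough, untested reading of the N_CPU pre-marking you
suggested, so we are talking about the same thing (again the names are only
illustrative, and it assumes the 'used' tracking becomes a nodemask_t as
sketched above):

	/*
	 * Mark memory-only nodes (nodes without CPUs) as already "used"
	 * so the nearest-node search never wastes iterations on them.
	 */
	static void mark_cpuless_nodes(nodemask_t *used)
	{
		int nid;

		nodes_clear(*used);
		for (nid = 0; nid < nr_node_ids; nid++)
			if (!node_state(nid, N_CPU))
				node_set(nid, *used);
	}

If I understand correctly, calling something like this before the per-node
loop would also keep the cpumask_of_node(id) iteration above from ever seeing
an empty mask.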