On 2019/10/10 15:32, Michal Hocko wrote:
> On Thu 10-10-19 14:07:21, Yunsheng Lin wrote:
>> On 2019/10/9 20:25, Robin Murphy wrote:
>>> On 2019-10-08 9:38 am, Yunsheng Lin wrote:
>>>> On 2019/9/25 18:41, Peter Zijlstra wrote:
>>>>> On Wed, Sep 25, 2019 at 05:14:20PM +0800, Yunsheng Lin wrote:
>>>>>> From the discussion above, it seems making the node_to_cpumask_map()
>>>>>> NUMA_NO_NODE aware is the most feasible way to move forward.
>>>>>
>>>>> That's still wrong.
>>>>
>>>> Hi, Peter
>>>>
>>>> It seems this discussion has gone around in circles.
>>>>
>>>> From my understanding, NUMA_NO_NODE, which means no NUMA node
>>>> preference, is the state describing the node of a virtual device or a
>>>> physical device that has equal distance to all CPUs.
>>>>
>>>> We can be stricter if the device does have a nearer node, but we
>>>> cannot deny that a device may have no NUMA node preference or node
>>>> affinity, which also means the control or data buffer can be
>>>> allocated at the node where the process is running.
>>>>
>>>> As you have proposed, making it -2 and having dev_to_node() warn if
>>>> the device does have a nearer node that is not set by the fw is a way
>>>> to be stricter.
>>>>
>>>> But I think being stricter is not really relevant to NUMA_NO_NODE,
>>>> because we do need a state to describe a device that has equal
>>>> distance to all nodes, even if it is not physically scalable.
>>>>
>>>> Any better suggestion to move this forward?
>>>
>>> FWIW (since this is in my inbox), it sounds like the fundamental issue
>>> is that NUMA_NO_NODE is conflated for at least two different purposes,
>>> so trying to sort that out would be a good first step.
>>> AFAICS we have genuine "don't care" cases like alloc_pages_node(),
>>> where if the producer says it doesn't matter then the consumer is free
>>> to make its own judgement on what to do, and fundamentally different
>>> "we expect this thing to have an affinity but it doesn't, so we can't
>>> say what's appropriate" cases which could really do with some separate
>>> indicator like "NUMA_INVALID_NODE".
>>>
>>> The tricky part is then bestowed on the producers to decide whether
>>> they can downgrade "invalid" to "don't care". You can technically
>>> build 'a device' whose internal logic is distributed between nodes and
>>> thus appears to have equal affinity - interrupt controllers, for
>>> example, may have per-CPU or per-node interfaces that end up looking
>>> like that - so although it's unlikely it's not outright nonsensical.
>>> Similarly a 'device' that's actually emulated behind a firmware call
>>> interface may well effectively have no real affinity.
>>
>> We may set the node of a physical device to NUMA_INVALID_NODE when the
>> fw does not provide one.
>>
>> But what do we do about NUMA_INVALID_NODE when alloc_pages_node() is
>> called with nid being NUMA_INVALID_NODE?
>
> There is nothing sensible the allocator can do. The only point of
> NUMA_INVALID_NODE would be to catch potential misconfigurations and
> report them to users so they can complain to their HW/FW suppliers.
>
> Pushing it to other subsystems doesn't make much sense IMHO because
> there is nothing really actionable. Refusing an allocation altogether
> sounds like a bad plan to me.
>
>> If we change the node to a default one (like node 0) when the node of
>> a device is NUMA_INVALID_NODE in device_add(), how do we know the
>> default one (like node 0) is the right one to choose?
>
> Exactly. We cannot really assume any node in that situation.
>
>> From the previous discussion, the below seems not to have reached
>> consensus yet:
>> 1) Do we need a state like NUMA_NO_NODE to describe that the device
>>    does not have any NUMA preference?
>
> This is the traditional meaning the MM subsystem is using.
>
>> 2) What do we do if the fw does not provide a node for the device?
>>    Should we guess and pick one for it, and how do we do the guessing?
>>    Or leave it as it is and handle it as NUMA_NO_NODE?
>
> As already pointed out several times, picking any node is rather error
> prone. You can never assume topology. We used to assume that there
> would always be node 0 but that is not really the case (see
> 3e8589963773 ("memcg: make it work on sparse non-0-node systems")).
> Nodes might also come and go, so this might just lead to all sorts of
> subtle problems.
>
> On the other hand, using NUMA_NO_NODE as no preference could only lead
> to slightly sub-optimal performance.
>
> I do agree with Peter that reporting a lack of affinity might be useful
> but we shouldn't really try to be clever and make up the affinity
> willy-nilly.
Ok, thanks for clarifying. It seems we need to do the below if I
understand it correctly:
1. Fix up the node id of a device that has a clear node affinity but
   does not have it set, as much as possible.
2. Catch the case where a device should have a nearer node but does not
   have it set, and warn about it.

But I fail to see why the above is related to making
node_to_cpumask_map() NUMA_NO_NODE aware. There is a clear bug here too
when node_to_cpumask_map() does not handle NUMA_NO_NODE. Maybe we can
make node_to_cpumask_map() NUMA_NO_NODE aware first and then deal with
the above, because even after the above is handled, we still need to
handle NUMA_NO_NODE in node_to_cpumask_map(), and it may take time to
deal with the above when it is tricky to decide whether or not a device
has a clear node.