On 4/1/21 4:59 AM, David Gibson wrote:
> On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote:
>> On 3/31/21 2:57 AM, David Gibson wrote:
>>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>>>> topology_core_cpumask") caused a regression in the pseries machine
>>>>>>>>>>> when defining certain SMP topologies [1]. The reasoning behind the
>>>>>>>>>>> change is explained in kernel commit 4ca234a9cbd7 ("powerpc/smp:
>>>>>>>>>>> Stop updating cpu_core_mask"). In short, the cpu_core_mask logic
>>>>>>>>>>> was causing trouble with large VMs with lots of CPUs and was
>>>>>>>>>>> replaced by cpu_cpu_mask because, as far as the kernel's
>>>>>>>>>>> understanding of SMP topologies goes, both masks are equivalent.
>>>>>>>>>>>
>>>>>>>>>>> Further discussions on the kernel mailing list [2] showed that the
>>>>>>>>>>> powerpc kernel has always considered the number of sockets to be
>>>>>>>>>>> equal to the number of NUMA nodes. The claim is that it doesn't
>>>>>>>>>>> make sense, for Power hardware at least, to have 2+ sockets in the
>>>>>>>>>>> same NUMA node. The immediate conclusion is that all the SMP
>>>>>>>>>>> topologies the pseries machine was supplying to the kernel, with
>>>>>>>>>>> more than one socket in the same NUMA node as in [1], happened to
>>>>>>>>>>> be correctly represented in the kernel by accident during all
>>>>>>>>>>> these years.
>>>>>>>>>>>
>>>>>>>>>>> There's a case to be made for virtual topologies being detached
>>>>>>>>>>> from hardware constraints, allowing maximum flexibility to users.
>>>>>>>>>>> At the same time, this freedom can't result in unrealistic
>>>>>>>>>>> hardware representations being emulated. If the real hardware and
>>>>>>>>>>> the pseries kernel don't support multiple chips/sockets in the
>>>>>>>>>>> same NUMA node, neither should we.
>>>>>>>>>>>
>>>>>>>>>>> Starting in 6.0.0, all sockets must match a unique NUMA node in
>>>>>>>>>>> the pseries machine. qtest changes were made to adapt to this new
>>>>>>>>>>> condition.
>>>>>>>>>>
>>>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
>>>>>>>>>> for users to match these up, for no real gain. I'm also concerned
>>>>>>>>>> that this will require follow-on changes in libvirt to not make
>>>>>>>>>> this a really cryptic and irritating point of failure.
>>>>>>>>>
>>>>>>>>> I haven't thought about the required Libvirt changes, although I can
>>>>>>>>> say that there will be some amount to be made and it will probably
>>>>>>>>> annoy existing users (everyone that has a topology with multiple
>>>>>>>>> sockets per NUMA node).
>>>>>>>>>
>>>>>>>>> There is not much we can do from the QEMU layer aside from what I've
>>>>>>>>> proposed here.
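
Some context on the kernel side, since the cover letter above only names the
commits: if I read the generic topology code correctly, the reason the guest
now reports one "socket" per NUMA node is that topology_core_cpumask ends up
pointing at cpu_cpu_mask(), which is defined purely in terms of the node a
CPU sits in. A rough sketch from memory, not a verbatim quote of the commits:

    /* include/linux/topology.h (generic code) */
    static inline const struct cpumask *cpu_cpu_mask(int cpu)
    {
            return cpumask_of_node(cpu_to_node(cpu));
    }

    /* since 4bce545903fa, arch/powerpc reports this mask as the "socket" */
    #define topology_core_cpumask(cpu)  (cpu_cpu_mask(cpu))

So whatever socket layout QEMU describes, the guest's sysfs/lscpu view now
collapses it to one socket per NUMA node.
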
>>>>>>>>> The other alternative is to keep interacting with the kernel folks
>>>>>>>>> to see if there is a way to keep our use case untouched.
>>>>>>>>
>>>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
>>>>>>>> replies from Cédric to my objections and mpe's. Even with sockets
>>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>>>> NUMA nodes makes sense.
>>>>>>>
>>>>>>> I did a couple of replies in different email threads but maybe not to
>>>>>>> all. I felt it was going nowhere :/ Couple of thoughts,
>>>>>>
>>>>>> I think I saw some of those, but maybe not all.
>>>>>>
>>>>>>> Shouldn't we get rid of the socket concept, and the die as well, under
>>>>>>> pseries, since they don't exist under PAPR ? We only have NUMA nodes,
>>>>>>> cores and threads AFAICT.
>>>>>>
>>>>>> Theoretically, yes. I'm not sure it's really practical, though, since
>>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though
>>>>>> not dies) built into generic code.
>>>>>
>>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
>>>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>>>
>>>>>> It does mean that one possible approach here - maybe the best one - is
>>>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>>>> don't expect what the guest kernel reports to match what's given to
>>>>>> qemu.
>>>>>>
>>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>>>> but it is likely to surprise and confuse people.
>>>>>>
>>>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
>>>>>>> There are a couple of places where Linux checks for the underlying
>>>>>>> hypervisor already.
>>>>>>>
>>>>>>>>> This also means that 'ibm,chip-id' will probably remain in use since
>>>>>>>>> it's the only place where we inform cores-per-socket information to
>>>>>>>>> the kernel.
>>>>>>>>
>>>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>>>> information. I haven't given up hope for that yet.
>>>>>>>
>>>>>>> Well, we could start by fixing the value in QEMU. It is broken today.
>>>>>>
>>>>>> Fixing what value, exactly?
>>>>>
>>>>> The value of "ibm,chip-id", since we are keeping the property under
>>>>> QEMU.
>>>>
>>>> David, I believe this has to do with the discussion we had last Friday.
>>>>
>>>> I mentioned that the ibm,chip-id property is being calculated in a way
>>>> that gives the same ibm,chip-id to CPUs that belong to different NUMA
>>>> nodes, e.g.:
>>>>
>>>> -smp 4,cores=4,maxcpus=8,threads=1 \
>>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>>>
>>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>>>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>>>> ibm,pft-size = <0x00 0x19>;
>>>> ibm,chip-id = <0x00>;
>>>>
>>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>>>> different NUMA node than 0-1. This would mean that the same socket
>>>> would belong to different NUMA nodes at the same time.
>>>
>>> Right... and I'm still not seeing why that's a problem. AFAICT that's
>>> a possible, if unexpected, situation under real hardware - though maybe
>>> not for POWER9 specifically.
>>
>> The ibm,chip-id property does not exist under PAPR. PAPR only has
>> NUMA nodes, no sockets nor chips.
>>
>> And the property value is simply broken under QEMU.
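
To make the "broken" part concrete: if I read hw/ppc/spapr.c correctly, the
value QEMU writes in the device tree is derived only from the vcpu index and
the -smp geometry, and never looks at the -numa binding. A simplified sketch
of that derivation (paraphrased, not the exact QEMU source):

    /* one "socket" is cores-per-socket * threads-per-core vcpus */
    int vcpus_per_socket = ms->smp.cores * ms->smp.threads;

    /*
     * Daniel's example above: cores=4,threads=1 gives vcpus_per_socket = 4,
     * so vcpus 0-3 all get ibm,chip-id = 0 even though vcpus 2-3 were
     * placed in NUMA node 1. With cores=1 (see the example below) every
     * vcpu becomes its own socket and the chip-id simply equals the vcpu
     * index.
     */
    int chip_id = cs->cpu_index / vcpus_per_socket;

So whatever -numa says, the chip-id only follows -smp.
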
>> Try this :
>>
>> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G
>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0
>> -object memory-backend-ram,id=ram-node1,size=2G
>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>> # dmesg | grep numa
>> [    0.013106] numa: Node 0 CPUs: 0-1
>> [    0.013136] numa: Node 1 CPUs: 2-3
>>
>> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
>> ibm,chip-id = <0x01>;
>> ibm,chip-id = <0x02>;
>> ibm,chip-id = <0x00>;
>> ibm,chip-id = <0x03>;
>
> Not seeing the problem here. You've specified 8 (possible) cpus and 1
> core per socket, therefore there are 4 sockets, hence 4 chip-ids.
> Again, I don't see where this assumption that the chip-id is somehow
> related to the NUMA topology is coming from.
>
>>>> I believe this is what Cedric wants to be addressed. Given that the
>>>> property is named after the OPAL property ibm,chip-id, the kernel
>>>> expects that the property will have the same semantics as in OPAL.
>>>
>>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>>> configuration, rather than getting all the NUMA info from
>>> associativity properties.
>>
>> It is the case.
>
> What exactly is the case?

Sorry, that was badly phrased. I meant: powernv gets the NUMA info from the
"ibm,associativity" property.

C.

>> The associativity properties are built from the chip-id in OPAL though.
>
> Ok, so? Why do we care how OPAL builds the associativity properties
> once we *have* the associativity properties it built?
>
>> The chip-id property is only used in low-level PowerNV drivers, VAS,
>> XSCOM, LPC, etc.
>
> Which AFAIK really *do* need to know which chip they're on, not the
> NUMA topology.
>
>> It's also badly used in the common part of the XIVE driver, which I am
>> trying to fix in order to introduce an IPI per node on all platforms.
>
> See comments on the other thread.
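
FWIW, the way Linux consumes "ibm,associativity" looks the same to me on
pseries and powernv: the nid comes straight out of the associativity array at
the depth chosen from "ibm,associativity-reference-points", without ever
looking at the chip-id. Roughly what arch/powerpc/mm/numa.c does, paraphrased
from memory rather than quoted:

    /* first cell = number of entries that follow */
    static int associativity_to_nid(const __be32 *associativity)
    {
            int nid = NUMA_NO_NODE;

            if (of_read_number(associativity, 1) >= min_common_depth)
                    nid = of_read_number(&associativity[min_common_depth], 1);

            /* an invalid or out-of-range domain id means "no node" */
            if (nid == 0xffff || nid >= nr_node_ids)
                    nid = NUMA_NO_NODE;

            return nid;
    }

which is why, once the associativity arrays are in the device tree, the NUMA
layout no longer depends on how the chip-id was assigned.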