On 3/31/21 2:57 AM, David Gibson wrote:
> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>> topology_core_cpumask") caused a regression in the pseries machine
>>>>>>>>> when defining certain SMP topologies [1]. The reasoning behind the
>>>>>>>>> change is explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop
>>>>>>>>> updating cpu_core_mask"). In short, the cpu_core_mask logic was
>>>>>>>>> causing trouble with large VMs with lots of CPUs and was replaced by
>>>>>>>>> cpu_cpu_mask because, as far as the kernel's understanding of SMP
>>>>>>>>> topologies goes, both masks are equivalent.
>>>>>>>>>
>>>>>>>>> Further discussions on the kernel mailing list [2] showed that the
>>>>>>>>> powerpc kernel has always considered the number of sockets to be
>>>>>>>>> equal to the number of NUMA nodes. The claim is that it doesn't make
>>>>>>>>> sense, for Power hardware at least, to have 2+ sockets in the same
>>>>>>>>> NUMA node. The immediate conclusion is that all SMP topologies the
>>>>>>>>> pseries machine was supplying to the kernel, with more than one
>>>>>>>>> socket in the same NUMA node as in [1], happened to be correctly
>>>>>>>>> represented in the kernel by accident during all these years.
>>>>>>>>>
>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the
>>>>>>>>> same time, this freedom can't result in unrealistic hardware
>>>>>>>>> representations being emulated. If the real hardware and the pseries
>>>>>>>>> kernel don't support multiple chips/sockets in the same NUMA node,
>>>>>>>>> neither should we.
>>>>>>>>>
>>>>>>>>> Starting in 6.0.0, all sockets must match a unique NUMA node in the
>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>> condition.
>>>>>>>>
>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
>>>>>>>> for users to match these up, for no real gain. I'm also concerned
>>>>>>>> that this will require follow-on changes in libvirt to not make this
>>>>>>>> a really cryptic and irritating point of failure.
>>>>>>>
>>>>>>> I haven't thought about the required libvirt changes, although I can
>>>>>>> say that there will be some amount to be made and it will probably
>>>>>>> annoy existing users (everyone that has a multiple-sockets-per-NUMA-node
>>>>>>> topology).
>>>>>>>
>>>>>>> There is not much we can do from the QEMU layer aside from what I've
>>>>>>> proposed here. The other alternative is to keep interacting with the
>>>>>>> kernel folks to see if there is a way to keep our use case untouched.
>>>>>>
>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
>>>>>> replies from Cédric to my objections and mpe's. Even with sockets
>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>> NUMA nodes makes sense.
>>>>>
>>>>> I did a couple of replies in different email threads but maybe not
>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>
>>>> I think I saw some of those, but maybe not all.
>>>>
>>>>> Shouldn't we get rid of the socket concept, and the die also, under
>>>>> pseries, since they don't exist under PAPR? We only have NUMA nodes,
>>>>> cores and threads AFAICT.
>>>>
>>>> Theoretically, yes. I'm not sure it's really practical, though, since
>>>> AFAICT both qemu and the kernel have the notion of sockets (though
>>>> not dies) built into generic code.
>>>
>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc",
>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>
>>>> It does mean that one possible approach here - maybe the best one - is
>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>> don't expect what the guest kernel reports to match what's given to
>>>> qemu.
>>>>
>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>> but it is likely to surprise and confuse people.
>>>>
>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
>>>>> There are a couple of places where Linux checks for the underlying
>>>>> hypervisor already.
>>>>>
>>>>>>> This also means that 'ibm,chip-id' will probably remain in use since
>>>>>>> it's the only place where we inform the kernel of the cores-per-socket
>>>>>>> information.
>>>>>>
>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>> information. I haven't given up hope for that yet.
>>>>>
>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>> today.
>>>>
>>>> Fixing what value, exactly?
>>>
>>> The value of the "ibm,chip-id", since we are keeping the property under
>>> QEMU.
>>
>> David, I believe this has to do with the discussion we had last Friday.
>>
>> I mentioned that the ibm,chip-id property is being calculated in a way
>> that assigns the same ibm,chip-id to CPUs that belong to different NUMA
>> nodes, e.g.:
>>
>> -smp 4,cores=4,maxcpus=8,threads=1 \
>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>>
>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>
>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>> different NUMA node than 0-1. This would mean that the same socket
>> would belong to different NUMA nodes at the same time.
>
> Right... and I'm still not seeing why that's a problem. AFAICT that's
> a possible, if unexpected, situation under real hardware - though
> maybe not for POWER9 specifically.

The ibm,chip-id property does not exist under PAPR. PAPR only has NUMA
nodes, no sockets nor chips.
And the property value is simply broken under QEMU. Try this:

  -smp 4,cores=1,maxcpus=8 \
  -object memory-backend-ram,id=ram-node0,size=2G \
  -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
  -object memory-backend-ram,id=ram-node1,size=2G \
  -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

  # dmesg | grep numa
  [ 0.013106] numa: Node 0 CPUs: 0-1
  [ 0.013136] numa: Node 1 CPUs: 2-3

  # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
  ibm,chip-id = <0x01>;
  ibm,chip-id = <0x02>;
  ibm,chip-id = <0x00>;
  ibm,chip-id = <0x03>;

>> I believe this is what Cédric wants to be addressed. Given that the
>> property is named after the OPAL property ibm,chip-id, the kernel
>> expects that the property will have the same semantics as in OPAL.
>
> Even on powernv, I'm not clear why chip-id is tied into the NUMA
> configuration, rather than getting all the NUMA info from
> associativity properties.

It is the case. The associativity properties are built from the chip-id in
OPAL, though. The chip-id property is only used in low-level PowerNV
drivers: VAS, XSCOM, LPC, etc. It is also badly used in the common part of
the XIVE driver, which I am trying to fix in order to introduce an IPI per
node on all platforms.

C.
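
For reference, the chip-id values in the two examples above are consistent
with the property being derived purely from the -smp geometry, i.e.
cpu_index divided by the number of vCPUs per socket (cores * threads), with
the -numa assignment ignored. The following is a minimal stand-alone sketch
of that assumed derivation, not the actual QEMU code; the formula and the
helper names are illustrative only:

  #include <stdio.h>

  int main(void)
  {
      /* The two -smp geometries quoted above (threads defaults to 1). */
      struct { const char *name; int cores, threads; } cfg[] = {
          { "-smp 4,cores=4,threads=1,maxcpus=8", 4, 1 },
          { "-smp 4,cores=1,threads=1,maxcpus=8", 1, 1 },
      };
      /* NUMA mapping used in both examples:
       * nodeid=0,cpus=0-1,cpus=4-5 and nodeid=1,cpus=2-3,cpus=6-7 */
      int node_of_cpu[8] = { 0, 0, 1, 1, 0, 0, 1, 1 };
      int c, cpu;

      for (c = 0; c < 2; c++) {
          int vcpus_per_socket = cfg[c].cores * cfg[c].threads;

          printf("%s:\n", cfg[c].name);
          for (cpu = 0; cpu < 4; cpu++) {      /* the 4 present vCPUs */
              /* assumed derivation: chip-id from -smp geometry only */
              int chip_id = cpu / vcpus_per_socket;

              printf("  cpu %d: ibm,chip-id = %d, NUMA node = %d\n",
                     cpu, chip_id, node_of_cpu[cpu]);
          }
      }
      return 0;
  }

If that assumption holds, cores=4 yields ibm,chip-id = 0 for CPUs 0-3 even
though CPUs 2-3 sit in node 1 (Daniel's example), while cores=1 yields one
chip-id per core, 0 through 3, straddling the two nodes (the output above).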