On 3/31/21 2:57 AM, David Gibson wrote:
> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>>
>>
>> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
>>> On 3/29/21 6:20 AM, David Gibson wrote:
>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
>>>>>>>>> topology_core_cpumask") caused a regression in the pseries machine
>>>>>>>>> when defining certain SMP topologies [1]. The reasoning behind the
>>>>>>>>> change is explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop
>>>>>>>>> updating cpu_core_mask"). In short, the cpu_core_mask logic was
>>>>>>>>> causing trouble with large VMs with lots of CPUs and was replaced by
>>>>>>>>> cpu_cpu_mask because, as far as the kernel's understanding of SMP
>>>>>>>>> topologies goes, both masks are equivalent.
>>>>>>>>>
>>>>>>>>> Further discussions on the kernel mailing list [2] showed that the
>>>>>>>>> powerpc kernel has always considered the number of sockets to be
>>>>>>>>> equal to the number of NUMA nodes. The claim is that it doesn't make
>>>>>>>>> sense, for Power hardware at least, to have 2+ sockets in the same
>>>>>>>>> NUMA node. The immediate conclusion is that all SMP topologies the
>>>>>>>>> pseries machine was supplying to the kernel, with more than one
>>>>>>>>> socket in the same NUMA node as in [1], happened to be correctly
>>>>>>>>> represented in the kernel by accident during all these years.
>>>>>>>>>
>>>>>>>>> There's a case to be made for virtual topologies being detached from
>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the
>>>>>>>>> same time, this freedom can't result in unrealistic hardware
>>>>>>>>> representations being emulated. If the real hardware and the pseries
>>>>>>>>> kernel don't support multiple chips/sockets in the same NUMA node,
>>>>>>>>> neither should we.
>>>>>>>>>
>>>>>>>>> Starting in 6.0.0, all sockets must match a unique NUMA node in the
>>>>>>>>> pseries machine. qtest changes were made to adapt to this new
>>>>>>>>> condition.
>>>>>>>>
>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
>>>>>>>> for users to match these up, for no real gain. I'm also concerned
>>>>>>>> that this will require follow-on changes in libvirt to not make this
>>>>>>>> a really cryptic and irritating point of failure.
>>>>>>>
>>>>>>> I haven't thought about the required libvirt changes, although I can
>>>>>>> say that there will be some amount to be made and it will probably
>>>>>>> annoy existing users (everyone that has a multiple-sockets-per-NUMA-node
>>>>>>> topology).
>>>>>>>
>>>>>>> There is not much we can do from the QEMU layer aside from what I've
>>>>>>> proposed here. The other alternative is to keep interacting with the
>>>>>>> kernel folks to see if there is a way to keep our use case untouched.
>>>>>>
>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more
>>>>>> replies from Cédric to my objections and mpe's. Even with sockets
>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to
>>>>>> NUMA nodes makes sense.
>>>>>
>>>>> I did a couple of replies in different email threads but maybe not
>>>>> to all. I felt it was going nowhere :/ Couple of thoughts,
>>>>
>>>> I think I saw some of those, but maybe not all.
>>>>
>>>>> Shouldn't we get rid of the socket concept, and the die also, under
>>>>> pseries, since they don't exist under PAPR? We only have NUMA nodes,
>>>>> cores and threads AFAICT.
>>>>
>>>> Theoretically, yes. I'm not sure it's really practical, though, since
>>>> AFAICT both qemu and the kernel have the notion of sockets (though
>>>> not dies) built into generic code.
>>>
>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc",
>>> and PPC Linux only has a NUMA node id, on pseries and powernv.
>>>
>>>> It does mean that one possible approach here - maybe the best one - is
>>>> to simply declare that sockets are meaningless under PAPR, so we simply
>>>> don't expect what the guest kernel reports to match what's given to
>>>> qemu.
>>>>
>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
>>>> but it is likely to surprise and confuse people.
>>>>
>>>>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
>>>>> There are a couple of places where Linux checks for the underlying
>>>>> hypervisor already.
>>>>>
>>>>>>> This also means that 'ibm,chip-id' will probably remain in use since
>>>>>>> it's the only place where we inform the kernel of the cores-per-socket
>>>>>>> information.
>>>>>>
>>>>>> Well.. unless we can find some other sensible way to convey that
>>>>>> information. I haven't given up hope for that yet.
>>>>>
>>>>> Well, we could start by fixing the value in QEMU. It is broken
>>>>> today.
>>>>
>>>> Fixing what value, exactly?
>>>
>>> The value of the "ibm,chip-id", since we are keeping the property under
>>> QEMU.
>>
>> David, I believe this has to do with the discussion we had last Friday.
>>
>> I mentioned that the ibm,chip-id property is being calculated in a way
>> that assigns the same ibm,chip-id to CPUs that belong to different NUMA
>> nodes, e.g.:
>>
>> -smp 4,cores=4,maxcpus=8,threads=1 \
>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>>
>>
>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>> --
>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
>> ibm,pft-size = <0x00 0x19>;
>> ibm,chip-id = <0x00>;
>
>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
>> different NUMA node than 0-1. This would mean that the same socket
>> would belong to different NUMA nodes at the same time.
>
> Right... and I'm still not seeing why that's a problem. AFAICT that's
> a possible, if unexpected, situation under real hardware - though
> maybe not for POWER9 specifically.

The ibm,chip-id property does not exist under PAPR. PAPR only has NUMA
nodes, no sockets nor chips.
And the property value is simply broken under QEMU. Try this:

  -smp 4,cores=1,maxcpus=8 \
  -object memory-backend-ram,id=ram-node0,size=2G \
  -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
  -object memory-backend-ram,id=ram-node1,size=2G \
  -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

  # dmesg | grep numa
  [ 0.013106] numa: Node 0 CPUs: 0-1
  [ 0.013136] numa: Node 1 CPUs: 2-3

  # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
  ibm,chip-id = <0x01>;
  ibm,chip-id = <0x02>;
  ibm,chip-id = <0x00>;
  ibm,chip-id = <0x03>;

>> I believe this is what Cédric wants to be addressed. Given that the
>> property is named after the OPAL property ibm,chip-id, the kernel
>> expects that the property will have the same semantics as in OPAL.
>
> Even on powernv, I'm not clear why chip-id is tied into the NUMA
> configuration, rather than getting all the NUMA info from
> associativity properties.

It is the case. The associativity properties are built from the chip-id in
OPAL, though. The chip-id property is only used in low-level PowerNV
drivers: VAS, XSCOM, LPC, etc. It is also badly used in the common part of
the XIVE driver, which I am trying to fix in order to introduce an IPI per
node on all platforms.

C.
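
For reference, the chip-id values in the two examples above are consistent
with the property being derived purely from the -smp geometry, i.e.
cpu_index divided by the number of vCPUs per socket (cores * threads), with
the -numa assignment ignored. The following is a minimal stand-alone sketch
of that assumed derivation, not the actual QEMU code; the formula and the
helper names are illustrative only:

  #include <stdio.h>

  int main(void)
  {
      /* The two -smp geometries quoted above (threads defaults to 1). */
      struct { const char *name; int cores, threads; } cfg[] = {
          { "-smp 4,cores=4,threads=1,maxcpus=8", 4, 1 },
          { "-smp 4,cores=1,threads=1,maxcpus=8", 1, 1 },
      };
      /* NUMA mapping used in both examples:
       * nodeid=0,cpus=0-1,cpus=4-5 and nodeid=1,cpus=2-3,cpus=6-7 */
      int node_of_cpu[8] = { 0, 0, 1, 1, 0, 0, 1, 1 };
      int c, cpu;

      for (c = 0; c < 2; c++) {
          int vcpus_per_socket = cfg[c].cores * cfg[c].threads;

          printf("%s:\n", cfg[c].name);
          for (cpu = 0; cpu < 4; cpu++) {      /* the 4 present vCPUs */
              /* assumed derivation: chip-id from -smp geometry only */
              int chip_id = cpu / vcpus_per_socket;

              printf("  cpu %d: ibm,chip-id = %d, NUMA node = %d\n",
                     cpu, chip_id, node_of_cpu[cpu]);
          }
      }
      return 0;
  }

If that assumption holds, cores=4 yields ibm,chip-id = 0 for CPUs 0-3 even
though CPUs 2-3 sit in node 1 (Daniel's example), while cores=1 yields one
chip-id per core, 0 through 3, straddling the two nodes (the output above).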