On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote:
>
>
> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:
> > > On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> > > > On 3/25/21 3:10 AM, David Gibson wrote:
> > > > > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > On 3/22/21 10:03 PM, David Gibson wrote:
> > > > > > > On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza
> > > > > > > wrote:
> > > > > > > > Kernel commit 4bce545903fa ("powerpc/topology: Update
> > > > > > > > topology_core_cpumask") cause a regression in the pseries machine when
> > > > > > > > defining certain SMP topologies [1]. The reasoning behind the change is
> > > > > > > > explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> > > > > > > > cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with
> > > > > > > > large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as
> > > > > > > > far as the kernel understanding of SMP topologies goes, both masks are
> > > > > > > > equivalent.
> > > > > > > >
> > > > > > > > Further discussions in the kernel mailing list [2] shown that the
> > > > > > > > powerpc kernel always considered that the number of sockets were equal
> > > > > > > > to the number of NUMA nodes. The claim is that it doesn't make sense,
> > > > > > > > for Power hardware at least, 2+ sockets being in the same NUMA node.
> > > > > > > > The immediate conclusion is that all SMP topologies the pseries
> > > > > > > > machine were supplying to the kernel, with more than one socket in
> > > > > > > > the same NUMA node as in [1], happened to be correctly represented
> > > > > > > > in the kernel by accident during all these years.
> > > > > > > >
> > > > > > > > There's a case to be made for virtual topologies being detached from
> > > > > > > > hardware constraints, allowing maximum flexibility to users. At the same
> > > > > > > > time, this freedom can't result in unrealistic hardware representations
> > > > > > > > being emulated. If the real hardware and the pseries kernel don't
> > > > > > > > support multiple chips/sockets in the same NUMA node, neither should we.
> > > > > > > >
> > > > > > > > Starting in 6.0.0, all sockets must match an unique NUMA node in the
> > > > > > > > pseries machine. qtest changes were made to adapt to this new
> > > > > > > > condition.
> > > > > > >
> > > > > > > Oof. I really don't like this idea. It means a bunch of fiddly work
> > > > > > > for users to match these up, for no real gain. I'm also concerned
> > > > > > > that this will require follow on changes in libvirt to not make this a
> > > > > > > really cryptic and irritating point of failure.
> > > > > >
> > > > > > Haven't though about required Libvirt changes, although I can say that there
> > > > > > will be some amount to be mande and it will probably annoy existing users
> > > > > > (everyone that has a multiple socket per NUMA node topology).
> > > > > >
> > > > > > There is not much we can do from the QEMU layer aside from what I've proposed
> > > > > > here.
> > > > > > The other alternative is to keep interacting with the kernel folks to
> > > > > > see if there is a way to keep our use case untouched.
> > > > >
> > > > > Right. Well.. not necessarily untouched, but I'm hoping for more
> > > > > replies from Cédric to my objections and mpe's. Even with sockets
> > > > > being a kinda meaningless concept in PAPR, I don't think tying it to
> > > > > NUMA nodes makes sense.
> > > >
> > > > I did a couple of replies in different email threads but maybe not
> > > > to all. I felt it was going nowhere :/ Couple of thoughts,
> > >
> > > I think I saw some of those, but maybe not all.
> > >
> > > > Shouldn't we get rid of the socket concept, die also, under pseries
> > > > since they don't exist under PAPR ? We only have numa nodes, cores,
> > > > threads AFAICT.
> > >
> > > Theoretically, yes. I'm not sure it's really practical, though, since
> > > AFAICT, both qemu and the kernel have the notion of sockets (though
> > > not dies) built into generic code.
> >
> > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc"
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> >
> > > It does mean that one possible approach here - maybe the best one - is
> > > to simply declare that sockets are meaningless under, so we simply
> > > don't expect what the guest kernel reports to match what's given to
> > > qemu.
> > >
> > > It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> > > but it is likely to surprise and confuse people.
> > >
> > > > Should we diverged from PAPR and add extra DT properties "qemu,..." ?
> > > > There are a couple of places where Linux checks for the underlying
> > > > hypervisor already.
> > > >
> > > >
> > > > > > This also means that
> > > > > > 'ibm,chip-id' will probably remain in use since it's the only place
> > > > > > where we inform cores per socket information to the kernel.
> > > > >
> > > > > Well..
> > > > > unless we can find some other sensible way to convey that
> > > > > information. I haven't given up hope for that yet.
> > > >
> > > > Well, we could start by fixing the value in QEMU. It is broken
> > > > today.
> > >
> > > Fixing what value, exactly?
> >
> > The value of the "ibm,chip-id" since we are keeping the property under
> > QEMU.
>
> David, I believe this has to do with the discussing we had last Friday.
>
> I mentioned that the ibm,chip-id property is being calculated in a way that
> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> e.g.:
>
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>
>
> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
>
> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> different NUMA node than 0-1. This would mean that the same socket
> would belong to different NUMA nodes at the same time.

Right... and I'm still not seeing why that's a problem. AFAICT that's
a possible, if unexpected, situation under real hardware - though
maybe not for POWER9 specifically.

> I believe this is what Cedric wants to be addressed. Given that the
> property is called after the OPAL property ibm,chip-id, the kernel
> expects that the property will have the same semantics as in OPAL.

Even on powernv, I'm not clear why chip-id is tied into the NUMA
configuration, rather than getting all the NUMA info from
associativity properties.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson