On Mon, 29 Mar 2021 15:32:37 -0300 Daniel Henrique Barboza <danielhb...@gmail.com> wrote:
> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > On 3/29/21 6:20 AM, David Gibson wrote:
> >> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> >>> On 3/25/21 3:10 AM, David Gibson wrote:
> >>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> >>>>>
> >>>>> On 3/22/21 10:03 PM, David Gibson wrote:
> >>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> >>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update
> >>>>>>> topology_core_cpumask") caused a regression in the pseries machine
> >>>>>>> when defining certain SMP topologies [1]. The reasoning behind the
> >>>>>>> change is explained in kernel commit 4ca234a9cbd7 ("powerpc/smp:
> >>>>>>> Stop updating cpu_core_mask"). In short, the cpu_core_mask logic
> >>>>>>> was causing trouble with large VMs with lots of CPUs and was
> >>>>>>> replaced by cpu_cpu_mask because, as far as the kernel's
> >>>>>>> understanding of SMP topologies goes, both masks are equivalent.
> >>>>>>>
> >>>>>>> Further discussions on the kernel mailing list [2] showed that the
> >>>>>>> powerpc kernel has always considered the number of sockets to be
> >>>>>>> equal to the number of NUMA nodes. The claim is that it doesn't
> >>>>>>> make sense, for Power hardware at least, for 2+ sockets to be in
> >>>>>>> the same NUMA node. The immediate conclusion is that all the SMP
> >>>>>>> topologies the pseries machine was supplying to the kernel, with
> >>>>>>> more than one socket in the same NUMA node as in [1], happened to
> >>>>>>> be correctly represented in the kernel by accident during all
> >>>>>>> these years.
> >>>>>>>
> >>>>>>> There's a case to be made for virtual topologies being detached
> >>>>>>> from hardware constraints, allowing maximum flexibility to users.
> >>>>>>> At the same time, this freedom can't result in unrealistic
> >>>>>>> hardware representations being emulated.
> >>>>>>> If the real hardware and the pseries kernel don't support
> >>>>>>> multiple chips/sockets in the same NUMA node, neither should we.
> >>>>>>>
> >>>>>>> Starting in 6.0.0, all sockets must match a unique NUMA node in
> >>>>>>> the pseries machine. qtest changes were made to adapt to this new
> >>>>>>> condition.
> >>>>>>
> >>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work
> >>>>>> for users to match these up, for no real gain. I'm also concerned
> >>>>>> that this will require follow-on changes in libvirt to not make
> >>>>>> this a really cryptic and irritating point of failure.
> >>>>>
> >>>>> I haven't thought about the required libvirt changes, although I can
> >>>>> say that there will be some amount to be made, and it will probably
> >>>>> annoy existing users (everyone that has a topology with multiple
> >>>>> sockets per NUMA node).
> >>>>>
> >>>>> There is not much we can do from the QEMU layer aside from what I've
> >>>>> proposed here. The other alternative is to keep interacting with the
> >>>>> kernel folks to see if there is a way to keep our use case
> >>>>> untouched.
> >>>>
> >>>> Right. Well.. not necessarily untouched, but I'm hoping for more
> >>>> replies from Cédric to my objections and mpe's. Even with sockets
> >>>> being a kinda meaningless concept in PAPR, I don't think tying it to
> >>>> NUMA nodes makes sense.
> >>>
> >>> I did a couple of replies in different email threads but maybe not to
> >>> all. I felt it was going nowhere :/ A couple of thoughts:
> >>
> >> I think I saw some of those, but maybe not all.
> >>
> >>> Shouldn't we get rid of the socket concept, and die as well, under
> >>> pseries, since they don't exist under PAPR? We only have NUMA nodes,
> >>> cores, and threads, AFAICT.
> >>
> >> Theoretically, yes. I'm not sure it's really practical, though, since
> >> AFAICT both QEMU and the kernel have the notion of sockets (though not
> >> dies) built into generic code.
> >
> > Yes.
> > But, AFAICT, these topology notions have not reached "arch/powerpc",
> > and PPC Linux only has a NUMA node id, on pseries and powernv.
> >
> >> It does mean that one possible approach here - maybe the best one - is
> >> to simply declare that sockets are meaningless under PAPR, so we
> >> simply don't expect what the guest kernel reports to match what's
> >> given to qemu.
> >>
> >> It'd be nice to avoid that if we can: in a sense it's just cosmetic,
> >> but it is likely to surprise and confuse people.
> >>
> >>> Should we diverge from PAPR and add extra DT properties "qemu,..."?
> >>> There are a couple of places where Linux checks for the underlying
> >>> hypervisor already.
> >>>
> >>>>> This also means that 'ibm,chip-id' will probably remain in use,
> >>>>> since it's the only place where we convey cores-per-socket
> >>>>> information to the kernel.
> >>>>
> >>>> Well.. unless we can find some other sensible way to convey that
> >>>> information. I haven't given up hope for that yet.
> >>>
> >>> Well, we could start by fixing the value in QEMU. It is broken today.
> >>
> >> Fixing what value, exactly?
> >
> > The value of "ibm,chip-id", since we are keeping the property under
> > QEMU.
>
> David, I believe this has to do with the discussion we had last Friday.
>
> I mentioned that the ibm,chip-id property is being calculated in a way
> that assigns the same ibm,chip-id to CPUs that belong to different NUMA
> nodes, e.g.:
>
> -smp 4,cores=4,maxcpus=8,threads=1 \
> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

Just a heads up: 'cpus=' is going away (the deprecation patch got lost
during 6.0, but I plan on pushing it during the 6.1 devel window). You can
use '-numa cpus,node-id=a,core-id=b' instead.

> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
> --
> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
> ibm,pft-size = <0x00 0x19>;
> ibm,chip-id = <0x00>;
>
> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> different NUMA node than 0-1. This would mean that the same socket
> belongs to different NUMA nodes at the same time.
>
> I believe this is what Cédric wants to be addressed. Given that the
> property is named after the OPAL property ibm,chip-id, the kernel
> expects that it will have the same semantics as in OPAL.
>
> Thanks,
>
> DHB
>
> >>> This is all coming from some work we did last year to evaluate our
> >>> HW (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and
> >>> PowerVM. We saw some real problems because Linux did not have a clear
> >>> view of the topology. See the figures here:
> >>>
> >>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-...@kaod.org/
> >>>
> >>> The node id is a key parameter for system resource management, memory
> >>> allocation, interrupt affinity, etc.
> >>> Linux scales much better if it is used correctly.
> >>
> >> Well, sure. And we have all the ibm,associativity stuff to convey the
> >> node ids to the guest (which has its own problems, but none that are
> >> relevant here). What's throwing me is why getting node IDs correct has
> >> anything to do with socket numbers.
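To make the mismatch Daniel describes concrete, here is a small Python sketch. It is an illustration, not QEMU's actual code: it assumes ibm,chip-id is derived as cpu_index / (cores * threads), which is consistent with the dtc output he quotes, and checks which NUMA nodes each chip ends up spanning under the -smp/-numa layout from his example:

```python
# Sketch (assumption, not QEMU source): ibm,chip-id derived as
# cpu_index // (cores * threads), matching the dtc output in the thread.
def chip_id(cpu_index, cores=4, threads=1):
    return cpu_index // (cores * threads)

# NUMA placement from the example command line:
#   node 0: cpus 0-1 and 4-5; node 1: cpus 2-3 and 6-7
node_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 0, 6: 1, 7: 1}

# Group CPUs by chip and collect the set of NUMA nodes each chip touches.
chips = {}
for cpu, node in node_of.items():
    chips.setdefault(chip_id(cpu), set()).add(node)

# Each chip spans both nodes, i.e. one "socket" straddles two NUMA nodes.
print(chips)
```

With -smp 4,cores=4,maxcpus=8,threads=1, CPUs 0-3 all get chip-id 0 even though the -numa options place CPUs 2-3 in node 1, reproducing the inconsistency visible in the device tree dump above.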