On 3/29/21 6:20 AM, David Gibson wrote: > On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: >> On 3/25/21 3:10 AM, David Gibson wrote: >>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/22/21 10:03 PM, David Gibson wrote: >>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>>>>> topology_core_cpumask") cause a regression in the pseries machine when >>>>>> defining certain SMP topologies [1]. The reasoning behind the change is >>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>>>>> far as the kernel understanding of SMP topologies goes, both masks are >>>>>> equivalent. >>>>>> >>>>>> Further discussions in the kernel mailing list [2] shown that the >>>>>> powerpc kernel always considered that the number of sockets were equal >>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>>>>> immediate conclusion is that all SMP topologies the pseries machine were >>>>>> supplying to the kernel, with more than one socket in the same NUMA node >>>>>> as in [1], happened to be correctly represented in the kernel by >>>>>> accident during all these years. >>>>>> >>>>>> There's a case to be made for virtual topologies being detached from >>>>>> hardware constraints, allowing maximum flexibility to users. At the same >>>>>> time, this freedom can't result in unrealistic hardware representations >>>>>> being emulated. If the real hardware and the pseries kernel don't >>>>>> support multiple chips/sockets in the same NUMA node, neither should we. >>>>>> >>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>>>>> pseries machine. qtest changes were made to adapt to this new >>>>>> condition. >>>>> >>>>> Oof. I really don't like this idea. It means a bunch of fiddly work >>>>> for users to match these up, for no real gain. I'm also concerned >>>>> that this will require follow on changes in libvirt to not make this a >>>>> really cryptic and irritating point of failure. >>>> >>>> Haven't though about required Libvirt changes, although I can say that >>>> there >>>> will be some amount to be mande and it will probably annoy existing users >>>> (everyone that has a multiple socket per NUMA node topology). >>>> >>>> There is not much we can do from the QEMU layer aside from what I've >>>> proposed >>>> here. The other alternative is to keep interacting with the kernel folks to >>>> see if there is a way to keep our use case untouched. >>> >>> Right. Well.. not necessarily untouched, but I'm hoping for more >>> replies from Cédric to my objections and mpe's. Even with sockets >>> being a kinda meaningless concept in PAPR, I don't think tying it to >>> NUMA nodes makes sense. >> >> I did a couple of replies in different email threads but maybe not >> to all. I felt it was going nowhere :/ Couple of thoughts, > > I think I saw some of those, but maybe not all. > >> Shouldn't we get rid of the socket concept, die also, under pseries >> since they don't exist under PAPR ? We only have numa nodes, cores, >> threads AFAICT. > > Theoretically, yes. I'm not sure it's really practical, though, since > AFAICT, both qemu and the kernel have the notion of sockets (though > not dies) built into generic code.
Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" and PPC Linux only has a NUMA node id, on pseries and powernv. > It does mean that one possible approach here - maybe the best one - is > to simply declare that sockets are meaningless under, so we simply > don't expect what the guest kernel reports to match what's given to > qemu. > > It'd be nice to avoid that if we can: in a sense it's just cosmetic, > but it is likely to surprise and confuse people. > >> Should we diverged from PAPR and add extra DT properties "qemu,..." ? >> There are a couple of places where Linux checks for the underlying >> hypervisor already. >> >>>> This also means that >>>> 'ibm,chip-id' will probably remain in use since it's the only place where >>>> we inform cores per socket information to the kernel. >>> >>> Well.. unless we can find some other sensible way to convey that >>> information. I haven't given up hope for that yet. >> >> Well, we could start by fixing the value in QEMU. It is broken >> today. > > Fixing what value, exactly? The value of the "ibm,chip-id" since we are keeping the property under QEMU. >> This is all coming from some work we did last year to evaluate our HW >> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. >> We saw some real problems because Linux did not have a clear view of the >> topology. See the figures here : >> >> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-...@kaod.org/ >> >> The node id is a key parameter for system resource management, memory >> allocation, interrupt affinity, etc. Linux scales much better if used >> correctly. > > Well, sure. And we have all the ibm,associativity stuff to convey the > node ids to the guest (which has its own problems, but not that are > relevant here). What's throwing me is why getting node IDs correct > has anything to do with socket numbers.