On Fri, 25 Feb 2022 16:41:43 +0800 Gavin Shan <gs...@redhat.com> wrote:
> Hi Igor,
> 
> On 2/17/22 10:14 AM, Gavin Shan wrote:
> > On 1/26/22 5:14 PM, Igor Mammedov wrote:
> >> On Wed, 26 Jan 2022 13:24:10 +0800
> >> Gavin Shan <gs...@redhat.com> wrote:
> >>
> >>> The default CPU-to-NUMA association is given by
> >>> mc->get_default_cpu_node_id() when it isn't provided explicitly.
> >>> However, the CPU topology isn't fully considered in the default
> >>> association and it causes CPU topology broken warnings on booting
> >>> the Linux guest.
> >>>
> >>> For example, the following warning messages are observed when the
> >>> Linux guest is booted with the following command lines.
> >>>
> >>>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >>>   -accel kvm -machine virt,gic-version=host \
> >>>   -cpu host \
> >>>   -smp 6,sockets=2,cores=3,threads=1 \
> >>>   -m 1024M,slots=16,maxmem=64G \
> >>>   -object memory-backend-ram,id=mem0,size=128M \
> >>>   -object memory-backend-ram,id=mem1,size=128M \
> >>>   -object memory-backend-ram,id=mem2,size=128M \
> >>>   -object memory-backend-ram,id=mem3,size=128M \
> >>>   -object memory-backend-ram,id=mem4,size=128M \
> >>>   -object memory-backend-ram,id=mem5,size=384M \
> >>>   -numa node,nodeid=0,memdev=mem0 \
> >>>   -numa node,nodeid=1,memdev=mem1 \
> >>>   -numa node,nodeid=2,memdev=mem2 \
> >>>   -numa node,nodeid=3,memdev=mem3 \
> >>>   -numa node,nodeid=4,memdev=mem4 \
> >>>   -numa node,nodeid=5,memdev=mem5
> >>>     :
> >>>   alternatives: patching kernel code
> >>>   BUG: arch topology borken
> >>>   the CLS domain not a subset of the MC domain
> >>>   <the above error log repeats>
> >>>   BUG: arch topology borken
> >>>   the DIE domain not a subset of the NODE domain
> >>>
> >>> With the current implementation of mc->get_default_cpu_node_id(),
> >>> CPU#0 to CPU#5 are associated with NODE#0 to NODE#5 separately.
> >>> That's incorrect because CPU#0/1/2 should be associated with the
> >>> same NUMA node since they're seated in the same socket.
> >>>
> >>> This fixes the issue by considering the socket when the default
> >>> CPU-to-NUMA association is given. With this applied, no more CPU
> >>> topology broken warnings are seen from the Linux guest. The 6 CPUs
> >>> are associated with NODE#0/1, but there are no CPUs associated
> >>> with NODE#2/3/4/5.
> >>
> >> From migration point of view it looks fine to me, and doesn't need a
> >> compat knob since NUMA data (on virt-arm) is only used to construct
> >> ACPI tables (and we don't version those unless something is broken
> >> by it).
> >>
> >>> Signed-off-by: Gavin Shan <gs...@redhat.com>
> >>> ---
> >>>   hw/arm/virt.c | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >>> index 141350bf21..b4a95522d3 100644
> >>> --- a/hw/arm/virt.c
> >>> +++ b/hw/arm/virt.c
> >>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
> >>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >>>   {
> >>> -    return idx % ms->numa_state->num_nodes;
> >>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> >>
> >> I'd like for ARM folks to confirm whether the above is correct
> >> (i.e. socket is the NUMA node boundary, and also whether the above
> >> topo vars could have odd values.
> >> Don't look at horribly complicated x86 as an example, but it showed
> >> that vendors could stash pretty much anything there, so we should
> >> consider it here as well and maybe forbid that in the virt-arm smp
> >> parser)
> >>
> > 
> > After doing some investigation, I don't think the socket is
> > necessarily the NUMA node boundary. Unfortunately, I didn't find it
> > documented like this anywhere after checking the device-tree
> > specification and the Linux CPU topology and NUMA binding documents.
> > 
> > However, there are two options here according to the Linux (guest)
> > kernel code: (A) the socket is the NUMA node boundary; (B) the CPU
> > die is the NUMA node boundary. They are equivalent as the CPU die
> > isn't supported on the arm/virt machine. Besides, the topology of a
> > one-to-one association between socket and NUMA node sounds natural
> > and simple. So I think (A) is the best way to go.
> > 
> > Another thing I want to explain here is how the changes affect the
> > memory allocation in the Linux guest. Taking the command lines
> > included in the commit log as an example, the first two NUMA nodes
> > are bound to CPUs while the other 4 NUMA nodes are regarded as remote
> > NUMA nodes to the CPUs. The remote NUMA nodes won't accommodate
> > memory allocations until the memory in the near (local) NUMA node
> > becomes exhausted. However, it's uncertain how the memory is hosted
> > if memory binding isn't applied.
> > 
> > Besides, I think the code should be improved like below to avoid
> > overflow on ms->numa_state->num_nodes.
> > 
> >   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >   {
> > -     return idx % ms->numa_state->num_nodes;
> > +     int node_idx;
> > +
> > +     node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> > +     return node_idx % ms->numa_state->num_nodes;

Using idx directly to deduce the node looks a bit iffy.
Take x86_get_default_cpu_node_id() as an example: it uses idx to pick
the arch_id (APIC ID), which has the topology encoded into it, and then
translates that to a node boundary (pkg_id -> socket).
Probably the same should happen here; a rough (untested) sketch of what
I mean is at the end of this mail.

PS:
Maybe a little tangential to the topic, but the chunk above mentions
dies/clusters/cores/threads as possible attributes for CPUs, while
virt_possible_cpu_arch_ids() says that only has_thread_id = true is
supported, which looks broken to me.

> >   }
> > 
> > Kindly ping...
> 
> >>>   }
> >>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
> >>
> 
> Thanks,
> Gavin
> 
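
Below is a rough, untested sketch of the direction I mean. It relies on
props.socket_id being filled in by virt_possible_cpu_arch_ids() (which
it currently isn't, see the PS above), so treat it as an illustration
only, not a tested patch:

  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
  {
      const CPUArchIdList *possible_cpus = ms->possible_cpus;

      assert(idx < possible_cpus->len);
      /* map the CPU's socket to a NUMA node instead of using the raw index */
      return possible_cpus->cpus[idx].props.socket_id %
             ms->numa_state->num_nodes;
  }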
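
And virt_possible_cpu_arch_ids() would have to expose the socket it
derives from the -smp geometry, roughly like below (again only a
sketch; whether the socket really is the right NUMA boundary, and how
dies/clusters should factor in, is for the ARM folks to confirm):

  for (n = 0; n < ms->possible_cpus->len; n++) {
      ...
      /* advertise the socket so get_default_cpu_node_id() can use it */
      ms->possible_cpus->cpus[n].props.has_socket_id = true;
      ms->possible_cpus->cpus[n].props.socket_id =
          n / (ms->smp.dies * ms->smp.clusters *
               ms->smp.cores * ms->smp.threads);
      ms->possible_cpus->cpus[n].props.has_thread_id = true;
      ms->possible_cpus->cpus[n].props.thread_id = n;
  }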