On 06/24/2014 01:08 PM, Nishanth Aravamudan wrote:
> On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
>> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
>>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
>>>> Current QEMU does not support memoryless NUMA nodes.
>>>> This prepares SPAPR for that.
>>>>
>>>> This moves 2 calls of spapr_populate_memory_node() into
>>>> the existing loop which handles nodes other than
>>>> the first one.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>> ---
>>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
>>>>  1 file changed, 11 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index cb3a10a..666b676 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>>>>  
>>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
>>>>  {
>>>> -    hwaddr node0_size, mem_start, node_size;
>>>> +    hwaddr mem_start, node_size;
>>>>      int i;
>>>>  
>>>> -    /* memory node(s) */
>>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
>>>> -        node0_size = node_mem[0];
>>>> -    } else {
>>>> -        node0_size = ram_size;
>>>> -    }
>>>> -
>>>> -    /* RMA */
>>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
>>>> -
>>>> -    /* RAM: Node 0 */
>>>> -    if (node0_size > spapr->rma_size) {
>>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
>>>> -                                   node0_size - spapr->rma_size);
>>>> -    }
>>>> -
>>>> -    /* RAM: Node 1 and beyond */
>>>> -    mem_start = node0_size;
>>>> -    for (i = 1; i < nb_numa_nodes; i++) {
>>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
>>>> +        if (!node_mem[i]) {
>>>> +            continue;
>>>> +        }
>>>
>>> Doesn't this skip memoryless nodes? What actually puts the memoryless
>>> node in the device-tree?
>>
>> It does skip.
>>
>>> And if you were to put them in, wouldn't spapr_populate_memory_node()
>>> fail because we'd be creating two nodes with memory@XXX where XXX is
>>> the same (starting address) for both?
>>
>> I cannot do this now - there is no way to tell from the command line
>> where I want a NUMA node's memory to start, so I'll end up with
>> multiple nodes with the same name and QEMU won't start. When the NUMA
>> fixes reach upstream, I'll try to work out something on top of that.
>
> Ah, I got something here. With the patches I just sent to enable sparse
> NUMA nodes, plus your series rebased on top, here's what I see in a
> Linux LPAR:
>
> qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 \
>     -realtime mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 \
>     -numa node,nodeid=2,mem=0,cpus=0-1 -smp 4
>
> (qemu) info numa
> 2 nodes
> node 2 cpus: 0 1
> node 2 size: 0 MB
> node 3 cpus: 2 3
> node 3 size: 4096 MB
>
> # numactl --hardware
> available: 3 nodes (0,2-3)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1
> node 2 size: 0 MB
> node 2 free: 0 MB
> node 3 cpus: 2 3
> node 3 size: 4073 MB
> node 3 free: 3830 MB
> node distances:
> node   0   2   3
>   0:  10  40  40
>   2:  40  10  40
>   3:  40  40  10
>
> The trick, it seems, is that if you have a memoryless node, it needs
> to have CPUs assigned to it.
Yep. The device tree does not have NUMA nodes as such; it only has CPUs
and memory@xxx nodes (memory banks?), and the guest kernel has to parse
ibm,associativity and reconstruct the NUMA topology. If a node is not
mentioned in any ibm,associativity property, it does not exist.

> The CPU's "ibm,associativity" property will
> make Linux set up the proper NUMA topology.
>
> Thoughts? Should there be a check that every "present" NUMA node at
> least has either CPUs or memory?

Maybe; I'll wait for the NUMA stuff to land upstream, apply your
patch(es) and mine, and see what I get :)

> It seems like if neither are present,
> we can just hotplug them later?

Hotplug what? NUMA nodes?

> Does qemu support topology for PCI devices?

Nope.

--
Alexey
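P.S. A purely illustrative sketch of the "a node exists only if something references it" rule above (this is my own toy code, not QEMU or kernel code; the helper name and the fixed node limit are invented). On sPAPR the last cell of each ibm,associativity array is the NUMA node id, so reconstructing the set of existing nodes amounts to collecting every referenced id:

```c
#include <stddef.h>

#define MAX_NODES 8  /* arbitrary limit for this sketch */

/*
 * Hypothetical helper: given the last ibm,associativity cell (the NUMA
 * node id) of every CPU and memory@xxx node in the device tree, mark
 * which NUMA nodes exist for the guest. A node id that is never
 * referenced by any ibm,associativity simply does not exist.
 */
static void collect_present_nodes(const int *assoc_node_ids, size_t n,
                                  int present[MAX_NODES])
{
    for (size_t i = 0; i < n; i++) {
        int node = assoc_node_ids[i];
        if (node >= 0 && node < MAX_NODES) {
            present[node] = 1;
        }
    }
}
```

With the command line from your example (CPUs 0-1 referencing node 2, CPUs 2-3 and the memory referencing node 3), only nodes 2 and 3 would be marked present; a node that nothing references never shows up.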