On 24.06.2014 [16:14:11 +1000], Alexey Kardashevskiy wrote:
> On 06/24/2014 01:08 PM, Nishanth Aravamudan wrote:
> > On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> >> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> >>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >>>> Current QEMU does not support memoryless NUMA nodes.
> >>>> This prepares SPAPR for that.
> >>>>
> >>>> This moves 2 calls of spapr_populate_memory_node() into
> >>>> the existing loop which handles nodes other than the
> >>>> first one.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>>> ---
> >>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>>>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index cb3a10a..666b676 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>>>  
> >>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>>>  {
> >>>> -    hwaddr node0_size, mem_start, node_size;
> >>>> +    hwaddr mem_start, node_size;
> >>>>      int i;
> >>>>  
> >>>> -    /* memory node(s) */
> >>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >>>> -        node0_size = node_mem[0];
> >>>> -    } else {
> >>>> -        node0_size = ram_size;
> >>>> -    }
> >>>> -
> >>>> -    /* RMA */
> >>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >>>> -
> >>>> -    /* RAM: Node 0 */
> >>>> -    if (node0_size > spapr->rma_size) {
> >>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >>>> -                                   node0_size - spapr->rma_size);
> >>>> -    }
> >>>> -
> >>>> -    /* RAM: Node 1 and beyond */
> >>>> -    mem_start = node0_size;
> >>>> -    for (i = 1; i < nb_numa_nodes; i++) {
> >>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >>>> +        if (!node_mem[i]) {
> >>>> +            continue;
> >>>> +        }
> >>>
> >>> Doesn't this skip memoryless nodes? What actually puts the memoryless
> >>> node in the device-tree?
> >>
> >> It does skip.
> >>
> >>> And if you were to put them in, wouldn't spapr_populate_memory_node()
> >>> fail because we'd be creating two nodes with memory@XXX where XXX is
> >>> the same (starting address) for both?
> >>
> >> I cannot do this now - there is no way to tell from the command line
> >> where I want NUMA node memory to start from, so I'll end up with
> >> multiple nodes with the same name and QEMU won't start. When the NUMA
> >> fixes reach upstream, I'll try to work out something on top of that.
> >
> > Ah, I got something here. With the patches I just sent to enable sparse
> > NUMA nodes, plus your series rebased on top, here's what I see in a
> > Linux LPAR:
> >
> > qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -realtime
> > mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 -numa
> > node,nodeid=2,mem=0,cpus=0-1 -smp 4
> >
> > info numa
> > 2 nodes
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4096 MB
> >
> > numactl --hardware
> > available: 3 nodes (0,2-3)
> > node 0 cpus:
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 2 free: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4073 MB
> > node 3 free: 3830 MB
> > node distances:
> > node   0   2   3
> >   0:  10  40  40
> >   2:  40  10  40
> >   3:  40  40  10
> >
> > The trick, it seems, is that if you have a memoryless node, it needs to
> > have CPUs assigned to it.
>
> Yep. The device tree does not have NUMA nodes, it only has CPUs and
> memory@xxx (memory banks?) and the guest kernel has to parse
> ibm,associativity and reconstruct the NUMA topology. If some node is not
> mentioned in any ibm,associativity, it does not exist.
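
For reference, that per-bank node assignment is what
spapr_populate_memory_node() writes out. Roughly the following -- a sketch
from memory against the libfdt API, with QEMU's _FDT() error checking
elided:

    static void populate_memory_node(void *fdt, int nodeid, hwaddr start,
                                     hwaddr size)
    {
        /* First cell is the number of entries that follow; the last
         * entry is the NUMA node id the guest kernel parses back out. */
        uint32_t associativity[] = { cpu_to_be32(0x4), cpu_to_be32(0x0),
                                     cpu_to_be32(0x0), cpu_to_be32(0x0),
                                     cpu_to_be32(nodeid) };
        uint64_t mem_reg[] = { cpu_to_be64(start), cpu_to_be64(size) };
        char mem_name[32];
        int off;

        /* One memory@<start> bank per call; "reg" carries base + size. */
        sprintf(mem_name, "memory@" TARGET_FMT_lx, start);
        off = fdt_add_subnode(fdt, 0, mem_name);
        fdt_setprop_string(fdt, off, "device_type", "memory");
        fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg));
        fdt_setprop(fdt, off, "ibm,associativity", associativity,
                    sizeof(associativity));
    }

So a node id that never shows up in the last cell of one of these (or of a
CPU's ibm,associativity) simply doesn't exist as far as the guest is
concerned.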
Yep, that all makes sense, but we need something (I think) to handle this
kind of command line, even if it's just a warning/error:

qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -numa
node,nodeid=3,mem=4096,cpus=0-3 -numa node,nodeid=2,mem=0 -smp 4

info numa
2 nodes
node 2 cpus:
node 2 size: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4096 MB

numactl --hardware
available: 2 nodes (0,3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4076 MB
node 3 free: 3864 MB
node distances:
node   0   3
  0:  10  40
  3:  40  10

A pathological case, obviously, but it's pretty trivial to enforce some
sanity here, I think (see the sketch at the end of this mail).

> > The CPU's "ibm,associativity" property will
> > make Linux set up the proper NUMA topology.
> >
> > Thoughts? Should there be a check that every "present" NUMA node at
> > least either has CPUs or memory?
>
> Maybe. I'll wait for the NUMA stuff in upstream, apply your patch(es), my
> patches and see what I get :)

Ok, sounds good.

> > It seems like if neither are present,
> > we can just hotplug them later?
>
> hotplug what? NUMA nodes?

Well, this actually existed in practice, IIRC, with SGI's larger boxes (or
was planned at least). But I actually meant that when we hotplug in a CPU
or memory later, the appropriate topology should show up. I wonder if that
works, though, as under PowerVM such dynamically adjustable hardware is
described in the drconf property, not in memory@ or CPU@ nodes. Ah well,
cross that bridge when we get to it.

> > Does qemu support topology for PCI devices?
>
> Nope.

Ok, good to know -- as that's another place that can determine which NUMA
nodes are online/offline in Linux, I believe.

Thanks,
Nish
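
P.S. For concreteness, the sanity check I have in mind could be as simple
as the following. This is only a sketch: it assumes the current
nb_numa_nodes, node_mem[] and node_cpumask[] globals (which the pending
NUMA rework may well rename), and the helper name is made up:

    /* Hypothetical check, to run once the -numa options are parsed.
     * Rejects nodes that would be invisible to the guest because they
     * would never appear in any ibm,associativity property. */
    static void validate_numa_nodes(void)
    {
        int i;

        for (i = 0; i < nb_numa_nodes; i++) {
            if (node_mem[i] == 0 &&
                bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
                fprintf(stderr,
                        "qemu: NUMA node %d has neither memory nor CPUs\n",
                        i);
                exit(1);
            }
        }
    }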