On 24.06.2014 [16:14:11 +1000], Alexey Kardashevskiy wrote:
> On 06/24/2014 01:08 PM, Nishanth Aravamudan wrote:
> > On 21.06.2014 [13:06:53 +1000], Alexey Kardashevskiy wrote:
> >> On 06/21/2014 08:55 AM, Nishanth Aravamudan wrote:
> >>> On 16.06.2014 [17:53:49 +1000], Alexey Kardashevskiy wrote:
> >>>> Current QEMU does not support memoryless NUMA nodes.
> >>>> This prepares SPAPR for that.
> >>>>
> >>>> This moves 2 calls of spapr_populate_memory_node() into
> >>>> the existing loop which handles nodes other than the
> >>>> first one.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
> >>>> ---
> >>>>  hw/ppc/spapr.c | 31 +++++++++++--------------------
> >>>>  1 file changed, 11 insertions(+), 20 deletions(-)
> >>>>
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index cb3a10a..666b676 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -689,28 +689,13 @@ static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
> >>>>  
> >>>>  static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
> >>>>  {
> >>>> -    hwaddr node0_size, mem_start, node_size;
> >>>> +    hwaddr mem_start, node_size;
> >>>>      int i;
> >>>>  
> >>>> -    /* memory node(s) */
> >>>> -    if (nb_numa_nodes > 1 && node_mem[0] < ram_size) {
> >>>> -        node0_size = node_mem[0];
> >>>> -    } else {
> >>>> -        node0_size = ram_size;
> >>>> -    }
> >>>> -
> >>>> -    /* RMA */
> >>>> -    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
> >>>> -
> >>>> -    /* RAM: Node 0 */
> >>>> -    if (node0_size > spapr->rma_size) {
> >>>> -        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
> >>>> -                                   node0_size - spapr->rma_size);
> >>>> -    }
> >>>> -
> >>>> -    /* RAM: Node 1 and beyond */
> >>>> -    mem_start = node0_size;
> >>>> -    for (i = 1; i < nb_numa_nodes; i++) {
> >>>> +    for (i = 0, mem_start = 0; i < nb_numa_nodes; ++i) {
> >>>> +        if (!node_mem[i]) {
> >>>> +            continue;
> >>>> +        }
> >>>
> >>> Doesn't this skip memoryless nodes? What actually puts the memoryless
> >>> node in the device-tree?
> >>
> >> It does skip.
> >>
> >>> And if you were to put them in, wouldn't spapr_populate_memory_node()
> >>> fail because we'd be creating two nodes with memory@XXX where XXX is
> >>> the same (starting address) for both?
> >>
> >> I cannot do this now - there is no way to tell from the command line
> >> where I want NUMA node memory to start from, so I'll end up with
> >> multiple nodes with the same name and QEMU won't start. When the NUMA
> >> fixes reach upstream, I'll try to work out something on top of that.
> >
> > Ah, I got something here. With the patches I just sent to enable sparse
> > NUMA nodes, plus your series rebased on top, here's what I see in a
> > Linux LPAR:
> >
> > qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -realtime
> > mlock=off -numa node,nodeid=3,mem=4096,cpus=2-3 -numa
> > node,nodeid=2,mem=0,cpus=0-1 -smp 4
> >
> > info numa
> > 2 nodes
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4096 MB
> >
> > numactl --hardware
> > available: 3 nodes (0,2-3)
> > node 0 cpus:
> > node 0 size: 0 MB
> > node 0 free: 0 MB
> > node 2 cpus: 0 1
> > node 2 size: 0 MB
> > node 2 free: 0 MB
> > node 3 cpus: 2 3
> > node 3 size: 4073 MB
> > node 3 free: 3830 MB
> > node distances:
> > node   0   2   3
> >   0:  10  40  40
> >   2:  40  10  40
> >   3:  40  40  10
> >
> > The trick, it seems, is that if you have a memoryless node, it needs to
> > have CPUs assigned to it.
>
> Yep. The device tree does not have NUMA nodes, it only has CPUs and
> memory@xxx (memory banks?) and the guest kernel has to parse
> ibm,associativity and reconstruct the NUMA topology. If some node is not
> mentioned in any ibm,associativity, it does not exist.
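
For reference, that per-bank node assignment is what
spapr_populate_memory_node() writes out. Roughly the following -- a sketch
from memory against the libfdt API, with QEMU's _FDT() error checking
elided:

    static void populate_memory_node(void *fdt, int nodeid, hwaddr start,
                                     hwaddr size)
    {
        /* First cell is the number of entries that follow; the last
         * entry is the NUMA node id the guest kernel parses back out. */
        uint32_t associativity[] = { cpu_to_be32(0x4), cpu_to_be32(0x0),
                                     cpu_to_be32(0x0), cpu_to_be32(0x0),
                                     cpu_to_be32(nodeid) };
        uint64_t mem_reg[] = { cpu_to_be64(start), cpu_to_be64(size) };
        char mem_name[32];
        int off;

        /* One memory@<start> bank per call; "reg" carries base + size. */
        sprintf(mem_name, "memory@" TARGET_FMT_lx, start);
        off = fdt_add_subnode(fdt, 0, mem_name);
        fdt_setprop_string(fdt, off, "device_type", "memory");
        fdt_setprop(fdt, off, "reg", mem_reg, sizeof(mem_reg));
        fdt_setprop(fdt, off, "ibm,associativity", associativity,
                    sizeof(associativity));
    }

So a node id that never shows up in the last cell of one of these (or of a
CPU's ibm,associativity) simply doesn't exist as far as the guest is
concerned.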
Yep, that all makes sense, but we need something (I think) to handle this
kind of command line, even if it's just a warning/error:

qemu-system-ppc64 -machine pseries,accel=kvm,usb=off -m 4096 -numa
node,nodeid=3,mem=4096,cpus=0-3 -numa node,nodeid=2,mem=0 -smp 4

info numa
2 nodes
node 2 cpus:
node 2 size: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4096 MB

numactl --hardware
available: 2 nodes (0,3)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 3 cpus: 0 1 2 3
node 3 size: 4076 MB
node 3 free: 3864 MB
node distances:
node   0   3
  0:  10  40
  3:  40  10

A pathological case, obviously, but it's pretty trivial to enforce some
sanity here, I think (see the sketch at the end of this mail).

> > The CPU's "ibm,associativity" property will
> > make Linux set up the proper NUMA topology.
> >
> > Thoughts? Should there be a check that every "present" NUMA node at
> > least either has CPUs or memory?
>
> Maybe. I'll wait for the NUMA stuff in upstream, apply your patch(es), my
> patches and see what I get :)

Ok, sounds good.

> > It seems like if neither are present,
> > we can just hotplug them later?
>
> hotplug what? NUMA nodes?

Well, this actually existed in practice, IIRC, with SGI's larger boxes (or
was planned at least). But I actually meant that when we hotplug in a CPU
or memory later, the appropriate topology should show up. I wonder if that
works, though, as under PowerVM such dynamically adjustable hardware is
described in the drconf property, not in memory@ or CPU@ nodes. Ah well,
cross that bridge when we get to it.

> > Does qemu support topology for PCI devices?
>
> Nope.

Ok, good to know -- as that's another place that can determine which NUMA
nodes are online/offline in Linux, I believe.

Thanks,
Nish
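
P.S. For concreteness, the sanity check I have in mind could be as simple
as the following. This is only a sketch: it assumes the current
nb_numa_nodes, node_mem[] and node_cpumask[] globals (which the pending
NUMA rework may well rename), and the helper name is made up:

    /* Hypothetical check, to run once the -numa options are parsed.
     * Rejects nodes that would be invisible to the guest because they
     * would never appear in any ibm,associativity property. */
    static void validate_numa_nodes(void)
    {
        int i;

        for (i = 0; i < nb_numa_nodes; i++) {
            if (node_mem[i] == 0 &&
                bitmap_empty(node_cpumask[i], MAX_CPUMASK_BITS)) {
                fprintf(stderr,
                        "qemu: NUMA node %d has neither memory nor CPUs\n",
                        i);
                exit(1);
            }
        }
    }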