On Wed, Jun 10, 2015 at 11:43:19AM +0200, Igor Mammedov wrote:
> On Tue, 9 Jun 2015 09:40:54 -0300
> Eduardo Habkost <ehabk...@redhat.com> wrote:
> > On Tue, Jun 09, 2015 at 11:23:19AM +0200, Igor Mammedov wrote:
> > > On Mon, 8 Jun 2015 12:51:39 -0300
> > > Eduardo Habkost <ehabk...@redhat.com> wrote:
> > > > On Mon, Jun 08, 2015 at 11:51:03AM +0200, Igor Mammedov wrote:
> > > > > On Mon, 8 Jun 2015 11:28:18 +0530
> > > > > Bharata B Rao <bhar...@linux.vnet.ibm.com> wrote:
> > > > > > On Mon, May 25, 2015 at 02:42:40PM -0300, Eduardo Habkost wrote:
> > > > > > > On Mon, May 25, 2015 at 01:17:57PM +0530, Bharata B Rao wrote:
> > > > > > > > On Thu, May 14, 2015 at 11:39:06AM +0200, Paolo Bonzini wrote:
> > > > > > > > > On 13/05/2015 20:06, Eduardo Habkost wrote:
> > > > > > > > > > Also, this introduces a circular dependency between pc-dimm.c and numa.c. Instead of that, pc-dimm could simply notify us when a new device is realized (with just (addr, end, node) as arguments), so we can save the list of memory ranges inside struct node_info.
> > > > > > > > > >
> > > > > > > > > > I wonder if the memory API already provides something that would help us. Paolo, do you see a way we could simply use a MemoryRegion as input to lookup the NUMA node?
> > > > > > > > >
> > > > > > > > > No, but I guess you could add a numa_get/set_memory_region_node_id API that uses a hash table. That's a variant of the "pc-dimm could simply notify" numa.c that you propose above.
> > > > > > > >
> > > > > > > > While you say we can't use MemoryRegion as input to lookup the NUMA node, you suggest that we add numa_get/set_memory_region_node_id. Does this API get/set the NUMA node id for the given MemoryRegion?
> > > > > > >
> > > > > > > I was going to suggest that, but it would require changing the non-memdev code path to create a MemoryRegion for each node, too. So having a numa_set_mem_node_id(start_addr, end_addr, node_id) API would be simpler.
> > > > > >
> > > > > > In order to save the list of memory ranges inside node_info, I tried this approach where I call numa_set_mem_node_id(dimm.addr, dimm.size, dimm.node) from pc_dimm_realize(), but the value of dimm.addr is finalized only later in ->plug().
> > > > > >
> > > > > > So we would have to call this API from arch code like pc_dimm_plug(). Is that acceptable?
> > > >
> > > > It looks acceptable to me, as pc.c already has all the rest of the NUMA-specific code for PC. I believe it would be interesting to keep all numa.o dependencies contained inside machine code.
> > > >
> > > > > Could you query the pc-dimms' numa property each time you need the mapping, instead of additionally storing that mapping elsewhere?
> > > >
> > > > The original patch did that, but I suggested the numa_set_mem_node_id() API for two reasons: 1) not requiring special cases for hotplug inside numa_get_node(); 2) not introducing a circular dependency between pc-dimm.c and numa.c.
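To make that concrete, here is a rough standalone sketch of the range-tracking idea behind numa_set_mem_node_id()/numa_get_node(). The struct, the fixed-size table, and the exact signatures are illustrative assumptions for the example, not the actual QEMU implementation:

#include <stdint.h>
#include <stdio.h>

#define MAX_MEM_RANGES 256

typedef struct MemNodeRange {
    uint64_t start;
    uint64_t size;
    uint32_t node;
} MemNodeRange;

static MemNodeRange mem_ranges[MAX_MEM_RANGES];
static int nb_mem_ranges;

/* Record that [addr, addr + size) belongs to NUMA node 'node'. */
void numa_set_mem_node_id(uint64_t addr, uint64_t size, uint32_t node)
{
    if (nb_mem_ranges < MAX_MEM_RANGES) {
        mem_ranges[nb_mem_ranges++] = (MemNodeRange){ addr, size, node };
    }
}

/* Return the node owning 'addr', or -1 if no registered range covers it. */
int numa_get_node(uint64_t addr)
{
    for (int i = 0; i < nb_mem_ranges; i++) {
        if (addr >= mem_ranges[i].start &&
            addr - mem_ranges[i].start < mem_ranges[i].size) {
            return mem_ranges[i].node;
        }
    }
    return -1;
}

int main(void)
{
    /* Boot-time RAM registered by the machine: 2 GiB on node 0, 2 GiB on node 1. */
    numa_set_mem_node_id(0, 2ULL << 30, 0);
    numa_set_mem_node_id(2ULL << 30, 2ULL << 30, 1);

    /* A hotplugged DIMM registered later from the machine's plug handler. */
    numa_set_mem_node_id(4ULL << 30, 1ULL << 30, 1);

    printf("0x%llx -> node %d\n", 0x80000000ULL, numa_get_node(0x80000000ULL));
    printf("0x%llx -> node %d\n", (unsigned long long)(4ULL << 30),
           numa_get_node(4ULL << 30));
    return 0;
}

With something like this, numa_get_node() treats boot-time RAM and hotplugged DIMMs identically; the only requirement is that whoever assigns an address range also registers it.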
> > > What circular dependency would doing foreach(pc-dimm) introduce? So far pc-dimm is independent from numa.c and is a regular device with no dependencies (except for the backend memdev), and it has its own 'numa' property for providing that information to interested users. I'd rather keep it separate from the legacy numa.c "-numa" handling.
> >
> > pc-dimm.c already depends on numa.c because it checks nb_numa_nodes inside pc_dimm_realize().
>
> The check should be in pc_dimm_plug() instead of realize, but I guess it saves duplication when pc-dimm is reused with other targets; anyway, we could move it out into a generic common function.
Agreed.

> > I don't understand what you mean by "legacy numa.c -numa handling". Unless there's a way to query pc-dimm (or other code) for all (address -> numa_node) mappings without a special case for memory hotplug[1], I wouldn't call node_info[].node_mem "legacy".
> >
> > [1] And this is exactly what I want to provide with numa_set_mem_node_id(): an API that doesn't require special cases for memory hotplug.
>
> For x86, board makers usually define the address range -> node mapping statically, so node_info[].node_mem & numa_set_mem_node_id() make sense. And sPAPR could probably do the same; there would be no need to scan for pc-dimm devices.

OK, so we are on the same page now.

> > > [...]
> > > Why would one need to pull dimm properties into numa.c?
> >
> > I am not sure I parsed the question correctly, but:
> >
> > The original commit message explains why numa_get_node(addr) is needed:
> >
> > > > This is needed by sPAPR PowerPC to support the ibm,dynamic-reconfiguration-memory device tree node, which is needed for memory hotplug.
> >
> > And to make this work, it needs to be aware of NUMA information for hotplugged memory too.
>
> I've checked spapr_populate_drconf_memory() from the original series; it needs to be aware at startup of the address range -> node mapping, including the partitioning of the whole hotplug memory range (i.e. not just the actually hotplugged memory). -numa node_mem & numa_set_mem_node_id() are sufficient for this purpose.

Good. :)

> > My proposal is to do that inside the code that actually assigns addresses to pc-dimm: pc_dimm_realize() (pc.c). So pc-dimm.c and numa.c won't reference each other, and numa.c won't need any special code for hotplug.
>
> You've probably meant pc_dimm_plug() instead of pc_dimm_realize(), but it shouldn't call numa_set_mem_node_id(), since partitioning is static and is done at startup time.

Sorry, I meant pc_dimm_plug(). I didn't know partitioning was static. I thought it was dynamic and defined by the "node" property on pc-dimm. (But I am confused by what you say below.)

> > We could have common helpers later to avoid duplicating the same logic in other machines, but the point is that we don't need to make numa.c carry special memory hotplug code, and we don't need to make pc-dimm.c care about -numa.
>
> Agreed.
>
> We don't have static partitioning of the hotplug memory address space on the x86 target because it limits the flexibility of hot-plugging any amount of memory to any node. (But we could do both: static partitioning via the SRAT table, overridden by the _PXM method for dynamic assignment.)
>
> The issue with static partitioning is that mgmt tools have to know the address & size of the hotplug memory address space to partition it, but QEMU doesn't provide that info so far.

OK, so we agree about the numa_set_mem_node_id() API, but I am confused by what you said about static partitioning. Didn't you just say above that partitioning is static and done at startup time?

> > > > Having a numa_set_memory_region_node_id(MemoryRegion *mr, int node) API would probably be better, and make the discussion about pc_dimm.addr
> > > Wouldn't it be a layering violation if we put (frontend) nodeid information into a (backend) MemoryRegion?
> >
> > We wouldn't; it would be a numa.c function that would just keep track of the list of MemoryRegions for each NUMA node.
> I still don't get how the idea to use MemoryRegions applies to the above,

I was just considering getting a MemoryRegion pointer as argument instead of (addr, length); all the rest would be the same. But:

> you can have MemoryRegions for present memory, but there isn't any for memory that hasn't been plugged in yet.
> numa_set_mem_node_id() looks like the way to go, and a simple one at that.

I believe numa_get_node() will be expected to return info only for memory that was already plugged. But if some machines make it return valid info for unplugged memory also, it would be a nice extra feature. As we don't have separate MemoryRegions for the still-unplugged areas (yet?), your point seems to be valid.

--
Eduardo
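As a concrete illustration of the flow discussed above, a rough standalone sketch of the plug-time registration follows. DimmDevice and machine_plug_dimm() are made-up stand-ins for pc_dimm_plug() (pc.c) and the sPAPR plug path, not actual QEMU code:

#include <stdint.h>

typedef struct DimmDevice {
    uint64_t addr;   /* assigned by the machine during ->plug() */
    uint64_t size;
    uint32_t node;   /* the DIMM's "node" property */
} DimmDevice;

/* Provided by numa.c (see the earlier sketch). */
void numa_set_mem_node_id(uint64_t addr, uint64_t size, uint32_t node);

static void machine_plug_dimm(DimmDevice *dimm)
{
    /* ... address allocation and memory-region mapping happen first,
     * so dimm->addr is only final at this point ... */

    /* Record the (range -> node) mapping here; numa.c never has to know
     * about pc-dimm, and numa_get_node() needs no hotplug special case. */
    numa_set_mem_node_id(dimm->addr, dimm->size, dimm->node);
}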