On Wed, Jun 10, 2015 at 11:43:19AM +0200, Igor Mammedov wrote:
> On Tue, 9 Jun 2015 09:40:54 -0300
> Eduardo Habkost <ehabk...@redhat.com> wrote:
> > On Tue, Jun 09, 2015 at 11:23:19AM +0200, Igor Mammedov wrote:
> > > On Mon, 8 Jun 2015 12:51:39 -0300
> > > Eduardo Habkost <ehabk...@redhat.com> wrote:
> > > > On Mon, Jun 08, 2015 at 11:51:03AM +0200, Igor Mammedov wrote:
> > > > > On Mon, 8 Jun 2015 11:28:18 +0530
> > > > > Bharata B Rao <bhar...@linux.vnet.ibm.com> wrote:
> > > > > > On Mon, May 25, 2015 at 02:42:40PM -0300, Eduardo Habkost wrote:
> > > > > > > On Mon, May 25, 2015 at 01:17:57PM +0530, Bharata B Rao wrote:
> > > > > > > > On Thu, May 14, 2015 at 11:39:06AM +0200, Paolo Bonzini wrote:
> > > > > > > > > On 13/05/2015 20:06, Eduardo Habkost wrote:
> > > > > > > > > > Also, this introduces a circular dependency between pc-dimm.c and numa.c. Instead of that, pc-dimm could simply notify us when a new device is realized (with just (addr, end, node) as arguments), so we can save the list of memory ranges inside struct node_info.
> > > > > > > > > >
> > > > > > > > > > I wonder if the memory API already provides something that would help us. Paolo, do you see a way we could simply use a MemoryRegion as input to lookup the NUMA node?
> > > > > > > > >
> > > > > > > > > No, but I guess you could add a numa_get/set_memory_region_node_id API that uses a hash table. That's a variant of the "pc-dimm could simply notify" numa.c that you propose above.
> > > > > > > >
> > > > > > > > While you say we can't use MemoryRegion as input to lookup the NUMA node, you suggest that we add numa_get/set_memory_region_node_id. Does this API get/set the NUMA node id for the given MemoryRegion?
> > > > > > >
> > > > > > > I was going to suggest that, but it would require changing the non-memdev code path to create a MemoryRegion for each node, too. So having a numa_set_mem_node_id(start_addr, end_addr, node_id) API would be simpler.
> > > > > >
> > > > > > In order to save the list of memory ranges inside node_info, I tried this approach where I call numa_set_mem_node_id(dimm.addr, dimm.size, dimm.node) from pc_dimm_realize(), but the value of dimm.addr is finalized only later in ->plug().
> > > > > >
> > > > > > So we would have to call this API from arch code like pc_dimm_plug(). Is that acceptable?
> > > >
> > > > It looks acceptable to me, as pc.c already has all the rest of the NUMA-specific code for PC. I believe it would be interesting to keep all numa.o dependencies contained inside machine code.
> > > >
> > > > > Could you query the pc-dimms' numa property each time you need the mapping, instead of additionally storing that mapping elsewhere?
> > > >
> > > > The original patch did that, but I suggested the numa_set_mem_node_id() API for two reasons: 1) not requiring special cases for hotplug inside numa_get_node(); 2) not introducing a circular dependency between pc-dimm.c and numa.c.
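To make that concrete, here is a rough standalone sketch of the range-tracking idea behind numa_set_mem_node_id()/numa_get_node(). The struct, the fixed-size table, and the exact signatures are illustrative assumptions for the example, not the actual QEMU implementation:

#include <stdint.h>
#include <stdio.h>

#define MAX_MEM_RANGES 256

typedef struct MemNodeRange {
    uint64_t start;
    uint64_t size;
    uint32_t node;
} MemNodeRange;

static MemNodeRange mem_ranges[MAX_MEM_RANGES];
static int nb_mem_ranges;

/* Record that [addr, addr + size) belongs to NUMA node 'node'. */
void numa_set_mem_node_id(uint64_t addr, uint64_t size, uint32_t node)
{
    if (nb_mem_ranges < MAX_MEM_RANGES) {
        mem_ranges[nb_mem_ranges++] = (MemNodeRange){ addr, size, node };
    }
}

/* Return the node owning 'addr', or -1 if no registered range covers it. */
int numa_get_node(uint64_t addr)
{
    for (int i = 0; i < nb_mem_ranges; i++) {
        if (addr >= mem_ranges[i].start &&
            addr - mem_ranges[i].start < mem_ranges[i].size) {
            return mem_ranges[i].node;
        }
    }
    return -1;
}

int main(void)
{
    /* Boot-time RAM registered by the machine: 2 GiB on node 0, 2 GiB on node 1. */
    numa_set_mem_node_id(0, 2ULL << 30, 0);
    numa_set_mem_node_id(2ULL << 30, 2ULL << 30, 1);

    /* A hotplugged DIMM registered later from the machine's plug handler. */
    numa_set_mem_node_id(4ULL << 30, 1ULL << 30, 1);

    printf("0x%llx -> node %d\n", 0x80000000ULL, numa_get_node(0x80000000ULL));
    printf("0x%llx -> node %d\n", (unsigned long long)(4ULL << 30),
           numa_get_node(4ULL << 30));
    return 0;
}

With something like this, numa_get_node() treats boot-time RAM and hotplugged DIMMs identically; the only requirement is that whoever assigns an address range also registers it.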
> > > What circular dependency would doing foreach(pc-dimm) introduce? So far pc-dimm is independent from numa.c and is a regular device with no dependencies (except for the backend memdev), and it has its own 'numa' property for providing that information to interested users. I'd rather keep it separate from the legacy numa.c "-numa" handling.
> >
> > pc-dimm.c already depends on numa.c because it checks nb_numa_nodes inside pc_dimm_realize().
>
> The check should be in pc_dimm_plug() instead of realize, but I guess it saves duplication when pc-dimm is reused with other targets; anyway, we could move it out into a generic common function.
Agreed.

> > I don't understand what you mean by "legacy numa.c -numa handling". Unless there's a way to query pc-dimm (or other code) for all (address -> numa_node) mappings without a special case for memory hotplug[1], I wouldn't call node_info[].node_mem "legacy".
> >
> > [1] And this is exactly what I want to provide with numa_set_mem_node_id(): an API that doesn't require special cases for memory hotplug.
>
> For x86, board makers usually define the address range -> node mapping statically, so node_info[].node_mem & numa_set_mem_node_id() make sense. And sPAPR could probably do the same; there would be no need to scan for pc-dimm devices.

OK, so we are on the same page now.

> > > [...]
> > > Why would one need to pull dimm properties into numa.c?
> >
> > I am not sure I parsed the question correctly, but:
> >
> > The original commit message explains why numa_get_node(addr) is needed:
> >
> > > > This is needed by sPAPR PowerPC to support the ibm,dynamic-reconfiguration-memory device tree node, which is needed for memory hotplug.
> >
> > And to make this work, it needs to be aware of NUMA information for hotplugged memory too.
>
> I've checked spapr_populate_drconf_memory() from the original series; it needs to be aware at startup of the address range -> node mapping, including the partitioning of the whole hotplug memory range (i.e. not just the actually hotplugged memory). -numa node_mem & numa_set_mem_node_id() are sufficient for this purpose.

Good. :)

> > My proposal is to do that inside the code that actually assigns addresses to pc-dimm: pc_dimm_realize() (pc.c). So pc-dimm.c and numa.c won't reference each other, and numa.c won't need any special code for hotplug.
>
> You've probably meant pc_dimm_plug() instead of pc_dimm_realize(), but it shouldn't call numa_set_mem_node_id(), since partitioning is static and is done at startup time.

Sorry, I meant pc_dimm_plug(). I didn't know partitioning was static. I thought it was dynamic and defined by the "node" property on pc-dimm. (But I am confused by what you say below.)

> > We could have common helpers later to avoid duplicating the same logic in other machines, but the point is that we don't need to make numa.c carry special memory hotplug code, and we don't need to make pc-dimm.c care about -numa.
>
> Agreed.
>
> We don't have static partitioning of the hotplug memory address space on the x86 target because it limits the flexibility of hot-plugging any amount of memory to any node. (But we could do both: static partitioning via the SRAT table, overridden by the _PXM method for dynamic assignment.)
>
> The issue with static partitioning is that mgmt tools have to know the address & size of the hotplug memory address space to partition it, but QEMU doesn't provide that info so far.

OK, so we agree about the numa_set_mem_node_id() API, but I am confused by what you said about static partitioning. Didn't you just say above that partitioning is static and done at startup time?

> > > > Having a numa_set_memory_region_node_id(MemoryRegion *mr, int node) API would probably be better, and make the discussion about pc_dimm.addr
> > > Wouldn't it be a layering violation if we put (frontend) nodeid information into a (backend) MemoryRegion?
> >
> > We wouldn't; it would be a numa.c function that would just keep track of the list of MemoryRegions for each NUMA node.
> I still don't get how the idea to use MemoryRegions applies to the above,

I was just considering getting a MemoryRegion pointer as argument instead of (addr, length); all the rest would be the same. But:

> you can have MemoryRegions for present memory, but there isn't any for memory that hasn't been plugged in yet.
> numa_set_mem_node_id() looks like the way to go, and a simple one at that.

I believe numa_get_node() will be expected to return info only for memory that was already plugged. But if some machines make it return valid info for unplugged memory also, it would be a nice extra feature. As we don't have separate MemoryRegions for the still-unplugged areas (yet?), your point seems to be valid.

--
Eduardo
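As a concrete illustration of the flow discussed above, a rough standalone sketch of the plug-time registration follows. DimmDevice and machine_plug_dimm() are made-up stand-ins for pc_dimm_plug() (pc.c) and the sPAPR plug path, not actual QEMU code:

#include <stdint.h>

typedef struct DimmDevice {
    uint64_t addr;   /* assigned by the machine during ->plug() */
    uint64_t size;
    uint32_t node;   /* the DIMM's "node" property */
} DimmDevice;

/* Provided by numa.c (see the earlier sketch). */
void numa_set_mem_node_id(uint64_t addr, uint64_t size, uint32_t node);

static void machine_plug_dimm(DimmDevice *dimm)
{
    /* ... address allocation and memory-region mapping happen first,
     * so dimm->addr is only final at this point ... */

    /* Record the (range -> node) mapping here; numa.c never has to know
     * about pc-dimm, and numa_get_node() needs no hotplug special case. */
    numa_set_mem_node_id(dimm->addr, dimm->size, dimm->node);
}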