On 04/18/2014 09:03 AM, Nishanth Aravamudan wrote: > On 18.04.2014 [08:46:55 +1000], Benjamin Herrenschmidt wrote: >> On Fri, 2014-04-18 at 08:43 +1000, Alexey Kardashevskiy wrote: >>> On 04/18/2014 06:25 AM, Nishanth Aravamudan wrote: >>>> Hi Alexey, >>>> >>>> Prior to the $SUBJECT commit, we could present memoryless node0s to >>>> guests. Now, we indicate that we don't have the requisite 128M for the >>>> RMA if node 0 has no memory. Note that a memoryless node0 is possible >>>> under PowerVM (but not predictably present) so I was hoping to use KVM >>>> to test relevant fixes for memoryless nodes. >>>> >>>> I think this change is a misinterpretation of the PAPR standard, though. >>>> Yes, the RMA must be in the first block of memory, but that isn't >>>> necessarily on node 0. The topology of a PAPR-compliant guest does not >>>> require a node 0 (and in fact, under PowerVM, Linux doesn't actually >>>> require node 0 either, but it would under KVM). >>>> >>>> Thoughts? I suppose it's fine to say that node 0 must be sufficiently >>>> populated under KVM -- there's not really a reason to not have memory on >>>> a given node (except maybe ballooning). I can keep the commit reverted >>>> locally for testing purposes. Just wanted to see if the semantic change >>>> was intentional. >>> >>> >>> PAPR spec 2.7: >>> C.6.6 Memory Node >>> === >>> This section defines the PAPR modifications to the OF /memory node. In >>> PAPR, the memory allocated to an OS image >>> may be divided into a number of allocation units called ???regions??? or >>> ???Logical Memory Blocks (LMB). An OS image >>> may be dynamically allocated additional regions or may be asked to release >>> regions. Each LMB is either represented in >>> the device tree by its own /memory node or by an entry in >>> /ibm,dynamic-reconfiguration-memory nodes >>> (see Section C.6.6.2??? ???ibm,dynamic-reconfiguration-memory?????? on page >>> 1089). >>> The /memory node that refers to the >>> storage starting at real address zero (???reg??? property starting at the >>> value >>> zero) always remains allocated to an OS image. >>> >>> The client program is initially loaded into this storage, called the RMA, >>> that is represented by the first value of the >>> ???reg??? property of this first /memory node. >>> === >>> >>> The last sentence is why the change was made. It does not say "first >>> populated" node. I am adding Ben as he had very strong opinion about this >>> thing. >> >> You are confusing device-tree node with NUMA nodes. >> >> Yes, it must be the LMB at address 0, which is the /memory node, but >> that doesn't have to be NUMA node 0. > > Yeah, so I think the check that was added: > > - if (spapr->rma_size > node0_size) { > - fprintf(stderr, "Error: Numa node 0 has to span the RMA > (%#08"HWADDR_PR > - spapr->rma_size); > - exit(1); > - } > > incorrectly is checking against node0_size? It should be checking > against the first LMB instead, right?
Which would be what in the current QEMU? If I read the current QEMU code correctly, NUMA nodes and device tree nodes are the same thing for SPAPR now, see spapr_populate_memory() function. -- Alexey