On Sat, 22 Nov 2008 12:17:22 EST, Dave Hansen wrote:
On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
Dave Hansen wrote:
I was handed off a bug report about a blade not booting with a, um
"newer" kernel.
If you're unable to provide basic information such as the kernel
version then perhaps this isn't the best forum for discussing this.
:)
Let's just say a derivative of 2.6.27.5. I will, of course, be trying
to reproduce on mainline. I'm just going with the kernel closest to
the bug report that I can get for now.
This reminds me. I was asked to look at a system that had all cpus and
memory on node 1. I recently switched to 2.6.27.0, and hit a similar
failure when I tried my latest development kernel. However, I realized
that the user wanted to run my previously supported 2.6.24 kernel,
which did not have this issue, so I never got back to debugging this
problem. (Both kernels had similar patches applied, but very little
touching mm or numa selection.) I was able to fix the problem they
were having and returned the machine to them without debugging the
issue, but I suspect the problem was introduced to mainline between
2.6.24 and 2.6.27.
I'm thinking that we need to at least fix careful_allocation() to oops
and not return NULL, or check to make sure all its callers check its
return code.
Well, careful_allocation() in current mainline tries pretty hard to
panic if it can't satisfy the request. Why isn't that happening?
I added some random debugging to careful_alloc() to find out.
careful_allocation(1, 7680, 80, 0)
careful_allocation() ret1: 00000001dffe4100
careful_allocation() ret2: 00000001dffe4100
careful_allocation() ret3: 00000001dffe4100
careful_allocation() ret4: c000000000000000
careful_allocation() ret5: 0000000000000000
It looks to me like it is hitting 'the memory came from a previously
allocated node' check. So, the __lmb_alloc_base() appears to get
something worthwhile, but that gets overwritten later.
I'm still not quite sure what this comment means. Are we just trying
to get node locality from the allocation?
My memory (and a quick look) is that careful alloc is used while we are
in the process of creating the memory maps for the node. We want them
to be allocated from memory on the node, but will accept memory from
any node to handle the case that memory is not available in the desired
node. Linux requires the maps exist for every online node.
Because we are in the process of transferring the memory between
allocators, the check for new_nid < nid is meant to say "if the memory
did not come from the preferred node, but instead came from one we
already transferred, then we need to obtain that memory from the new
allocator". If it came from the preferred node or a later node, the
allocation we did is valid, and will be marked in-use when we transfer
that node's memory.
I also need to go look at how __alloc_bootmem_node() ends up returning
c000000000000000. It should be returning NULL, and panic'ing, in
careful_alloc(). This probably has to do with the fact that
NODE_DATA() isn't set up yet, but I'll double check.
We set up NODE_DATA with the result of this alloc in nid order. If
early_pfn_to_nid returns the wrong value then we would obviously be in
trouble here.
	/*
	 * If the memory came from a previously allocated node, we must
	 * retry with the bootmem allocator.
	 */
	new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
	if (new_nid < nid) {
		ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
				size, align, 0);
		dbg("careful_allocation() ret4: %016lx\n", ret);
		if (!ret)
			panic("numa.c: cannot allocate %lu bytes on node %d",
			      size, new_nid);
		ret = __pa(ret);
		dbg("careful_allocation() ret5: %016lx\n", ret);
		dbg("alloc_bootmem %lx %lx\n", ret, size);
	}
Perhaps someone can recreate this with the fake numa stuff that was
added since 2.6.24? Or edit a device tree to fake the numa
assignments for memory and kexec using the modified tree.
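If the fake-numa route is taken, something along these lines might work. This is only a sketch: it assumes the numa=fake= command-line option that went into powerpc after 2.6.24, and the kernel image path and root device are made up:

```
# Load the test kernel with a fake multi-node memory layout, then boot it.
# The numa=fake= region syntax and the paths here are assumptions.
kexec -l /boot/vmlinux-test --append="root=/dev/sda2 numa=fake=2G,4G"
kexec -e
```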
milton
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev