On Sat, 22 Nov 2008 12:17:22 EST, Dave Hansen wrote:
On Fri, 2008-11-21 at 18:49 -0600, Nathan Lynch wrote:
Dave Hansen wrote:
I was handed off a bug report about a blade not booting with a, um
"newer" kernel.

If you're unable to provide basic information such as the kernel
version then perhaps this isn't the best forum for discussing this. :)

Let's just say a derivative of 2.6.27.5. I will, of course, be trying to reproduce on mainline. For now I'm just going with a kernel as close to the one in the bug report as I can get.

This reminds me: I was asked to look at a system that had all cpus and memory on node 1. I had recently switched to 2.6.27.0, and I saw a similar failure when I tried my latest development kernel. However, I realized that the user wanted to run my previously supported 2.6.24 kernel, which did not have this issue, so I never got back to debugging this problem. (Both kernels had similar patches applied, but very little touching mm or NUMA node selection.) I was able to fix the problem they were having and returned the machine to them without debugging this issue, but I suspect the problem was introduced to mainline between 2.6.24 and 2.6.27.

I'm thinking that we need to at least fix careful_allocation() to oops
and not return NULL, or check to make sure all its callers check its
return code.

Well, careful_allocation() in current mainline tries pretty hard to
panic if it can't satisfy the request.  Why isn't that happening?

I added some random debugging to careful_alloc() to find out.

careful_allocation(1, 7680, 80, 0)
careful_allocation() ret1: 00000001dffe4100
careful_allocation() ret2: 00000001dffe4100
careful_allocation() ret3: 00000001dffe4100
careful_allocation() ret4: c000000000000000
careful_allocation() ret5: 0000000000000000

It looks to me like it is hitting the 'memory came from a previously
allocated node' check.  So, the __lmb_alloc_base() appears to get
something worthwhile, but that gets overwritten later.

I'm still not quite sure what this comment means. Are we just trying to
get node locality from the allocation?

My memory (and a quick look) is that careful_allocation() is used while we are in the process of creating the memory maps for each node. We want them to be allocated from memory on that node, but will accept memory from any node to handle the case where no memory is available on the desired node. Linux requires the maps to exist for every online node.
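
For reference, the fallback ordering I remember is roughly the following; this is a sketch from memory of the 2.6.27-era careful_allocation() in arch/powerpc/mm/numa.c, not the literal source, so details may be off:

        /*
         * Sketch (from memory) of the "prefer this node, but accept any
         * memory" ordering in careful_allocation().
         */
        static void __init *careful_allocation(int nid, unsigned long size,
                                               unsigned long align,
                                               unsigned long end_pfn)
        {
                unsigned long ret;

                /* First try below the end of this node's memory. */
                ret = __lmb_alloc_base(size, align, end_pfn << PAGE_SHIFT);

                /* Fall back to anywhere in RAM if that fails. */
                if (!ret)
                        ret = __lmb_alloc_base(size, align, lmb_end_of_DRAM());

                if (!ret)
                        panic("numa.c: cannot allocate %lu bytes on node %d",
                              size, nid);

                /* ... then the new_nid < nid handling quoted below ... */
                return (void *)ret;
        }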

Because we are in the process of transferring the memory between allocators, the check for new_nid < nid is meant to say "if the memory did not come from the preferred node, but instead came from one we already transferred, then we need to obtain that memory from the new allocator". If it came from the preferred node or a later node, the allocation we did is valid, and will be marked in-use when we transfer that node's memory. For example, if node 0 has already been handed over to bootmem and the LMB allocation for node 1 happens to land in node 0's memory, we have to redo the allocation through node 0's bootmem allocator so that bootmem knows the range is in use.

I also need to go look at how __alloc_bootmem_node() ends up returning
c000000000000000.  It should be returning NULL, and panic'ing, in
careful_alloc(). This probably has to do with the fact that NODE_DATA()
isn't set up, yet, but I'll double check.

We set up NODE_DATA with the result of this alloc in nid order. If early_pfn_to_nid() returns the wrong value then we would obviously be in trouble here.

        /*
         * If the memory came from a previously allocated node, we must
         * retry with the bootmem allocator.
         */
        new_nid = early_pfn_to_nid(ret >> PAGE_SHIFT);
        if (new_nid < nid) {
                ret = (unsigned long)__alloc_bootmem_node(NODE_DATA(new_nid),
                                size, align, 0);
                dbg("careful_allocation() ret4: %016lx\n", ret);

                if (!ret)
                        panic("numa.c: cannot allocate %lu bytes on node %d",
                              size, new_nid);

                ret = __pa(ret);
                dbg("careful_allocation() ret5: %016lx\n", ret);

                dbg("alloc_bootmem %lx %lx\n", ret, size);
        }
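
For context, my recollection of the caller side is that do_init_bootmem() walks the online nodes in nid order and plants the result straight into NODE_DATA(), roughly like this (again a sketch from memory, not the literal 2.6.27 source):

        for_each_online_node(nid) {
                unsigned long start_pfn, end_pfn;

                get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);

                /*
                 * Allocate the node structure, node-local if possible.
                 * careful_allocation() hands back a physical address,
                 * hence the __va() before using it.
                 */
                NODE_DATA(nid) = careful_allocation(nid,
                                        sizeof(struct pglist_data),
                                        SMP_CACHE_BYTES, end_pfn);
                NODE_DATA(nid) = __va(NODE_DATA(nid));
                memset(NODE_DATA(nid), 0, sizeof(struct pglist_data));

                /* ... bootmem bitmap setup for the node follows ... */
        }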

Perhaps someone can recreate this with the fake NUMA stuff that was added since 2.6.24? Or edit a device tree to fake the NUMA assignments for memory, and kexec using the modified tree.
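
My understanding (not checked against the current source) is that the powerpc fake NUMA support takes a comma-separated list of ascending memory boundaries in MB on the kernel command line, so something along these lines might push most of memory onto node 1:

        numa=fake=16,4096

with the intent that the first 16MB land on node 0 and the rest up to 4GB on node 1. The boundary values here are placeholders, and the exact syntax would need checking against the numa=fake= parsing in numa.c.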

milton
