On Tue Dec  8 13:01:13 EST 2009, n...@lsub.org wrote:
> the shutdown code is new and might be wrong. worked here.
> I'll double check.
> 

what i have been seeing is that a random machno can
execute the reboot.  this can cause the setting up of
the lapics to hang the machine.  (the mp spec doesn't
guarantee that any old processor can be the bsp, and
i'm not sure we do enough setup anyway.)  i have also
seen the clock interrupt getting in the way.  it seems
that while reboot() is running, exit() can be called
again from hzclock on the processor that's doing the
shutdown.  this causes a warm reboot rather than
a jump to the new kernel.
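
to picture the re-entry problem: one generic way to make a
second call harmless is a claim-once flag, sketched below in
portable c11.  this is not the fix described further down, and
the names rebooting and claimreboot are made up for
illustration; nothing like them is in the kernel source.

```c
#include <stdatomic.h>

/* hypothetical flag: set once the shutdown path has been entered */
static atomic_flag rebooting = ATOMIC_FLAG_INIT;

/*
 * returns 1 for the first caller only and 0 for every later
 * caller, e.g. a clock interrupt arriving mid-shutdown.
 */
int
claimreboot(void)
{
	return !atomic_flag_test_and_set(&rebooting);
}
```

a caller that gets 0 back would simply return instead of
starting a second shutdown.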

another problem is that there are lots of delays in the
shutdown code.  these become fiddly if we're not going
through a bios reset.  this is because if the timing is
off, an ap can survive the reset and the old kernel
can print "cpu%d: exiting" after the new kernel has
started.  fun!

i can't prove this also explains the pull-the-power
hangs i've seen ~5% of the time, but it seems likely.

so what's the fix?  i'm not sure this is the best fix, but
what i decided to do is to try to get to a case that's easy
to understand and already works, the uniprocessor
case.  this seems to be working pretty well.

so i wrote a kproc to park a processor.  it
1.  procwired()s (sic) itself to the processor in question;
2.  turns off interrupts and splhi's;
3.  calls idle.

then reboot was modified to
1.  wire itself to the bsp (machno==0);
2.  halt mach 1..n;
3.  turn off interrupts.
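
the bookkeeping behind these steps can be tried outside the
kernel.  the sketch below is a single-threaded stand-in, not
the kernel code: machs plays the role of active.machs, park()
only clears a bit where the real kproc halts a processor, and
MAXMACH = 8 is an assumed value.

```c
enum { MAXMACH = 8 };	/* assumed; the real value is kernel-specific */

/* bit i set means processor i is still running */
static unsigned int machs;

/* what a parking kproc does to the mask, minus the real halt */
static void
park(int i)
{
	machs &= ~(1u << i);
}

/* park every ap that is up in start; returns how many were parked */
int
parkaps(unsigned int start)
{
	int i, n;

	machs = start;
	n = 0;
	for(i = 1; i < MAXMACH; i++)
		if(machs & 1u<<i){
			park(i);
			n++;
		}
	return n;
}

/* the mask reboot waits on; 1 means only the bsp is left */
unsigned int
machsleft(void)
{
	return machs;
}
```

starting from a mask of 0x0f (four processors up), parkaps
parks three of them and machsleft() is 1, which is the
condition reboot waits for before it turns off interrupts.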

if there's something here i have missed or apparently
don't understand, please let me know.

- erik

----

typedef struct {
        int i;
} Apshut;

void
apshut(void *v)
{
        Apshut *a;

        a = v;
        procwired(up, a->i);
        sched();
        splhi();
        if (arch)
                arch->introff();
        else
                i8259off();
        active.machs &= ~(1<<a->i);
        print("cpu%d: halt %.2ux\n", m->machno, active.machs);
        idle();
}
        
void
reboot(void *entry, void *code, ulong size)
{
        int i;
        Apshut a[MAXMACH];
        void (*f)(ulong, ulong, ulong);
        ulong *pdb;

        writeconf();

        procwired(up, 0);
        sched();

        for(i = 1; i < MAXMACH; i++){
                a[i].i = i;
                if(active.machs & 1<<i)
                        kproc("apshutdown", apshut, a + i);
        }

        while(active.machs != 1)
                sched();

        print("cpu%d: thunderbirdsarestop %d\n", m->machno, active.machs);
        splhi();
        if (arch)
                arch->introff();
        else
                i8259off();
        print("cpu%d: shutting down...\n", m->machno);

        /* turn off buffered serial console */
        serialoq = nil;

        /* shutdown devices */
        chandevshutdown();

        /*
         * Modify the machine page table to directly map the low 4MB of memory
         * This allows the reboot code to turn off the page mapping
         */
        pdb = m->pdb;
        pdb[PDX(0)] = pdb[PDX(KZERO)];
        mmuflushtlb(PADDR(pdb));

        /* setup reboot trampoline function */
        f = (void*)REBOOTADDR;
        memmove(f, rebootcode, sizeof(rebootcode));

        print("cpu%d: rebooting... %p [%p %p %lux]\n", m->machno,
                PADDR(reboot), PADDR(entry), PADDR(code), size);

        /* off we go - never to return */
        (*f)(PADDR(entry), PADDR(code), size);
}
