Re: [3/5] 2.6.21-rc4: known regressions (v2)

Rafael J. Wysocki Mon, 26 Mar 2007 13:01:50 -0800

On Sunday, 25 March 2007 22:37, Eric W. Biederman wrote:
> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
> 
> > On Sunday, 25 March 2007 14:56, Eric W. Biederman wrote:
> >> "Rafael J. Wysocki" <[EMAIL PROTECTED]> writes:
> >> 
> >> > Yes, in kernel/power/disk.c:power_down() .
> >> >
> >> > Please comment out the disable_nonboot_cpus() in there and retest (but
> > please
> >> > test the latest Linus' tree).
> >> 
> >> <rant>
> >> 
> >> Why do we even need a disable_nonboot_cpus in that path?  machine_shutdown
> >> on i386 and x86_64 should take care of that.  Further the code that 
> >> computes
> >> the boot cpu is bogus (not all architectures require cpu == 0 to be
> >> the boot cpu), and disabling non boot cpus appears to be a strong
> >> x86ism, in the first place.
> >
> > Yes.
> >  
> >> If the only reason for disable_nonboot_cpus there is to avoid the
> >> WARN_ON in init_low_mappings() we should seriously consider killing
> >> it.
> >
> > We have considered it, but no one was sure that it was a good idea.
> 
> The problem with the current init_low_mappings is that it hacks the
> current page table.  If we can instead use a different page table
> the code becomes SMP safe.


What exactly is the danger here?

> I have extracted the patch that addresses this from the relocatable
> patchset and appended it for sparking ideas.  It goes a little
> farther than we need to solve this issue but the basics are there.
> 
> >> If we can wait for 2.6.22 the relocatable x86_64 patchset that 
> >> Andi has queued, has changes that kill the init_low_mapping() hack.
> >
> > I think we should kill the WARN_ON() right now, perhaps replacing it with
> > a FIXME comment.
> 
> Reasonable.
> 
> >> I'm not very comfortable with calling cpu_down in a common code path
> >> right now either.  I'm fairly certain we still don't have that
> >> correct.  So if we confine the mess that is cpu_down to #if
> >> defined(CPU_HOTPLUG) && defined(CONFIG_EXPERIMENTAL) I don't care.
> >> If we start using it everywhere I'm very nervous.
> >> migration when bringing a cpu down is strongly racy, and I don't think
> >> we actually put cpus to sleep properly either.
> >
> > I'm interested in all of the details, please.  I seriously consider dropping
> > cpu_up()/cpu_down() from the suspend code paths.
> 
> So I'm not certain if in a multiple cpu context we can avoid all of the
> issues with cpu hotplug but there is a reasonable chance so I will
> explain as best I can.
> 
> Yanking the appropriate code out of linuxbios the way a processor should stop
> itself is to send an INIT IPI to itself.  This puts a cpu into an optimized
> wait for startup IPI state where it is otherwise disabled.  This is the state
> any sane BIOS will put the cpus into before control is handed off to the 
> kernel.
> 
> > static inline void stop_this_cpu(void)
> > {
> >         unsigned apicid;
> >         apicid = lapicid();
> > 
> >         /* Send an APIC INIT to myself */
> >         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> >         lapic_write(LAPIC_ICR, LAPIC_INT_LEVELTRIG | LAPIC_INT_ASSERT | 
> > LAPIC_DM_INIT);
> >         /* Wait for the ipi send to finish */
> >         lapic_wait_icr_idle();
> > 
> >         /* Deassert the APIC INIT */
> >         lapic_write(LAPIC_ICR2, SET_LAPIC_DEST_FIELD(apicid));
> >         lapic_write(LAPIC_ICR,  LAPIC_INT_LEVELTRIG | LAPIC_DM_INIT);
> >         /* Wait for the ipi send to finish */
> >         lapic_wait_icr_idle();
> > 
> >         /* If I haven't halted spin forever */
> >         for(;;) {
> >                 hlt();
> >         }
> > }
> 
> I'm not certain what to do with the interrupt races.  But I will see
> if I can explain what I know.
> 
> <braindump>
> 
> - Most ioapics are buggy.
> - Most ioapics do not follow pci-ordering rules with respect to
>   interrupt message deliver so ensuring all in-flight irqs have
>   arrived somewhere is very hard.
> - To avoid bugs we always limit ourselves to reprogramming the ioapics
>   in the interrupt handler, and not considering an interrupt
>   successfully reprogrammed until we have received an irq in the new
>   location.
> - On x86 we have two basic interrupt handling modes.
>   o logical addressing with lowest priority delivery.
>   o physical addressing with delivery to a single cpu.
> - With logical addressing as long as the cpu is not available for
>   having an interrupt delivered to it the interrupt will be
>   never be delivered to a particular cpu.  Ideally we also update
>   the mask in the ioapic to not target that cpu.
> - With physical addressing targeting a single cpu we need to reprogram
>   the ioapics not to target that specific cpu.  This needs to happen
>   in the interrupt handler and we need to wait for the next interrupt
>   before we tear down our data structures for handling the interrupt.
> 
>   The current cpu hotplug code attempts to reprogram the ioapics from
>   process context which is just wrong.

I wasn't aware of that.

> Now as part of suspend/resume I think we should be programming the
> hardware not to generate interrupts in the first place at the actual
> hardware devices so we can likely avoid all of the code that
> reprograms interrupts while they are active.

The devices are expected (and in fact required) not to generate interrupts
and not to do any DMA transfers after they have been frozen.

> If we can use things like pci ordering rules to ensure the device will
> never fire the interrupt until resumed we should be able to disable interrupts
> synchronously.  Something that we can not safely do in the current
> cpu hotplug scenario.  Which should make the problem of doing the
> work race free much easier.

I think this is doable.

> I don't know if other architectures need to disable cpus or not before
> doing a suspend to ram or a suspend to disk.

In the suspend to disk context we must ensure that the other CPUs won't
modify memory in any way while the image is being created, so I think we
should at least make them loop in a safe place and refuse to take any
interrupts.

> I also don't know if we have any code that brings up the other cpus
> after we have been suspended.  In either the cpu hotplug paths or the
> architecture specific paths.

I'm not sure what you mean.

Anyway, currently cpu_up() is used to enable the other CPUs when we have
finished with the image (as well as during the resume).
 
> The bottom line is after the partial review of the irq handling during
> cpu shutdown a little while ago that I consider the current cpu
> hotplug code to be something that works when you are lucky.  There are
> to many hardware bugs to support the general case it tries to
> implement.

Well, we've been using it for suspending SMP boxes for quite some time now
and there haven't been any major low-level problems.  The current problems are
related to the fact that the CPU hotplug calls lots of notifiers that need not
run during the suspend or resume (or in power_down() for that matter).

> I have not reviewed the entire cpu hotplug path nor have I even tried
> suspend/resume.  But I have been all and down the boot and shutdown
> code paths so I understand the general problem.
> 
> The other part I know is that cpu numbers are assigned at the
> architectures discretion.  So unless the architecture code or the
> early boot code sets a variable remembering which was the boot cpu
> there is no reason to believe we can deduce it.

I'm not sure if we have to.  On x86_* we cannot turn the 1st CPU off anyway.

> In addition I don't know if any other architectures have anything resembling
> the x86 requirement to disable non-boot cpus.  I do know machine_shutdown
> on i386 and x86_64 performs that action (if not perfectly) so at
> least currently we should be able to do this in architecture
> specific code.
> 
> </braindump>

Thanks a lot for the info.

The patch looks a bit too complicated for a quick fix.  I think we'll need to
remove that WARN_ON() in 2.6.21 after all.

Greetings,
Rafael
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [3/5] 2.6.21-rc4: known regressions (v2)

Reply via email to