On 10/01/2008, Olof Johansson <[EMAIL PROTECTED]> wrote:
> On Wed, Jan 09, 2008 at 10:12:13PM -0600, Linas Vepstas wrote:
> > On 09/01/2008, Olof Johansson <[EMAIL PROTECTED]> wrote:
> > > On Wed, Jan 09, 2008 at 08:33:53PM -0600, Linas Vepstas wrote:
> > > >
> > > > Heh. That's the elbow-grease of this thing. The easy part is to get
> > > > the core function working. The hard part is to test these various
> > > > configs, and when they don't work, figure out what went wrong.
> > > > That will take perseverance and brains.
> > >
> > > This just sounds like a whole lot of extra work to get a feature that
> > > already exists.
> >
> > Well, no. kexec is horribly ill-behaved with respect to PCI. The
> > kexec kernel starts running with PCI devices in some random
> > state; maybe they're DMA'ing or who knows what. kexec tries
> > real hard to whack a few needed pci devices into submission
> > but it has been hit-n-miss, and the source of 90% of the kexec
> > headaches and debugging effort. It's not pretty.
>
> It surprises me that this hasn't been possible to resolve with less than
> architecting a completely new interface, given that the platform has
> all this fancy support for isolating and resetting adapters. After all,
> the exact same thing has to be done by the hypervisor before rebooting
> the partition.

OK, point taken.

-- The phyp interfaces are there for AIX, which I guess must not
   have a kexec-like ability. So this is a case of Linux leveraging
   a feature architected for AIX.

-- There's also this idea, somewhat weak, that the crash may have
   corrupted the RAM where the kexec kernel sits. For someone who is
   used to seeing crashes due to null pointer derefs, this seems
   fairly unlikely. But perhaps crashes in production systems are
   more mind-bending. (We did have a case where a USB stick used for
   boot continued to scribble on memory long after it was supposed
   to be quiet and unused. This resulted in a very hard-to-debug
   crash.)

A solution to a corrupted kexec kernel would be to disable memory
access to where kexec sits, e.g. by un-mapping or making r/o the pages
where it lies. This raises the questions of "who unhides the
kexec kernel", and "what if this 'who' gets corrupted"?

In short, the kexec kernel does not boot exactly the same as a
cold boot, and so this opens a can of worms about "well, what's
different, how do we minimize these differences, etc." and I think
that led AIX to punt, and say "let's just use one single, well-known
boot loader/boot sequence instead of inventing a new one", thus
leading to the phyp design. But that's just my guess. :-)

--linas
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev