On Fri, Jan 11, 2008 at 10:57:51AM -0600, Linas Vepstas wrote:
> On 10/01/2008, Nathan Lynch <[EMAIL PROTECTED]> wrote:
> > Mike Strosaker wrote:
> > >
> > > At the risk of repeating what others have already said, the
> > > PHYP-assistance method provides some advantages that the kexec
> > > method cannot:
> > > - Availability of the system for production use before the dump
> > >   data is collected. As was mentioned before, some production
> > >   systems may choose not to operate with the limited memory
> > >   initially available after the reboot, but it sure is nice to
> > >   provide the option.
> >
> > I'm more concerned that this design encourages the user to resume a
> > workload *which is almost certainly known to result in a system crash*
> > before collection of crash data is complete. Maybe the gamble will
> > pay off most of the time, but I wouldn't want to be working support
> > when it doesn't.
>
> Workloads that cause crashes within hours of startup tend to be
> weeded-out/discovered during pre-production test of the system
> to be deployed. Since its pre-production test, dumps can be
> taken in a leisurely manner. Heck, even a session at the
> xmon prompt can be contemplated.
>
> The problem is when the crash only reproduces after days or
> weeks of uptime, on a production machine. Since the machine
> is in production, its got to be brought back up ASAP. Since
> its crashing only after days/weeks, the dump should have
> plenty of time to complete. (And if it crashes quickly after
> that reboot ... well, support people always welcome ways
> in which a bug can be reproduced more quickly/easily).

How do you expect to have it in full production if you don't have all
resources available for it? It's not until the dump has finished that
you can return all memory to the production environment and use it.

This can very easily be argued in both directions, with no clear winner.
If the crash is stress-induced (say, a slashdotted website), it seems
more rational to take the time, collect _good data_ even if it takes a
little longer, and then go back into production. Especially if the
alternative is to go back into production immediately, collect about
half of the data, and then crash again. Rinse and repeat.

Anyway, I can agree with some of the arguments that robustness and
reliability of collecting dumps can be higher using this approach. It
really surprises me that there's no way to reset a device through PHYP
though. Seems like such a fundamental feature.

I think people are overly optimistic if they think it'll be possible to
do all of this reliably (as in, with consistent performance) without a
second reboot though. At least not without an amount of work similar to
what it would have taken to fix kdump's reliability in the first place.

Speaking of reboots: PHYP isn't known for being quick at rebooting a
partition. It used to take on the order of minutes even on a small
machine. Has that been fixed? If not, the "avoiding an extra reboot"
argument hardly seems like a benefit versus kdump+kexec, which reboots
nearly instantly and without involvement from PHYP.

-Olof