On 2012-08-23 08:24, Matthew Ogilvie wrote:
> This patch provides a way to optionally suppress spurious interrupts,
> as a workaround for systems described below:
>
> Some old operating systems do not handle spurious interrupts well,
> and qemu tends to generate them significantly more often than
> real hardware.
>
> Examples:
>   - Microport UNIX System V/386 v 2.1 (ca 1987)
>     (The main problem I'm fixing: Without this patch, it panics
>     sporadically when accessing the hard disk.)
>   - AT&T UNIX System V/386 Release 4.0 Version 2.1a (ca 1991)
>     See screenshot in "QEMU Official OS Support List":
>     http://www.claunia.com/qemu/objectManager.php?sClass=application&iId=9
>     (I don't have this system to test.)
>   - A report about OS/2 boot lockup from 2004 by Hampa Hug:
>     http://lists.nongnu.org/archive/html/qemu-devel/2004-09/msg00367.html
>     (My patch was partially inspired by his.)
>     Also:
>     http://lists.nongnu.org/archive/html/qemu-devel/2005-06/msg00243.html
>     (I don't have this system to test.)
>
> Signed-off-by: Matthew Ogilvie <mmogilvi_q...@miniinfo.net>
> ---
>
> Note: checkpatches.pl gives an error about initializing the global
> "int no_spurious_interrupt_hack = 0;", even though existing lines
> near it are doing the same thing. Should I give precedence to
> checkpatches.pl, or nearby code?
>
> There was no version 1 of this patch; this was the last thing I had to
> work around to get UNIX running.
>
> High level symptoms:
>   1. Despite using this UNIX system for nearly 10 years (ca 1987-1996)
>      on an early 80386, I don't remember ever seeing any crash like
>      this. I vaguely remember I may have had one or two crashes for
>      which I don't have other explanations that perhaps could have
>      been this, but I don't remember the error messages to confirm it.
>   2. It is somewhat random when UNIX crashes when running in qemu.
>      - Sometimes it crashes the first time the floppy-based installer
>        tries to access the hard disk (partition table?).
>      - Other times (though fairly rarely), it actually finishes
>        formatting and copying the first disk's files to the
>        hard disk without crashing.
>      - On the other hand, I've never seen it successfully boot from
>        the hard disk without this patch. An attempt to boot from
>        the hard drive always panics quite early.
>   3. I tried -win2k-hack instead, thinking maybe the hard disk is just
>      responding faster than UNIX expected. But it doesn't seem
>      to have any effect. UNIX still panics sporadically the same way.
>      - TANGENT: I was going to see if my patch provides an
>        alternative fix for installing Windows 2000, but
>        I was unable to reproduce the original -win2k-hack problem at
>        all (with neither -win2k-hack NOR this patch). Maybe
>        some other change has fixed it some other way? Or maybe
>        it is only an issue in configurations I didn't test?
>        (KVM instead of TCG? Less RAM? Something else?)
>        It might be worth doing a little more investigation,
>        and eliminating the -win2k-hack option if appropriate.
>   4. If I enable KVM, I get a different error very early in
>      bootup (in splx function instead of splint), and this patch
>      doesn't help.
>
> ============
> My low level analysis of what is going on:
>
> It is hard to track down all the details, but based on logging a
> lot of qemu IRQ stuff, and setting a breakpoint in the earliest
> panic-related UNIX function using gdb, it looks like:
>
>   1. It is near the end of servicing a previous IRQ14 from the
>      hard disk.
>   2. The processor has interrupts disabled (I think), while UNIX
>      clears the slave 8259's IMR (mask) register (sets it to 0), allowing
>      all interrupts to be passed on to the master.
>   3. While in that state, IRQ14 is raised (on the slave), which
>      gets propagated to the master (IRQ2), but the CPU
>      is not interrupted yet.
>   4. UNIX then masks the slave 8259's IMR register
>      completely (sets to 0xff).
>   5.
>      Because the master elcr register is set (by BIOS; UNIX never
>      touches it) to edge trigger for IRQ2, the master latched on
>      to IRQ2 earlier, and continues to assert the processor's INT line
>      (the env->interrupt_request&CPU_INTERRUPT_HARD bit) even
>      after all slave IRQs have been masked off (clearing the input
>      IRQ2).
>   6. Finally, UNIX enables CPU interrupts and the interrupt is delivered
>      to the CPU, which ends up as a spurious IRQ15 due to the
>      slave's imr register. UNIX doesn't know what to do with
>      that, and panics/halts.
>
> I'm not sure why it only sporadically hits this sequence of events.
> There doesn't seem to be other IRQs asserted or serviced anywhere
> in the near past; the last several were all IRQ14's. But I can't
> help feeling I'm not reading the log output correctly or something,
> because that doesn't make sense. Maybe there is some kind
> of a-few-instructions delay before a CPU interrupt is actually
> delivered after interrupts are enabled, or some delay in raising
> IRQ14 after a hard drive operation is requested, and such delays
> need to fall into a narrow window of opportunity left by UNIX?
>
> I can get a disassembly of the UNIX kernel using a "coff"-enabled
> build of GNU objdump, giving function names but not much else.
> But I haven't studied it in enough detail to actually find the
> relevant code path that is manipulating imr as described above.
> However, this old post outlines some of the high level theory
> of UNIX spl*() functions:
> http://www.linuxmisc.com/29-unix-internals/4e6c1f6fa2e41670.htm
>
> If anyone wants to look into this further, I can provide access to the
> initial boot install floppy, at least. Email me. (Without the rest
> of the install disks, it isn't much use for anything except testing
> virtual machines like qemu against rare corner cases...)
>
> ============
> Alternative Approaches:
>
> An alternative to this patch that might work (I haven't tried) would
> be to have BIOS set the master's elcr register 0x04 bit, making IRQ2
> level triggered instead of edge triggered. I'm not sure what other
> effects this might have. Maybe it would actually be a more accurate
> model (I haven't checked documentation; maybe "slave mode" of an
> IRQ line into the master is supposed to be level triggered?)
>
> Or perhaps find a way to model the minimum timescale that an interrupt
> request needs to be active to be recognized?
>
> Or maybe my analysis isn't correct; I wasn't able to find the
> relevant code path in the UNIX kernel.
>
> ============
>
>  cpu-exec.c      |   12 +++++++-----
>  hw/i8259.c      |   18 ++++++++++++++++++
>  qemu-options.hx |   12 ++++++++++++
>  sysemu.h        |    1 +
>  vl.c            |    4 ++++
>  5 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 134b3c4..c309847 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -329,11 +329,15 @@ int cpu_exec(CPUArchState *env)
>                                                            0);
>                              env->interrupt_request &= ~(CPU_INTERRUPT_HARD |
>                                                          CPU_INTERRUPT_VIRQ);
>                              intno = cpu_get_pic_interrupt(env);
> -                            qemu_log_mask(CPU_LOG_TB_IN_ASM, "Servicing hardware INT=0x%02x\n", intno);
> -                            do_interrupt_x86_hardirq(env, intno, 1);
> -                            /* ensure that no TB jump will be modified as
> -                               the program flow was changed */
> -                            next_tb = 0;
> +                            if (intno >= 0) {
> +                                qemu_log_mask(CPU_LOG_TB_IN_ASM,
> +                                              "Servicing hardware INT=0x%02x\n",
> +                                              intno);
> +                                do_interrupt_x86_hardirq(env, intno, 1);
> +                                /* ensure that no TB jump will be modified as
> +                                   the program flow was changed */
> +                                next_tb = 0;
> +                            }
>  #if !defined(CONFIG_USER_ONLY)
>                          } else if ((interrupt_request & CPU_INTERRUPT_VIRQ) &&
>                                     (env->eflags & IF_MASK) &&
> diff --git a/hw/i8259.c b/hw/i8259.c
> index 6587666..7ecb7e1 100644
> --- a/hw/i8259.c
> +++ b/hw/i8259.c
> @@ -26,6 +26,7 @@
>  #include "isa.h"
>  #include "monitor.h"
>  #include "qemu-timer.h"
> +#include "sysemu.h"
>  #include "i8259_internal.h"
>
>  /* debug PIC */
> @@ -193,6 +194,20 @@ int pic_read_irq(DeviceState *d)
>                  pic_intack(slave_pic, irq2);
>              } else {
>                  /* spurious IRQ on slave controller */
> +                if (no_spurious_interrupt_hack) {
> +                    /* Pretend it was delivered and acknowledged. If
> +                     * it was spurious due to slave_pic->imr, then
> +                     * as soon as the mask is cleared, the slave will
> +                     * re-trigger IRQ2 on the master. If it is spurious for
> +                     * some other reason, make sure we don't keep trying
> +                     * to half-process the same spurious interrupt over
> +                     * and over again.
> +                     */
> +                    s->irr &= ~(1<<irq);
> +                    s->last_irr &= ~(1<<irq);
> +                    s->isr &= ~(1<<irq);
> +                    return -1;
> +                }
>                  irq2 = 7;
>              }
>              intno = slave_pic->irq_base + irq2;
> @@ -202,6 +217,9 @@ int pic_read_irq(DeviceState *d)
>          pic_intack(s, irq);
>      } else {
>          /* spurious IRQ on host controller */
> +        if (no_spurious_interrupt_hack) {
> +            return -1;
> +        }
>          irq = 7;
>          intno = s->irq_base + irq;
>      }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 03e13ec..57bb0b4 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -1188,6 +1188,18 @@ Windows 2000 is installed, you no longer need this option (this option
>  slows down the IDE transfers).
>  ETEXI
>
> +DEF("no-spurious-interrupt-hack", 0, QEMU_OPTION_no_spurious_interrupt_hack,
> +    "-no-spurious-interrupt-hack disable delivery of spurious interrupts\n",
> +    QEMU_ARCH_I386)
> +STEXI
> +@item -no-spurious-interrupt-hack
> +@findex -no-spurious-interrupt-hack
> +Use it as a workaround for operating systems that drive PICs in a way that
> +can generate spurious interrupts, but the OS doesn't handle spurious
> +interrupts gracefully. (e.g. late 80s/early 90s versions of ATT UNIX
> +and derivatives)
This documentation has to mention, or even actively warn, that the option does not work with KVM and its in-kernel irqchip (as that PIC model lacks your hack).

However, I strongly suspect you are nastily papering over an issue in some device model. So I would prefer to dig deeper before installing this upstream (also due to its dependency on the userspace PIC model).

Jan
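
For anyone who wants to reason about steps 1-6 of the low level analysis quoted above without a full qemu build, here is a minimal stand-alone sketch in C. It is not the qemu code: the Pic struct, read_irq() and run_scenario() are hypothetical names made up for illustration, priority and ISR handling are left out, and the vector bases 0x08/0x70 are assumed values. It only models the claim that an edge-latched IRQ2 on the master survives the slave being fully masked, so the acknowledge cycle finds nothing on the slave and falls back to its IRQ7 (system IRQ15).

```c
#include <stdint.h>

/* Simplified 8259 state: only what this scenario needs (illustrative). */
typedef struct {
    uint8_t irr;      /* latched interrupt requests */
    uint8_t imr;      /* interrupt mask, 1 = masked */
    uint8_t irq_base; /* vector offset (assumed values below) */
} Pic;

/* Lowest-numbered pending and unmasked IRQ, or -1 if there is none. */
static int pic_get_irq(const Pic *p)
{
    uint8_t pending = p->irr & (uint8_t)~p->imr;
    for (int i = 0; i < 8; i++) {
        if (pending & (1u << i)) {
            return i;
        }
    }
    return -1;
}

/* Acknowledge cycle, loosely following the control flow in the quoted
 * pic_read_irq() hunk.  'hack' stands in for no_spurious_interrupt_hack. */
static int read_irq(Pic *master, Pic *slave, int hack)
{
    int irq = pic_get_irq(master);
    if (irq < 0) {
        /* spurious on the master */
        return hack ? -1 : master->irq_base + 7;
    }
    if (irq == 2) {
        int irq2 = pic_get_irq(slave);
        if (irq2 < 0) {
            /* spurious on the slave: nothing unmasked is pending there */
            if (hack) {
                master->irr &= (uint8_t)~(1u << irq);
                return -1;
            }
            irq2 = 7;
        } else {
            slave->irr &= (uint8_t)~(1u << irq2);
        }
        master->irr &= (uint8_t)~(1u << irq);
        return slave->irq_base + irq2;
    }
    master->irr &= (uint8_t)~(1u << irq);
    return master->irq_base + irq;
}

/* Replays steps 2-6 and returns the vector the CPU would see, or -1. */
int run_scenario(int hack)
{
    Pic master = { .irr = 0, .imr = 0x00, .irq_base = 0x08 };
    Pic slave  = { .irr = 0, .imr = 0xff, .irq_base = 0x70 };

    slave.imr = 0x00;               /* step 2: guest unmasks the slave */
    slave.irr |= 1u << 6;           /* step 3: IRQ14 is slave input 6 */
    if (pic_get_irq(&slave) >= 0) {
        master.irr |= 1u << 2;      /* rising edge latched on master IRQ2 */
    }
    slave.imr = 0xff;               /* step 4: guest masks the slave again */
    /* step 5: master.irr bit 2 stays set, nothing clears the latch */
    return read_irq(&master, &slave, hack);   /* step 6: CPU takes the INT */
}
```

Under these assumptions, run_scenario(0) yields the slave base plus 7 (the spurious IRQ15 vector), while run_scenario(1) yields -1, i.e. no interrupt is delivered. The slave's own request for IRQ14 stays latched in its irr, which is what the comment in the patch relies on for re-triggering once the mask is cleared.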