On 09/10/13 04:42, Martin Pieuchot wrote:
> Hi Bob,
>
> On 07/09/13(Sat) 08:14, RD Thrush wrote:
>> Since appx. November, 2012, I've had 2 amd64 systems hang while
>> spewing "ehci_idone: ex=0xffff800000...... is done!" messages to the
>> serial console.  The hangs are intermittent.  The system is
>> unresponsive to the keyboard and doesn't respond to network ping.  A
>> hardware reset is necessary to regain control.
>>
>> In order to help troubleshoot, I patched /usr/src/sys/dev/usb/ehci.c
>> to panic when the aforementioned message had occurred 9 times and then
>> built a custom kernel with EHCI_DEBUG defined.  In the past day, the
>> new panic has occurred on the same machine with both an mp and sp
>> kernel and I have collected basic ddb information as well as crash
>> dumps.
>>
>> Will the ddb results from my patch [below] help troubleshoot the hang?
>> If so, the largish console logs, usbdevs, pcidump and acpidump are
>> located at <http://arp.thrush.com/openbsd/ehci_idone/x4/>.
>>
>> NB: ehcidebug=0 in the sp session, while ehcidebug=3 or 2 in the mp
>> session.  Setting ehcidebug=3 seemed to hang but I was able to
>> interrupt ddb, set ehcidebug=2 and continue the ddb session.
>>
>> I appreciate any help diagnosing this problem.
>
> Thanks for this detailed bug report.
>
> You're saying that you have 2 amd64 systems with the same problem but
> I see only the dmesg for one machine; does the other have the same ehci
> controller?
Apparently one is ATI and the other Intel.
<http://arp.thrush.com/openbsd/ehci_idone/v1/> has two console captures,
"v1.1" and "v1.2", for the other machine after an ehci_idone hang (I hadn't
made the panic patch yet).  I was able to generate a ddb interrupt to stop
the spew and gather some additional ddb info.  The aforementioned directory
also has acpidump, pcidump, biosdecode, and dmidecode previously collected
from the same kernel.  If you want/need further info about the 'v1' machine,
let me know and I'll boot OpenBSD and get the info.

x4:
x2:x4/ehci_idone 1481>egrep 'ehci[0-9]' console.mp | grep -w at
ehci0 at pci0 dev 18 function 2 "ATI SB700 USB2" rev 0x00ehci0: offs=32
usb0 at ehci0: USB revision 2.0
ehci1 at pci0 dev 19 function 2 "ATI SB700 USB2" rev 0x00ehci1: offs=32
usb1 at ehci1: USB revision 2.0

v1:
x2:v1/ehci_idone 1479>egrep 'ehci[0-9]' v1.1 | grep -w at
ehci0 at pci0 dev 26 function 0 "Intel 7 Series USB" rev 0x04: apic 2 int 23
usb0 at ehci0: USB revision 2.0
ehci1 at pci0 dev 29 function 0 "Intel 7 Series USB" rev 0x04: apic 2 int 23
usb1 at ehci1: USB revision 2.0

> The problem you are seeing is related to the way ehci transfers are
> aborted.  The abort process is subtly broken.
>
> For the archives, what happens in your case is that the timeout for
> one of the transfers fires and enqueues an abort task (ehci_timeout
> in your log).  This abort task gets scheduled, tries to deactivate
> the qTDs, asks for an Interrupt on Async Advance Doorbell and goes
> to sleep (ehci_sync_hc in your log).
> Then the interrupt happens (ehci_intr1: door bell), wakes up the
> task and goes into the softinterrupt path to process the finished
> transfers.  Here the driver discovers that the transfer that timed
> out is finished (whoa!) and tries to handle it.  But since this
> transfer has been marked as TIMEOUT (ehci_idone: aborted in your
> log), it does nothing and bails.
>
> Apparently the abort task never gets rescheduled and your transfer
> is never removed from the list, certainly because the hardware
> keeps interrupting your system, so you're livelocked ;)
>
> But all of that happens because a timeout fires for one of your
> transfers; apparently some ATI controllers need one more quirk,
> as your problem looks like a dropped interrupt.  Does the diff
> below help?

Thank you very much for the detailed analysis and patch.  I'll build a
-current kernel and try it.

Would there be a complementary patch for the (above) Intel ehci controller?
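Just to be sure I follow the sequence you describe, here's how I read it,
written out as a toy model (plain C compiled outside the kernel; the names
only loosely mirror the ehci(4) ones and none of it is the actual driver
code):

#include <stdio.h>

enum xfer_status { INPROGRESS, TIMED_OUT, COMPLETED };

struct xfer {
        enum xfer_status status;
        int on_list;            /* still on the interrupt list */
};

/* Soft-interrupt side: a transfer already marked as timed out is left
 * for the abort task to finish, so it just bails ("ehci_idone: aborted"). */
static void
idone(struct xfer *x)
{
        if (x->status == TIMED_OUT)
                return;                 /* bail; xfer stays on the list */
        x->status = COMPLETED;
        x->on_list = 0;
}

int
main(void)
{
        struct xfer x = { INPROGRESS, 1 };
        int intrs = 0;

        x.status = TIMED_OUT;   /* timeout fired, abort task queued */

        /* The abort task rings the doorbell and sleeps; the doorbell
         * interrupt only runs the soft-interrupt path, nothing ever
         * reschedules the abort task, so every new interrupt finds the
         * same transfer "done" and bails again: */
        while (x.on_list && intrs < 9) {
                idone(&x);
                intrs++;
        }
        printf("after %d interrupts, still on list: %s\n",
            intrs, x.on_list ? "yes (livelock)" : "no");
        return 0;
}

If that matches what you meant, the livelock makes sense to me.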
> Index: pci/ehci_pci.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/dev/pci/ehci_pci.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 ehci_pci.c
> --- pci/ehci_pci.c	15 Apr 2013 09:23:01 -0000	1.26
> +++ pci/ehci_pci.c	10 Sep 2013 08:19:28 -0000
> @@ -204,10 +204,16 @@ ehci_pci_attach(struct device *parent, s
>  	else
>  		snprintf(sc->sc.sc_vendor, sizeof(sc->sc.sc_vendor),
>  		    "vendor 0x%04x", PCI_VENDOR(pa->pa_id));
> -
> +
>  	/* Enable workaround for dropped interrupts as required */
> -	if (sc->sc.sc_id_vendor == PCI_VENDOR_VIATECH)
> +	switch (sc->sc.sc_id_vendor) {
> +	case PCI_VENDOR_ATI:
> +	case PCI_VENDOR_VIATECH:
>  		sc->sc.sc_flags |= EHCIF_DROPPED_INTR_WORKAROUND;
> +		break;
> +	default:
> +		break;
> +	}
> 
>  	ehci_pci_takecontroller(sc, 0);
>  	r = ehci_init(&sc->sc);
> Index: usb/ehci.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/dev/usb/ehci.c,v
> retrieving revision 1.134
> diff -u -p -r1.134 ehci.c
> --- usb/ehci.c	12 Jun 2013 11:42:01 -0000	1.134
> +++ usb/ehci.c	10 Sep 2013 08:19:28 -0000
> @@ -2943,11 +2943,11 @@ ehci_timeout_task(void *addr)
>  }
> 
>  /*
> - * Some EHCI chips from VIA seem to trigger interrupts before writing back the
> - * qTD status, or miss signalling occasionally under heavy load.  If the host
> - * machine is too fast, we we can miss transaction completion - when we scan
> - * the active list the transaction still seems to be active.  This generally
> - * exhibits itself as a umass stall that never recovers.
> + * Some EHCI chips from VIA / ATI seem to trigger interrupts before writing
> + * back the qTD status, or miss signalling occasionally under heavy load.
> + * If the host machine is too fast, we can miss transaction completion - when
> + * we scan the active list the transaction still seems to be active.  This
> + * generally exhibits itself as a umass stall that never recovers.
>   *
>   * We work around this behaviour by setting up this callback after any softintr
>   * that completes with transactions still pending, giving us another chance to
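If the Intel controller in the v1 machine turns out to want the same
treatment, I assume the complementary change would simply be adding its
vendor ID to the new switch; an untested guess on my part rather than
anything you suggested:

	/* Untested guess: also enable the dropped-interrupt workaround
	 * for the Intel controller from the v1 dmesg. */
	switch (sc->sc.sc_id_vendor) {
	case PCI_VENDOR_ATI:
	case PCI_VENDOR_INTEL:	/* assumption, not in the posted diff */
	case PCI_VENDOR_VIATECH:
		sc->sc.sc_flags |= EHCIF_DROPPED_INTR_WORKAROUND;
		break;
	default:
		break;
	}

Though blanket-enabling it for every Intel EHCI is probably too broad, so
I'll wait for your take before trying anything like that on the v1 box.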