On Fri, 12 Sep 2014 11:31:46 -0400
Alan Stern <[email protected]> wrote:
> On Thu, 11 Sep 2014, Joe Lawrence wrote:
>
> > Hi Alan,
> >
> > I've got another USB bug to report that manifests during automated
> > device removal testing on RHEL7. This one hits the BUG() inside
> > qh_destroy:
>
> How reliably can you trigger this bug?

I have collected a few crashes over the past few days, so it reproduces
fairly often.

> > static void qh_destroy(struct ehci_hcd *ehci, struct ehci_qh *qh)
> > {
> >         /* clean qtds first, and know this is not linked */
> >         if (!list_empty (&qh->qtd_list) || qh->qh_next.ptr) {
> >                 ehci_dbg (ehci, "unused qh not empty!\n");
> >                 BUG ();
> >         }
>
> > and finally a dump of the ehci_qh in question:
> >
> > crash> struct ehci_qh ffff88084b84dc80
> > struct ehci_qh {
> > hw = 0xffff880078d1a000,
>
> It would be good to see the contents of the ehci_qh_hw structure. That
> would tell us what device and endpoint this QH was for.

crash> struct ehci_qh_hw 0xffff880078d1a000
struct ehci_qh_hw {
  hw_next = 0x78d1a062,
  hw_info1 = 0x8000,
  hw_info2 = 0x0,
  hw_current = 0x0,
  hw_qtd_next = 0x1,
  hw_alt_next = 0x78d22000,
  hw_token = 0x40,
  hw_buf = {0x0, 0x0, 0x0, 0x0, 0x0},
  hw_buf_hi = {0x0, 0x0, 0x0, 0x0, 0x0}
}
> > qh_dma = 0x78d1a000,
> > qh_next = {
> > qh = 0xffff88084efe6730,
> > itd = 0xffff88084efe6730,
> > sitd = 0xffff88084efe6730,
> > fstn = 0xffff88084efe6730,
> > hw_next = 0xffff88084efe6730,
> > ptr = 0xffff88084efe6730 << !NULL
> > },
> > qtd_list = { << list_empty
> > next = 0xffff88084b84dc98,
> > prev = 0xffff88084b84dc98
> > },
> > intr_node = {
> > next = 0x0,
> > prev = 0x0
> > },
> > dummy = 0xffff880078d22000,
> > unlink_node = {
> > next = 0xffff88084b84dcc0,
> > prev = 0xffff88084b84dcc0
> > },
> > unlink_cycle = 0x0,
> > qh_state = 0x1, << QH_STATE_LINKED
> ...
> > }
> >
> > The qtd_list is empty: its next and prev pointers both point back to
> > the list head itself.
> >
> > crash> struct -o ehci_qh | grep td_list
> > [0x18] struct list_head qtd_list;
> > crash> p/x 0xffff88084b84dc80 + 0x18
> > $1 = 0xffff88084b84dc98
> >
> > but qh->qh_next.ptr is !NULL, so we hit the BUG. However, it seems that
> > the memory at qh->qh_next.ptr has been freed:
>
> > I'm not too familiar with the USB code stack, so any suggestions on
> > instrumentation that I can add to aid in debugging would be helpful.
> > Maybe some tracing in qh_link_async / single_unlink_async /
> > end_unlink_async /qh_link_periodic can reveal the sequence that is
> > leaving this dangling qh_next.ptr?
>
> The place to look is ehci_endpoint_disable. Did that routine get
> called for this QH? Did it hit the default case of the big switch
> statement (with its ehci_err statement)?

I'm not sure a crash dump retains enough residual state to determine
whether ehci_endpoint_disable ran. However, the QH that qh_destroy was
handling did *not* have the exception bit set. (See the first mail for
the structure dump.)

Would it be reasonable to add printk debugging messages to
ehci_endpoint_disable to trace the QH in question and its qh_state?
> > Note: This does bear some resemblance to a bug that Stratus hit a few
> > years ago [1] [2], however enough of the code has changed that I'm not
> > sure the fix for that one would apply to a modern kernel.
>
> What version of the driver are you currently running?

The driver is built into a slightly modified RHEL7
3.10.0-123.6.3.el7.x86_64 kernel.

Regards,

-- Joe