On Fri, Jan 01, 2016 at 07:20:34PM +0000, Ben Hutchings wrote: > Control: tag -1 moreinfo > > On Fri, 2015-12-25 at 04:26 +1030, Ron wrote: > > Source: linux > > Version: 3.16.7-ckt20-1 > > Severity: important > > > > Hi, > > > > While testing the code from here: http://www.bitbabbler.org I ran into a > > problem on recently updated Jessie systems where devices plugged into an > > XHCI port would always time out between transfers when the transfer size > > was larger than 16kB. And since it wasn't doing that a few weeks ago, I > > went digging :) > > > > All the following was tested on amd64. > > > > I can confirm that 3.16.7-ckt17-1 does not have this problem and it's a > > newly introduced regression that's first seen with the ckt20 patch set. > > I can also confirm that 4.2.0-0.bpo.1 from jessie-backports does not > > have this problem. > > > > The cause appears to be a botched backport of > > e210c422b6fdd2dc123bedc588f399aefd8bf9de (from Linus' repo) because I can > > also confirm that a locally built kernel from the 3.16.7-ckt20-1 source > > with only that patch reverted also does not have this problem. > > Are you seeing any error messages in the kernel log? Do they look > anything like those in <https://bugs.debian.org/809430>? If not, > please send them.
No, nothing outwardly complained to the logs. The first symptom that I saw, was after updating to ckt20, devices in XHCI ports dropped to a ridiculously slow transfer rate (devices ECHI ports were unaffected). Debug tracing the userspace code showed that transfers were frequently just timing out (with a 5 second timeout set for libusb requests). I'd get one transfer succeed, then the next 2 requests would time out before another successful transfer would occur. lather, rinse, repeat. This is only happening with 'large' read requests though. If I cap the transfer request size to 16kB or less, then it appears to work normally, but anything larger than that hits this problem. I tested that because I know there was previously a problem with XHCI controllers and large transfers, which was addressed in this patch to the kernel 19181bc50e1b8e92a7a3b3d78637c6dc5c0b5a1b, and corresponding patches to libusb to deal with the fact it couldn't rely on the bulk continuation flag working with those. But those had fixed this up until the ckt20 kernel. > > If it's really important that we have that patch for some reason, then > > it looks like (at least) 40a3b775f49c2784c96b19170fd2478e5e5511a1 would > > also need to be backported too, and maybe some other changes that effect > > this as well. > [...] > > Have you tried applying that on top of the current stable kernel > source? It doesn't look like a dependency to me. I haven't no. I figured the first question was whether we could simply revert this one, since I couldn't find any report of a bug that it fixed in 3.16 ... My suspicion there came from the fact that the patch we have applied triggers under different conditions to what it would if that prior patch had also been backported. In the original source, the patch involved code which looks like: if (trb_comp_code == COMP_STOP_SHORT) { ... } else if (event_trb == td->last_trb) { + if (td->urb_length_set && trb_comp_code == COMP_SHORT_TX) While the backported version was massaged to look like: if (trb_comp_code == COMP_STOP_SHORT) ... if (event_trb == td->last_trb) { + if (td->urb_length_set && trb_comp_code == COMP_SHORT_TX) With the difference coming from that earlier 40a3b7 patch. I went over all the xhci related patches between ckt17 and ckt20, and that was the most immediately suspicious thing that caught my eye, and when reverting it fixed the problem, I stopped there to get your input. There's a few other non-trivial patches in the upstream kernel tree between those two, and without spending some serious time to get familiar with this code and analyse how they are all related in detail, I didn't feel confident to "just throw a few extra backported patches" at it - since that could well break something else that I'm not testing even if it fixes the bug I'm seeing. I can test if sticking the 'else' back in there changes this behaviour (which I suspect it might) - but I'm not yet comfortably convinced there is nothing in the intervening patches this might not also rely on (or even that doing that would really fix 'everything' here). But it's an easy test and a small patch, so I'll run it and see what happens. (and I'll run some tests on 4.3.3-2 as well after that, since the only machine I currently have running it has everything bound to EHCI ports - but #809430, looks different to what I'm seeing. I think in my case, things are just stalling by not completing the TD when it really is in fact completed ... while that bug looks like a much more complete failure of something somewhere). Cheers, Ron