On 21.12.2016 08:17, Lu Baolu wrote:
Hi Mathias,
I have some comments for the implementation of xhci_abort_cmd_ring() below.
On 12/20/2016 11:13 PM, Mathias Nyman wrote:
On 20.12.2016 09:30, Baolin Wang wrote:
...
Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
Its based on 4.9
It includes a few other patches just to avoid conflicts and make my life easier
Interesting patches are:
ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9
The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.
Still completely untested, (well it compiles)
Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.
All additional testing is welcome, especially if you can trigger timeouts
and races
-Mathias
Below is the latest code. I put my comments in line.
322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
323 {
324 u64 temp_64;
325 int ret;
326
327 xhci_dbg(xhci, "Abort command ring\n");
328
329 reinit_completion(&xhci->cmd_ring_stop_completion);
330
331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
333 &xhci->op_regs->cmd_ring);
We should hold xhci->lock when we are modifying xhci registers
at runtime.
Makes sense, but we need to unlock it before sleeping or waiting for completion.
I need to look into that in more detail.
But this was an issue already before these changes.
The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.
Yes it can be cleaned up as well, but the two cases are a bit different.
The cleaned up one was about command ring not starting again after it was
stopped.
This second try is a workaround for what we thought was the command ring failing
to stop in the first place, but is most likely due to the race that OGAWA
Hirofumi
fixed. It races if the stop command ring interrupt happens between writing the
abort
bit and polling for the ring stopped bit. The interrupt hander may start the
command
ring again, and we would believe we failed to stop it in the first place.
This race could probably be fixed by just extending the lock (and preventing
interrupts) to cover both writing the abort bit and polling for the command ring
running bit, as you pointed out here previously.
But then again I really like OGAWA Hiroumi's solution that separates the
command ring stopping from aborting commands and restarting the ring.
The current way of always restarting the command ring as a response to
a stop command ring command really limits its usage.
So, with this in mind most reasonable would be to
1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
3. remove unnecessary second abort try as a separate patch, send to usb-next
4. remove polling for the Command ring running (CRR), waiting for completion
is enough, if completion times out then we can check CRR. for usb-next
I'll fix the typos these patches would introduce. Fixing old typos can be done as separate
patches later.
-Mathias