On Sun, Jan 23, 2022 at 06:08:17PM -0000, Jasper wrote:
>
> For my understanding, librtlsdr uses
> libusb_handle_events_timeout_completed which should terminate early
> if the dev->async_cancel flag is set. Is there still a purpose to
> call libusb_interrupt_event_handler in this case, or is it required
> to guarantee the handler inspects the flag immediately, before any
> timeout or a new event (i.e. like a notification)? That did not
> become clear to me from the libusb documentation.
Just passing libusb_handle_events_timeout_completed() a pointer to the
flag doesn't give it any way to be notified or interrupted when that
flag changes. It only gives it the ability to look at the flag when it
chooses to - which is when it has just handled some event and is
deciding whether to wait for another.

Calling libusb_interrupt_event_handler() will interrupt whatever
poll() or similar call libusb_handle_events_timeout_completed() is
blocked in, causing it to inspect the flag and return if it's set.
Otherwise, the flag will only be inspected on the next iteration of
the event loop, i.e. when some USB event occurs or a timeout expires.
In practice that may happen soon anyway, but I still think it makes
sense to interrupt the handler, to ensure there is no unnecessary
delay (see the first sketch below).

> The easiest way I can see now to assess whether we can safely free
> up the transfers is to count the live transfers (increase a counter
> by one if there was a successful resubmit) and reduce by one at the
> start of a libusb callback. It is then safe to terminate if the
> counter is zero.

The counting approach seems reasonable to me. In my case I kept a
boolean flag for each individual transfer, and signalled completion
once all of the transfers were flagged as finished. That was partly
because, when debugging, I wanted to be able to see the state of each
transfer. But in retrospect a count suffices and is much simpler, so I
think I'll probably adopt that approach too.

> I have implemented that here:
> https://github.com/jvde-github/rtl-sdr/commit/325d47f1f3c1d9fbece9877034828355748ddd5c

I've looked at this version and I see one potential failure mode. It's
the same as one of the problems I tackled in libhackrf - the one I
previously said you were immune to, because cancellation and event
handling happen in the same thread. On closer inspection, though, I
think it can still go wrong, as follows:

When libusb_cancel_transfer() is called, it's possible that one of the
transfers has just completed at the kernel level, but not yet had its
completion callback called, because the event handling was interrupted
before that happened. In this case libusb_cancel_transfer() returns
LIBUSB_ERROR_NOT_FOUND, because that transfer is no longer available
to be cancelled.

You don't check the result of libusb_cancel_transfer(). But even if
you did, there isn't anything very helpful you can do with the result
at that stage. The real problem occurs later - read on.

Next, when libusb_handle_events_timeout_completed() is called again at
line 1930, to await the cancellation callbacks, the completion of that
transfer is still pending and now gets processed - so
_libusb_callback() gets called.

The status seen by _libusb_callback() will be
LIBUSB_TRANSFER_COMPLETED, so the transfer gets resubmitted and the
transfer count reincremented. This then repeats indefinitely: there is
no logic in the callback to prevent a completed transfer from being
resubmitted during shutdown, so the same transfer is constantly
resubmitted and re-run, and live_transfers never reaches zero.

My suggested fix is to check dev->async_cancel in the callback, and
not resubmit a transfer if that flag is set (see the second sketch
below).
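To make the wake-up point concrete, here is a rough first sketch.
It's untested; dev->ctx and dev->async_cancel follow your commit, and
the stand-in struct and function names are purely illustrative, not
librtlsdr's actual code:

    #include <stdint.h>
    #include <libusb-1.0/libusb.h>

    /* Stand-in for librtlsdr's private struct rtlsdr_dev - only the
     * fields these sketches need, with assumed names. */
    typedef struct {
        libusb_context *ctx;
        int async_cancel;    /* set to request shutdown */
        int live_transfers;  /* count of currently-submitted transfers */
        void (*cb)(unsigned char *buf, uint32_t len, void *cb_ctx);
        void *cb_ctx;
    } rtlsdr_dev_t;

    /* Reader thread: runs the event loop until the cancel flag is
     * set. libusb re-checks *completed (&dev->async_cancel here)
     * whenever the handler wakes up, and returns early once it is
     * non-zero. */
    static void event_loop(rtlsdr_dev_t *dev)
    {
        struct timeval tv = { 1, 0 };

        while (!dev->async_cancel)
            libusb_handle_events_timeout_completed(dev->ctx, &tv,
                                                   &dev->async_cancel);
    }

    /* Cancelling thread: set the flag first, then wake the handler,
     * so the flag is inspected immediately rather than at the next
     * USB event or timeout expiry. */
    static void cancel_async(rtlsdr_dev_t *dev)
    {
        dev->async_cancel = 1;
        libusb_interrupt_event_handler(dev->ctx);
    }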
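And here is the second sketch: the shape of the callback fix,
combining your live-transfer count with the async_cancel check, and
reusing the stand-in struct above. Again this is a sketch rather than
a drop-in patch - live_transfers is my stand-in name for your counter,
and cb/cb_ctx stand for the existing user-callback plumbing:

    /* Completion callback, run from inside the event handler; xfer
     * was submitted with user_data pointing at the device struct. */
    static void LIBUSB_CALL _libusb_callback(struct libusb_transfer *xfer)
    {
        rtlsdr_dev_t *dev = (rtlsdr_dev_t *)xfer->user_data;

        /* Whatever happens next, this transfer is no longer in
         * flight. */
        dev->live_transfers--;

        /* Resubmit only on success, and only while not shutting
         * down. Without the async_cancel check, a transfer that
         * completed just before libusb_cancel_transfer() was called
         * (making it return LIBUSB_ERROR_NOT_FOUND) gets resubmitted
         * here, and live_transfers never reaches zero. */
        if (xfer->status == LIBUSB_TRANSFER_COMPLETED &&
            !dev->async_cancel) {
            if (dev->cb)
                dev->cb(xfer->buffer, xfer->actual_length, dev->cb_ctx);
            if (libusb_submit_transfer(xfer) == 0)
                dev->live_transfers++;
        }

        /* Once live_transfers reaches zero, every transfer has either
         * been cancelled or completed without resubmission, so the
         * shutdown path can safely free the transfers and return. */
    }

The useful property here is that once async_cancel is set the count
can only fall, so the shutdown path just keeps running the event loop
until it reaches zero, and only then frees the transfers.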
> Seems to work really well but given the various versions of libusb
> and supported platforms around, I am hesitant to submit as a patch
> until it solves a reported problem.

I am very much of the opinion that once concurrency issues are
identified, they want fixing proactively, even if the potential
failures aren't being observed in practice. When failures do occur
they can be extremely hard to track down, because of the timing
dependency. It has taken multiple reports of intermittent,
hard-to-reproduce problems, over a long period, for us to zero in on
these issues in libhackrf. We got to the point where only one diligent
reporter could reproduce a failure, and it seemed to show up only with
their particular distro, hardware, software versions and so on.
Eventually I managed to reproduce it myself, but still had to drill
down through all the layers of gqrx, GNU Radio, osmosdr and libhackrf
to figure out what was going wrong.

Having spotted these problems, I say let's fix them properly while we
have the chance.

Regards,
Martin