On Sun, Jan 23, 2022 at 06:08:17PM -0000, Jasper wrote:
>
> For my understanding, librtlsdr uses
> libusb_handle_events_timeout_completed which should terminate early
> if the dev->async_cancel flag is set. Is there still a purpose to
> call libusb_interrupt_event_handler in this case, or is it required
> to guarantee the handler inspects the flag immediately, before any
> timeout or a new event (i.e. like a notification)? That did not
> become clear to me from the libusb documentation.
Just passing libusb_handle_events_timeout_completed() a pointer to the
flag doesn't give it any way to be notified or interrupted when that
flag changes. It only gives it the ability to look at the flag when it
chooses to - which is when it has just handled some event and is
deciding whether to wait for another.

Calling libusb_interrupt_event_handler() will interrupt whatever
poll() or similar call libusb_handle_events_timeout_completed() is
blocked in, causing it to inspect the flag and return if it's set.
Otherwise, the flag will only be inspected on the next iteration of
the event loop, i.e. when some USB event occurs or a timeout expires.
In practice that may happen soon anyway, but I still think it makes
sense to interrupt the handler, to ensure there is no unnecessary
delay (see the first sketch below).

> The easiest way I can see now to assess whether we can safely free
> up the transfers is to count the live transfers (increase a counter
> by one if there was a successful resubmit) and reduce by one at the
> start of a libusb callback. It is then safe to terminate if the
> counter is zero.

The counting approach seems reasonable to me. In my case I kept a
boolean flag for each individual transfer, and signalled completion
once all of the transfers were flagged as finished. That was partly
because, when debugging, I wanted to be able to see the state of each
transfer. But in retrospect a count suffices and is much simpler, so I
think I'll probably adopt that approach too.

> I have implemented that here:
> https://github.com/jvde-github/rtl-sdr/commit/325d47f1f3c1d9fbece9877034828355748ddd5c

I've looked at this version and I see one potential failure mode. It's
the same as one of the problems I tackled in libhackrf - the one I
previously said you were immune to, because cancellation and event
handling happen in the same thread. On closer inspection, though, I
think it can still go wrong, as follows:

When libusb_cancel_transfer() is called, it's possible that one of the
transfers has just completed at the kernel level, but not yet had its
completion callback called, because the event handling was interrupted
before that happened. In this case libusb_cancel_transfer() returns
LIBUSB_ERROR_NOT_FOUND, because that transfer is no longer available
to be cancelled.

You don't check the result of libusb_cancel_transfer(). But even if
you did, there isn't anything very helpful you can do with the result
at that stage. The real problem occurs later - read on.

Next, when libusb_handle_events_timeout_completed() is called again at
line 1930, to await the cancellation callbacks, the completion of that
transfer is still pending and now gets processed - so
_libusb_callback() gets called.

The status seen by _libusb_callback() will be
LIBUSB_TRANSFER_COMPLETED, so the transfer gets resubmitted and the
transfer count reincremented. This then repeats indefinitely: there is
no logic in the callback to prevent a completed transfer from being
resubmitted during shutdown, so the same transfer is constantly
resubmitted and re-run, and live_transfers never reaches zero.

My suggested fix is to check dev->async_cancel in the callback, and
not resubmit a transfer if that flag is set (see the second sketch
below).
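To make the wake-up point concrete, here is a rough first sketch.
It's untested; dev->ctx and dev->async_cancel follow your commit, and
the stand-in struct and function names are purely illustrative, not
librtlsdr's actual code:

    #include <stdint.h>
    #include <libusb-1.0/libusb.h>

    /* Stand-in for librtlsdr's private struct rtlsdr_dev - only the
     * fields these sketches need, with assumed names. */
    typedef struct {
        libusb_context *ctx;
        int async_cancel;    /* set to request shutdown */
        int live_transfers;  /* count of currently-submitted transfers */
        void (*cb)(unsigned char *buf, uint32_t len, void *cb_ctx);
        void *cb_ctx;
    } rtlsdr_dev_t;

    /* Reader thread: runs the event loop until the cancel flag is
     * set. libusb re-checks *completed (&dev->async_cancel here)
     * whenever the handler wakes up, and returns early once it is
     * non-zero. */
    static void event_loop(rtlsdr_dev_t *dev)
    {
        struct timeval tv = { 1, 0 };

        while (!dev->async_cancel)
            libusb_handle_events_timeout_completed(dev->ctx, &tv,
                                                   &dev->async_cancel);
    }

    /* Cancelling thread: set the flag first, then wake the handler,
     * so the flag is inspected immediately rather than at the next
     * USB event or timeout expiry. */
    static void cancel_async(rtlsdr_dev_t *dev)
    {
        dev->async_cancel = 1;
        libusb_interrupt_event_handler(dev->ctx);
    }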
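And here is the second sketch: the shape of the callback fix,
combining your live-transfer count with the async_cancel check, and
reusing the stand-in struct above. Again this is a sketch rather than
a drop-in patch - live_transfers is my stand-in name for your counter,
and cb/cb_ctx stand for the existing user-callback plumbing:

    /* Completion callback, run from inside the event handler; xfer
     * was submitted with user_data pointing at the device struct. */
    static void LIBUSB_CALL _libusb_callback(struct libusb_transfer *xfer)
    {
        rtlsdr_dev_t *dev = (rtlsdr_dev_t *)xfer->user_data;

        /* Whatever happens next, this transfer is no longer in
         * flight. */
        dev->live_transfers--;

        /* Resubmit only on success, and only while not shutting
         * down. Without the async_cancel check, a transfer that
         * completed just before libusb_cancel_transfer() was called
         * (making it return LIBUSB_ERROR_NOT_FOUND) gets resubmitted
         * here, and live_transfers never reaches zero. */
        if (xfer->status == LIBUSB_TRANSFER_COMPLETED &&
            !dev->async_cancel) {
            if (dev->cb)
                dev->cb(xfer->buffer, xfer->actual_length, dev->cb_ctx);
            if (libusb_submit_transfer(xfer) == 0)
                dev->live_transfers++;
        }

        /* Once live_transfers reaches zero, every transfer has either
         * been cancelled or completed without resubmission, so the
         * shutdown path can safely free the transfers and return. */
    }

The useful property here is that once async_cancel is set the count
can only fall, so the shutdown path just keeps running the event loop
until it reaches zero, and only then frees the transfers.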
> Seems to work really well but given the various versions of libusb
> and supported platforms around, I am hesitant to submit as a patch
> until it solves a reported problem.

I am very much of the opinion that once concurrency issues are
identified, they want fixing proactively, even if the potential
failures aren't being observed in practice. When failures do occur
they can be extremely hard to track down, because of the timing
dependency. It has taken multiple reports of intermittent,
hard-to-reproduce problems, over a long period, for us to zero in on
these issues in libhackrf. We got to the point where only one diligent
reporter could reproduce a failure, and it seemed to show up only with
their particular distro, hardware, software versions and so on.
Eventually I managed to reproduce it myself, but still had to drill
down through all the layers of gqrx, GNU Radio, osmosdr and libhackrf
to figure out what was going wrong.

Having spotted these problems, I say let's fix them properly while we
have the chance.

Regards,
Martin