On 12/11/15 12:16, Hans Petter Selasky wrote:
On 12/11/15 11:12, Alexander V. Chernikov wrote:
11.12.2015, 12:15, "Hans Petter Selasky" <h...@selasky.org>:
Hi,
Pulling the nail out of the haystack hopefully.
Any ideas on where next to look?
Adrian: In your dump aswell I see:
la_flags = 1
That means there was a race calling arptimer() and removing the "lle".
Yes. The interesting part here is why lle is removed. There are quite
a few reasons: either interface address deleted or interface going
down, or explicit delete request.
That's why I asked Adrian about interface stuff (and haven't got a
reply).
Alexander: Can you comment on the following patch:
> Index: netinet/if_ether.c
> ===================================================================
> --- netinet/if_ether.c (revision 291256)
> +++ netinet/if_ether.c (working copy)
> @@ -185,7 +185,13 @@
> LLE_WUNLOCK(lle);
> return;
> }
> - ifp = lle->lle_tbl->llt_ifp;
> + if (lle->la_flags & LLE_LINKED) {
> + ifp = lle->lle_tbl->llt_ifp;
> + } else {
> + /* XXX RACE entry has been freed */
> + llentry_free(lle);
> + return;
> + }
> CURVNET_SET(ifp->if_vnet);
>
> if ((lle->la_flags & LLE_DELETED) == 0) {
We need a check in arptimer() that the lle is still linked before
Yes, I had exactly that approach in mind. (And nd6_llinfo_timer()
needs the same fix).
So, would you commit it or should I?
proceeding, in there from what I can see. Because the callback is not
protected by a mutex, it is not atomically stopped by callout_stop().
Hi,
Talking to Randall offlist, I see some more trouble. Let's get
everything straight before making a fix. There is one more race I see:
The start of the arptimer() callback looks like this:
> static void
> arptimer(void *arg)
> {
POINT0
> LLE_WLOCK(lle);
> if (callout_pending(&lle->lle_timer)) {
POINT1
> LLE_WUNLOCK(lle);
> return;
> }
The code starting the callback looks like this:
> LLE_ADDREF(la);
> la->la_expire = time_uptime;
> canceled = callout_reset(&la->lle_timer, hz *
V_arpt_down,
> arptimer, la);
> if (canceled)
> LLE_REMREF(la);
Which can be written like this:
> la->la_expire = time_uptime;
> canceled = callout_reset(&la->lle_timer, hz * V_arpt_down,
> arptimer, la);
> if (canceled == 0)
> LLE_ADDREF(la);
In case we are at POINT0 in arptimer, callout_reset() will not be able
to cancel the callout and will return 0. We should also drop one ref at
POINT1, or rewrite the code a bit, which might need Randall's help in
the callout subsystem area.
Hi,
Looking at the version history, I see Gleb Smirnoff and Randall, heavily
involved with these code paths. Gleb and Randall, do you have any
comments on the above findings?
Gleb+Randall: Do you agree or disagree there is a race as pointed out above?
Ways forward:
1) Revert r278472 (done by Randall) and fix r238990 (done by Gleb) to
use the new callout_async_drain() instead of callout_stop() to avoid
using the rm lock after free.
2) Use callout_stop() before callout_reset() and add a reference when
the callout was not pending nor completing. We need to use
callout_stop() in this case because callout_reset() only has two return
values and cannot be used to distinguish the three callout states in use.
3) Modify callout_reset() to have three return values and fix the
arptimer code to not leak references.
--HPS
_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"