(Re-cc freebsd-net, because this is useful information)
On 27 Mar 2018, at 13:07, Reshad Patuck wrote:
The epair crash occurred again today running the epair module code
with the added dtrace sdt providers.
Running the same command as last time, 'dtrace -n ::epair\*:' returns
the following:
```
CPU ID FUNCTION:NAME
…
0 66499 epair_transmit_locked:enqueued
```
Looks like it's filled up a queue somewhere and is dropping connections
after that.
The value of 'error' is 55. I can see both the ifp and m structs
but don't know what to look for in them.
That’s useful. Error 55 is ENOBUFS, which in IFQ_ENQUEUE() means
we’re hitting _IF_QFULL().
There don’t seem to be counters for that drop though, so that makes it
hard to diagnose without these extra probe points.
It also explains why you don’t really see any drop counters
incrementing.
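As a rough illustration of that failure mode (a user-space toy, not the
kernel's IFQ macros), a bounded queue that refuses new packets with
ENOBUFS once it is full could look like the sketch below. The toy_ifq
names are hypothetical, and the drops field stands in for the kind of
counter that appears to be missing on this path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model of a bounded interface send queue. */
struct toy_ifq {
	size_t len;           /* packets currently queued */
	size_t maxlen;        /* capacity of the queue */
	unsigned long drops;  /* illustrative drop counter (assumption) */
};

void
toy_ifq_init(struct toy_ifq *q, size_t maxlen)
{
	assert(maxlen > 0);
	q->len = 0;
	q->maxlen = maxlen;
	q->drops = 0;
}

/*
 * Try to queue one packet. Returns 0 on success, or ENOBUFS (55 on
 * FreeBSD, a different number on some other systems) when the queue
 * is already full. The packet is dropped in that case.
 */
int
toy_ifq_enqueue(struct toy_ifq *q)
{
	if (q->len >= q->maxlen) {
		q->drops++;
		return (ENOBUFS);
	}
	q->len++;
	return (0);
}

/* Remove one packet. Returns 1 if a packet was removed, 0 if empty. */
int
toy_ifq_dequeue(struct toy_ifq *q)
{
	if (q->len == 0)
		return (0);
	q->len--;
	return (1);
}
```

With a capacity of 2, the third enqueue in a row returns ENOBUFS, and
enqueues succeed again only after something dequeues.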
The fact that this queue is full presumably means that the other side is
not reading packets off it any more.
That’s supposed to happen in epair_start_locked() (look for the
IFQ_DEQUEUE() calls).
It’s not at all clear to me how, but it looks like the receive side is
not doing its work.
It looks like the IFQ code is already a fallback for when the netisr
queue is full.
That code might be broken, or there might be a different issue that will
just mean you’ll always end up in the same situation, regardless of
queue size.
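To make the "regardless of queue size" point concrete, here is a small
user-space model, assuming a heavily simplified picture of the path: a
first queue standing in for the netisr queue, a second one standing in
for the fallback IFQ, and a receive side that may or may not be
running. All toy_path names are hypothetical and none of this is the
real if_epair code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical two-stage model: netisr queue first, IFQ as fallback. */
struct toy_path {
	size_t isr_len, isr_max;  /* stands in for the netisr queue */
	size_t ifq_len, ifq_max;  /* stands in for the fallback IFQ */
	int rx_running;           /* nonzero while the receiver drains */
};

void
toy_path_init(struct toy_path *p, size_t isr_max, size_t ifq_max, int rx)
{
	assert(isr_max > 0 && ifq_max > 0);
	p->isr_len = p->ifq_len = 0;
	p->isr_max = isr_max;
	p->ifq_max = ifq_max;
	p->rx_running = rx;
}

/* Queue one packet: first stage, then fallback, else ENOBUFS. */
int
toy_path_transmit(struct toy_path *p)
{
	if (p->isr_len < p->isr_max) {
		p->isr_len++;
		return (0);
	}
	if (p->ifq_len < p->ifq_max) {
		p->ifq_len++;
		return (0);
	}
	return (ENOBUFS);
}

/* Receive side: consume one packet, then refill from the fallback. */
void
toy_path_receive(struct toy_path *p)
{
	if (!p->rx_running)
		return;
	if (p->isr_len > 0)
		p->isr_len--;
	if (p->ifq_len > 0 && p->isr_len < p->isr_max) {
		p->ifq_len--;
		p->isr_len++;
	}
}

/* Alternate transmit and receive; count packets accepted before ENOBUFS. */
size_t
toy_path_sends_until_full(struct toy_path *p, size_t limit)
{
	size_t sent = 0;

	while (sent < limit) {
		if (toy_path_transmit(p) != 0)
			break;
		sent++;
		toy_path_receive(p);
	}
	return (sent);
}
```

In this model a stalled receiver always ends in ENOBUFS after exactly
isr_max + ifq_max packets, so larger queues only postpone the failure,
while a working receiver never fills either queue at this rate.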
It’s probably worth trying to play with
‘net.route.netisr_maxqlen’. I’d recommend *lowering* it, to see if
the problem happens more frequently that way. If it does, it’ll be
helpful in reproducing and trying to fix this. If it doesn’t, the full
queue is probably a consequence rather than a cause/trigger.
(Of course, once you’ve confirmed that lowering netisr_maxqlen
makes the problem more frequent, go ahead and increase it.)
Regards,
Kristof