On 18.11.2012 18:24, Barney Cordoba wrote:


--- On Thu, 1/19/12, John Baldwin <j...@freebsd.org> wrote:

From: John Baldwin <j...@freebsd.org>
Subject: Latency issues with buf_ring
To: n...@freebsd.org
Cc: "Ed Maste" <ema...@freebsd.org>, "Navdeep Parhar" <n...@freebsd.org>
Date: Thursday, January 19, 2012, 11:41 AM
The current buf_ring usage in various NIC drivers has a race that can
result in really high packet latencies in some cases.  Specifically, the
common pattern in an if_transmit routine is to use a try-lock on the
queue and, if that fails, enqueue the packet in the buf_ring and return.
The race, of course, is that the thread holding the lock might have just
finished checking the buf_ring and found it empty and be in the process
of releasing the lock when the original thread fails the try-lock.  If
this happens, then the packet queued by the first thread will be stalled
until another thread tries to transmit packets for that queue.  Some
drivers attempt to handle this race (igb(4) schedules a task to kick the
transmit queue if the try-lock fails) and others don't (cxgb(4) doesn't
handle it at all).  At work this race was triggered very often after
upgrading from 7 to 8 with bursty traffic and caused numerous problems,
so it is not a rare occurrence and needs to be addressed.
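
For concreteness, here is a minimal sketch of that pattern as I
understand it (struct foo_txq, foo_pick_queue() and foo_start_locked()
are made-up names, not taken from any particular driver):

/*
 * Illustrative per-driver transmit queue; only the fields used in the
 * sketches below are shown.
 */
struct foo_txq {
	struct mtx	 txq_mtx;	/* queue lock */
	struct buf_ring	*txq_br;	/* software ring (drbr) */
};

static int
foo_mq_start(struct ifnet *ifp, struct mbuf *m)
{
	struct foo_txq *txq;
	int error;

	txq = foo_pick_queue(ifp, m);	/* e.g. hash on the flowid */
	if (mtx_trylock(&txq->txq_mtx)) {
		/* We own the queue: enqueue and drain it ourselves. */
		error = drbr_enqueue(ifp, txq->txq_br, m);
		if (error == 0)
			foo_start_locked(txq);
		mtx_unlock(&txq->txq_mtx);
		return (error);
	}

	/*
	 * Someone else holds the lock; just park the packet in the
	 * buf_ring.  The race: the lock holder may already have seen an
	 * empty ring and be about to unlock, in which case this packet
	 * sits here until the next transmit attempt on this queue.
	 */
	return (drbr_enqueue(ifp, txq->txq_br, m));
}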

(Note, all patches mentioned are against 8)

The first hack I tried to use was to simply always lock the queue after
the drbr enqueue if the try-lock failed and then drain the queue if
needed (www.freebsd.org/~jhb/patches/try_fail.patch).  While this fixed
my latency problems, it would seem that this breaks other workloads that
the drbr design is trying to optimize.
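
In code terms, the fallback after a failed try-lock becomes a blocking
lock plus drain, roughly like this (my sketch of the idea, not the patch
itself, reusing the made-up names from above):

	/* In foo_mq_start(), after the try-lock fails: */
	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error != 0)
		return (error);
	/* Always take the blocking lock and drain if anything is queued. */
	mtx_lock(&txq->txq_mtx);
	if (!drbr_empty(ifp, txq->txq_br))
		foo_start_locked(txq);
	mtx_unlock(&txq->txq_mtx);
	return (0);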

After further hacking, what I came up with was a variant of
drbr_enqueue() that atomically sets a 'pending' flag during the enqueue
operation.  The first thread to fail the try-lock sets this flag (it is
told that it set the flag by a new return value (EINPROGRESS) from the
enqueue call).  The pending thread then explicitly clears the flag once
it acquires the queue lock.  This should prevent multiple threads from
stacking up on the queue lock, so that if multiple threads are dumping
packets into the ring concurrently, all but two (the one currently
draining the queue and the one waiting for the lock) can simply enqueue
and continue.  One downside of this approach, though, is that each
driver has to be changed to make an explicit call to clear the pending
flag after grabbing the queue lock if the try-lock fails.  This is what
I am currently running in production
(www.freebsd.org/~jhb/patches/try_fail3.patch).
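
A sketch of the driver side of this scheme follows; drbr_enqueue_pending()
and drbr_clear_pending() are placeholder names for the new enqueue
variant and the explicit clear call, not necessarily how the patch
spells them:

	if (mtx_trylock(&txq->txq_mtx)) {
		error = drbr_enqueue(ifp, txq->txq_br, m);
		if (error == 0)
			foo_start_locked(txq);
		mtx_unlock(&txq->txq_mtx);
		return (error);
	}

	/* Try-lock failed: use the enqueue variant that may set 'pending'. */
	error = drbr_enqueue_pending(ifp, txq->txq_br, m);
	if (error == EINPROGRESS) {
		/*
		 * We were the first to fail the try-lock, so we own the
		 * 'pending' flag: block on the lock, clear the flag and
		 * drain the ring.
		 */
		mtx_lock(&txq->txq_mtx);
		drbr_clear_pending(txq->txq_br);
		foo_start_locked(txq);
		mtx_unlock(&txq->txq_mtx);
		return (0);
	}
	/* 0 means another thread is already pending; otherwise a real error. */
	return (error);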

However, this still results in a lot of duplicated code in each driver
that wants to support multiq.  Several folks have expressed a desire to
move in a direction where the stack has explicit knowledge of transmit
queues, allowing us to hoist some of this duplicated code out of the
drivers and up into the calling layer.  After discussing this a bit with
Navdeep (np@), the approach I am looking at is to alter the buf_ring
code flow a bit to more closely model the older code flow with IFQ and
if_start methods.  That is, have the if_transmit methods always enqueue
each packet that arrives to the buf_ring and then call an if_start-like
method that drains a specific transmit queue.  This approach simplifies
a fair bit of driver code and means we can potentially move the enqueue,
etc. bits up into the calling layer and instead have drivers provide the
per-transmit-queue start routine as the direct function pointer to the
upper layers, a la if_start.
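
The per-packet flow would then look roughly like this (foo_txq_start()
standing in for the driver-supplied per-queue start routine, which in
this sketch takes the queue lock itself):

	/* Always enqueue first ... */
	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error != 0)
		return (error);
	/* ... then kick the per-queue start routine, a la if_start. */
	foo_txq_start(txq);
	return (0);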

However, we would still need a way to close the latency race.  I've
attempted to do that by inverting my previous 'thread pending' flag.
Instead, I make the buf_ring store a 'busy' flag.  This flag is managed
by the single-consumer buf_ring dequeue method (the one drbr_dequeue()
uses).  It is set to true when a packet is removed from the queue while
there are more packets pending.  Conversely, if there are no other
pending packets, then it is set to false.  The assumption is that once a
thread starts draining the queue, it will not stop until the queue is
empty (or, if it has to stop for some other reason such as the transmit
ring being full, the driver will restart draining of the queue until it
is empty, e.g. after it receives a transmit completion interrupt).  Now
when the if_transmit routine enqueues the packet, it will get either a
real error, 0 if the packet was enqueued and the queue was not busy, or
EINPROGRESS if the packet was enqueued and the queue was busy.  For the
EINPROGRESS case the if_transmit routine just returns success.  For the
0 case it does a blocking lock on the queue lock and calls the queue's
start routine (note that this means the busy flag is similar to the old
OACTIVE interface flag).  This does mean that in some cases you may have
one thread that is sending what was the last packet in the buf_ring
holding the lock when another thread blocks, and that the first thread
will see the new packet when it loops back around, so the second thread
is wasting its time spinning; but in the common case I believe it will
give the same parallelism as the current code.  OTOH, there is nothing
to prevent multiple threads from "stacking up" in the new approach.  At
least the try_fail3 patch ensured only one thread at a time would ever
potentially block on the queue lock.
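
Putting that together, the if_transmit side of the 'busy' scheme would
look something like the sketch below, assuming the enqueue returns
EINPROGRESS when the ring was already busy and 0 when it was idle, as
described above (same made-up names as before):

	error = drbr_enqueue(ifp, txq->txq_br, m);
	if (error == EINPROGRESS)
		return (0);	/* A drainer is active and will see our packet. */
	if (error != 0)
		return (error);	/* Real error, e.g. ENOBUFS. */

	/* The ring was idle: take the blocking lock and drain it ourselves. */
	mtx_lock(&txq->txq_mtx);
	foo_start_locked(txq);
	mtx_unlock(&txq->txq_mtx);
	return (0);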

Another approach might be to replace the 'busy' flag with the 'thread
pending' flag from try_fail3.patch, but to clear the 'thread pending'
flag anytime the dequeue method is called rather than using an explicit
'clear pending' method.  (Hadn't thought of that until writing this
e-mail.)  That would perhaps prevent multiple threads from waiting on
the queue lock.
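
That alternative amounts to folding the clear into the dequeue path,
something like the sketch below (br_pending is an illustrative field
that would have to be added to struct buf_ring; it does not exist
today):

static __inline struct mbuf *
drbr_dequeue_clear_pending(struct ifnet *ifp, struct buf_ring *br)
{

	/*
	 * Clearing the flag on every dequeue removes the need for an
	 * explicit call in drivers that grab the queue lock after a
	 * failed try-lock.
	 */
	atomic_store_rel_int(&br->br_pending, 0);	/* illustrative field */
	return (drbr_dequeue(ifp, br));
}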

Note that the 'busy' approach (or the modification I mentioned above)
does rely on the assumption I stated above, i.e. once a driver starts
draining a queue, it will drain it until empty unless it hits an "error"
condition (link went down, transmit ring full, etc.).  If it hits an
"error" condition, the driver is responsible for restarting transmit
when the condition clears.  I believe our drivers already work this way
now.

The 'busy' patch is at http://www.freebsd.org/~jhb/patches/drbr.patch

--
John Baldwin

Q1: Has this been corrected?

I don't know.

Q2: Are there any case studies or benchmarks for buf_ring, or is it just
blindly being used because someone claimed it was better and offered it
for free?  One of the points of locking is to avoid race conditions, so
the fact that you have races in a supposed lock-less scheme seems more
than just ironic.

At the moment our entire ifnet/driver handoff is not optimal.  The
double buffering in the DMA rings and the IFQ (be it the old one or
buf_ring) is especially troubling.  See "Buffer Bloat" for one aspect of
it.  While hacking up a couple of drivers to experiment with automatic
irq/polling I've come up with a couple of ideas in this area that I'm
implementing right now.  When it's ready, meaningful benchmarks will be
run to assess the impact.

--
Andre

