On Thu, 17 Apr 2008, Alexander Sack wrote:

On Wed, Apr 16, 2008 at 10:53 PM, Bruce Evans <[EMAIL PROTECTED]> wrote:
On Wed, 16 Apr 2008, Alexander Sack wrote:

[DEVICE_POLLING]
But why was it added to begin with if standard interrupt-driven I/O is
faster?  (was it because historically hardware didn't do interrupt
coalescing?)

See Robert's reply.

However, my point still stands:

#define TG3_RX_RCB_RING_SIZE(tp) ((tp->tg3_flags2 & TG3_FLG2_5705_PLUS) ? 512 : 1024)

Even the Linux driver uses a higher number of RX descriptors than
FreeBSD's static 256.  I think minimally making this tunable is a fair
approach.

If not, no biggie, but I think it's worth it.

 I use a fixed value of 512 (jkim gave a pointer to old mail containing
 a fairly up to date version of my patches for this and more important
 things).  This should only make a difference with DEVICE_POLLING.

Then minimally 512 could be used if DEVICE_POLLING is enabled and my
point still stands.  Though in light of the other statistics you cited
I understand now that this may not make that big of an impact.

em uses only 256 too (I misread it as using 2048).  Someone reported that
increasing this to 4096 reduced packet loss with polling.
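
Making it tunable would be easy enough to hack in for testing.  An
untested sketch (the tunable name, the variable and the helper are all
made up; the fill loop in bge_init_rx_ring() would have to use the
clamped value instead of the hard-coded count):

%%%
/*
 * Untested sketch: let loader.conf override how many std RX descriptors
 * bge fills.  Names here are hypothetical, not the driver's own.
 */
#include <sys/param.h>
#include <sys/kernel.h>

static int bge_rx_ring_fill = 256;	/* current hard-coded default */
TUNABLE_INT("hw.bge.rx_ring_fill", &bge_rx_ring_fill);

static int
bge_rx_ring_fill_count(void)
{
	/* Clamp to something sane for the std ring. */
	if (bge_rx_ring_fill < 32)
		return (32);
	if (bge_rx_ring_fill > 512)
		return (512);
	return (bge_rx_ring_fill);
}
%%%

Then hw.bge.rx_ring_fill="512" in /boot/loader.conf would be enough to
experiment with.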

 Without DEVICE_POLLING, device interrupts normally occur every 150 usec

Is that the coal ticks value you are referring to?  Sorry this is my
first time looking at this driver!

Yes, the driver normally configures coal ticks as 150.  This is a good
default.
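
If you want to see where that is set: the host coalescing block is
programmed in bge_blockinit().  From memory the relevant writes look
roughly like this (register and field names may be slightly off; check
if_bgereg.h):

%%%
/*
 * Sketch only: the chip raises an interrupt after rx_coal_ticks usec
 * or rx_max_coal_bds received descriptors, whichever comes first.
 */
static void
bge_coal_sketch(struct bge_softc *sc)
{
	CSR_WRITE_4(sc, BGE_HCC_RX_COAL_TICKS, sc->bge_rx_coal_ticks);
	CSR_WRITE_4(sc, BGE_HCC_RX_MAX_COAL_BDS, sc->bge_rx_max_coal_bds);
}
%%%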

 or even more frequently (too frequently if the average is much lower
 than 150 usec), so 512 descriptors is more than enough for 1Gbps ethernet
 (the minimum possible inter-descriptor time for tiny packets is about 0.6
 usec,

How do you measure this number?

0.6 usec is the theoretical minimum.  I actually measure a minimum of
about 1.5 usec for my hardware (5701 PCI/X on plain PCI) by making
timestamps in bge_rxeof() and bge_txeof().  (1.5 usec is the average
for a ring full of descriptors.)
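
The instrumentation is nothing fancy.  A stripped-down sketch of the
idea (the record layout and the names here are made up for
illustration; the real thing just has to avoid printing in the hot
path):

%%%
#include <sys/param.h>
#include <sys/time.h>

#define	BGE_TSTAMP_RECS	1024

struct bge_tstamp {
	struct timeval	tv;	/* wall-clock time of this bge_rxeof() call */
	int		ndesc;	/* descriptors processed by the call */
};

static struct bge_tstamp bge_tstamps[BGE_TSTAMP_RECS];
static u_int bge_tstamp_idx;

/* Called once per bge_rxeof() invocation; records are dumped later. */
static void
bge_tstamp_record(int ndesc)
{
	struct bge_tstamp *rec;

	rec = &bge_tstamps[bge_tstamp_idx++ % BGE_TSTAMP_RECS];
	microtime(&rec->tv);
	rec->ndesc = ndesc;
}
%%%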

I'm assuming when you say "inter-descriptor time" you mean the time it
takes the card to fill an RX descriptor on receipt of a packet (really
the firmware latency?).

No, it is part of the Ethernet spec (96 bit times for all speeds of
Ethernet IIRC, so it is much shorter than it was for original Ethernet).
At least my hardware takes significantly longer than this (1.5 - 0.6
usec = 900 nsec!).  It is unclear where the extra time is spent, but
presumably the hardware implements the Ethernet spec and is limited
mainly by the bus speed (if the bus is plain PCI, otherwise DMA speed
might be the limit), so if packets arrived every 0.6 usec then it
would buffer many of them in fast device memory and then be forced to
drop 9 in every 15 on average.
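
(The arithmetic for that: in any 15 usec, 15 / 0.6 = 25 tiny packets can
arrive but only 15 / 1.5 = 10 can be processed, so 15 of the 25, i.e. 9
in every 15, get dropped.)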

 For timeouts instead of device polls, at least on old systems it was
 quite common for timeouts at a frequency of HZ not actually being
 delivered, even when HZ was only 100, because some timeouts run for
 too long (several msec each, possibly combining to > 10 msec occasionally).
 Device polls are at a lower level, so they have a better chance of
 actually keeping up with HZ.  Now the main source of timeouts that run
 for too long is probably mii tick routines.  These won't combine, at
 least for MPSAFE drivers, but they will block both interrupts and
 device polls for their own device.  So the rx ring size needs to be
 large enough to cover max(150 usec or whatever the interrupt moderation
 time is, mii tick time) of latency plus any other latencies due to
 interrupt handling or polling of other devices.  Latencies due to
 interrupts on other devices are only certain to be significant if the
 other devices have higher or the same priority.

You described what I'm seeing.  The fact that the driver uses one mtx
for everything doesn't help either.  I'm pretty sure I'm running into
RX descriptor starvation despite the fact that, statistically speaking,
256 descriptors is enough for 1Gbps (I'm talking 100MBps and the
firmware is dropping packets).  The problem gets worse if I add some
kind of I/O workload on the system (my load happens to be a gzip of a
large log file in /tmp).

I haven't found the mii tick latency to be a problem in practice, though
I once suspected it.  Oh, I just remembered that this requires working
PREEMPTION so that lower-priority interrupt handlers like ata and sc get
preempted.  PREEMPTION wasn't the default and didn't work very well until
relatively recently.  But I think it works in 7.0.
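
(For reference, it is just a kernel config option; if your config
doesn't already have it, the line is simply

%%%
options 	PREEMPTION		# kernel thread preemption
%%%

and a rebuild picks it up.)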

I noticed that if I put ANY kind of debugging messages in bge_tick()
the drop gets much worse.  For example, just printing out the number of
dropped packets read from bge_stats_update() when a drop occurs causes
EVEN more drops to occur; if I had to guess, the printf just uses up
more cycles, which delays the drain of the RX chain and makes recovery
take longer (this is with a constant stream from a traffic generator).

Delays while holding the lock will cause problems of course.  Hmm,
bge_tick() is a callout, so it may itself be delayed or preempted.
Delaying it shouldn't matter, and latency from preempting it is
supposed to be handled by priority propagation:

        callout ithread runs
        calls bge_tick()
        acquires device mutex
        ...
                preempted by unrelated ithread
                ...
                        preempted by bge ithread
                        tries to acquire device mutex; blocks
                        bge ithread priority is propagated to callout ithread
                preempted by callout ithread
        ... // now it is high priority; should be more careful not to take long
        releases device mutex; loses its propagated priority
                        preempted by bge ithread
                        acquires device mutex
                        ...


 Some numbers for [1 Gbps] ethernet:

 minimum frame size = 64 bytes =    512 bits
 minimum inter-frame gap =           96 bits
 minimum total frame time =         608 nsec (may be off by 64)
 bge descriptors per tiny frame   = 1 (1 for mbuf)
 buffering provided by 256 descriptors = 256 * 608 = 155.648 usec (marginal)

So as I read this, it takes 155 usec to fill up the entire RX chain of
rx_bd's if it's just small packets, correct?

At least that long, depending on bus and DMA speeds.
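
As a rule of thumb you can just divide the latency you need to ride out
by the ~608 nsec minimum per-descriptor time.  A trivial userland check
(the latency values are only examples):

%%%
#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double desc_ns = 608.0;	/* tiny frame + inter-frame gap, 1 Gbps */
	const double latency_us[] = { 150.0, 300.0, 1000.0 };
	size_t i;

	for (i = 0; i < sizeof(latency_us) / sizeof(latency_us[0]); i++)
		printf("%6.0f usec of latency needs %4.0f descriptors\n",
		    latency_us[i], ceil(latency_us[i] * 1000.0 / desc_ns));
	return (0);
}
%%%

So 256 descriptors (about 247 needed) only just covers the 150 usec of
interrupt moderation for tiny frames, with nothing left over for mii
tick or other latencies, while 512 covers roughly 311 usec.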

 normal frame size = 1518 bytes = 12144 bits
 normal total frame time =        12240 nsec
 bge descriptors per normal frame = 2 (1 for mbuf and 1 for mbuf cluster)
 buffering provided by 256 descriptors = 256/2 * 12240 = 1566.720 usec (plenty)

Is this again based on your own instrumentation from the last patch?
(Just curious - I believe you, I just wanted to know if this was an
artifact of you doing some tuning research or something else.)

This is a theoretical minimum too, but in practice even a PCI bus can
almost keep up with 1Gbps ethernet in 1 direction, so I've measured
average packet rates of > 81 kpps for normal frames (81 kpps = 12345
nsec per packet).  Timestamps made in bge_rxeof() at a rate of only
62.7 kpps (since my em card can't go faster than this) look like this:

%%%
  97 1208479322.632804  13   0   7 1208479322.632804   6
 104 1208479322.632908  11   0   6 1208479322.632908   5
 105 1208479322.633013   9   1   5 1208479322.633014   4
  64 1208479322.633078  10   0   4 1208479322.633078   6
  95 1208479322.633173  11   1   6 1208479322.633174   5
%%%

Here the columns give:
1st: time in usec between bge_rxeof() calls
4th: time in usec taken by this call
5th: number of descriptors processed by this call
other: raw timestamps and ring indexes

The inter-rxeof time is ~100 usec since rx_coal_ticks is configured to 100.
Thus there are only a few packets per interrupt at the "low" rate of 62.7 kpps.
There are no latency problems in sight in this truncated output.
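
(That also matches the 5th column: at 62.7 kpps a packet arrives about
every 16 usec, so ~100 usec between calls gives roughly 6 descriptors
per call, and the output shows 4-7.)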

This output is inconsistent with what I said above -- there is no sign of
the factor of 2 for the mbuf+cluster split.  I now think that that split
only affects output.

So the million dollar question: do you believe that if I disable
DEVICE_POLLING and use interrupt-driven I/O, I could achieve zero
packet loss over a 1Gbps link?  This is the main issue I need to solve
(solve means either no, it's not really achievable without a heavy
rewrite of the driver, OR yes, it is with some tuning).  If the answer
is yes, then I have to understand the impact on the system in general.
I just want to be sure I'm on a viable path through the BGE maze!

I think you can get close enough if the bus and memory and CPU(s)
permit and you don't need to get too close to the theoretical limits.

Bruce