Andrew Gallatin wrote:
Scott Long [EMAIL PROTECTED] wrote:

scottl      2006-01-11 00:30:25 UTC

 FreeBSD src repository

 Modified files:
    sys/dev/em           if_em.c if_em.h
  Log:
 Significant performance improvements for the if_em driver:


Very cool.


 - If possible, use a fast interrupt handler instead of an ithread handler.  Use
   the interrupt handler to check and squelch the interrupt, then schedule a
   taskqueue to do the actual work.  This has three benefits:
   - Eliminates the 'interrupt aliasing' problem found in many chipsets by
     allowing the driver to mask the interrupt in the NIC instead of the
     OS masking the interrupt in the APIC.


Neat.  Just like Windows..

<....>


   - Don't hold the driver lock in the RX handler.  The handler and all
     associated data are effectively serialized already.  This eliminates the
     cost of dropping and reacquiring the lock for every received packet.  The
     result is much lower contention for the driver lock, resulting in lower
     CPU usage and lower latency for interactive workloads.


This seems orthogonal to using a fastintr/taskqueue, or am I missing something?

Assuming a system where interrupt aliasing is not a problem, how much
does using a fastintr/taskqueue change interrupt latency as compared
to using an ithread?  I would (naively) assume that using an ithread
would be faster & cheaper.  Or is disabling/enabling interrupts in the
APIC really expensive?


Touching the APIC is tricky.  First, you have to pay the cost of a spinlock.
Then you have to pay the cost of at least one read and one write across the
FSB.  Even though the APIC registers are memory-mapped, they are still
uncached.  It's not terribly expensive, but it does add up.
Bypassing this and using a fast interrupt means that you pay the cost of
one PCI read, which you would have to do anyway with either method, and one
PCI write, which gets posted at the host-PCI bridge and is thus only as
expensive as an FSB write.  Overall, I don't think the cost difference is a
whole lot, but when you are talking about thousands of interrupts per
second, especially with multiple interfaces running under load, it can
matter.  And the 750x and 752x chipsets are so common that it is worthwhile
to deal with them (and there are reports that the aliasing problem is
showing up on more chipsets than just these now).
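Roughly, the fast handler ends up shaped like the sketch below.  This is
only an illustration with placeholder names (my_softc, MY_READ_ICR,
my_disable_intr), not the actual if_em code:

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/taskqueue.h>

struct my_softc {
    device_t            dev;
    struct resource     *irq_res;
    void                *intr_tag;
    struct taskqueue    *tq;
    struct task         rxtx_task;
    /* ... registers, rings, locks ... */
};

/*
 * Fast interrupt handler: one PCI read to check the interrupt, one posted
 * PCI write to squelch it at the NIC, then hand the real work off.
 */
static void
my_intr_fast(void *arg)
{
    struct my_softc *sc = arg;
    uint32_t icr;

    icr = MY_READ_ICR(sc);          /* placeholder register read */
    if (icr == 0 || icr == 0xffffffff)
        return;                     /* not ours, or the card is gone */

    /*
     * Mask further interrupts in the NIC rather than having the OS mask
     * the vector in the APIC.
     */
    my_disable_intr(sc);            /* placeholder: one posted PCI write */

    /* Defer the real RX/TX processing to the driver's private taskqueue. */
    taskqueue_enqueue(sc->tq, &sc->rxtx_task);
}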

As for latency, the taskqueue runs at the same PI_NET priority as the
ithread would.  I thought that there was an optimization on some platforms
to encourage quick preemption for ithreads when they are scheduled, but I
can't find it now.  So the taskqueue shouldn't be all that different from
an ithread, and it even means that there won't be any sharing between
instances even if the interrupt vector is shared.
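For reference, the setup side is just a private taskqueue plus the fast
handler registration, something like the following (FreeBSD 6-era
bus_setup_intr; the my_* names are placeholders, and my_rxtx_task is
sketched further down):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/malloc.h>
#include <sys/priority.h>
#include <sys/taskqueue.h>

static int
my_setup_intr(struct my_softc *sc)
{
    TASK_INIT(&sc->rxtx_task, 0, my_rxtx_task, sc);
    sc->tq = taskqueue_create_fast("my_taskq", M_NOWAIT,
        taskqueue_thread_enqueue, &sc->tq);
    if (sc->tq == NULL)
        return (ENOMEM);

    /* The task thread runs at PI_NET, same as a network ithread would. */
    taskqueue_start_threads(&sc->tq, 1, PI_NET, "%s taskq",
        device_get_nameunit(sc->dev));

    /* Register the fast handler; the taskqueue does the heavy lifting. */
    return (bus_setup_intr(sc->dev, sc->irq_res,
        INTR_TYPE_NET | INTR_FAST, my_intr_fast, sc, &sc->intr_tag));
}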

Another advantage is that you get adaptive polling for free.  Interface
polling only works well when you have a consistently high workload.  For
spiky workloads, you do get higher latency at the leading edge of the
spike since the polling thread is asleep while waiting for the next
tick.  Trying to estimate workload and latency in the polling loop is a
pain, while letting the hardware trigger you directly is a whole lot
easier.
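To make that concrete, the deferred handler is where the adaptive part
falls out: it drains a bounded batch, re-queues itself if there is still
work (leaving the NIC masked), and only unmasks when it goes idle.  Again
just a sketch, with my_rxeof/my_txeof standing in for the real cleanup
routines:

static void
my_rxtx_task(void *context, int pending)
{
    struct my_softc *sc = context;
    int more;

    /* RX processing runs without the driver lock (see above). */
    more = my_rxeof(sc, sc->rx_process_limit);

    MY_LOCK(sc);
    my_txeof(sc);               /* reclaim completed TX descriptors */
    MY_UNLOCK(sc);

    if (more) {
        /*
         * Still busy: keep "polling" by re-queueing ourselves and
         * leaving interrupts masked at the NIC.
         */
        taskqueue_enqueue(sc->tq, &sc->rxtx_task);
        return;
    }

    /* Idle again: unmask at the NIC and go back to being interrupt driven. */
    my_enable_intr(sc);
}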

However, taskqueues are really just a proof of concept for what I really
want, which is to allow drivers to register both a fast handler and an
ithread handler.  For drivers doing this, the ithread would be private
to the driver and would only be activated if the fast handler signals
it.  Drivers without fast handlers would still get ithreads that would
still act the way they do now.  If an interrupt vector is shared with
multiple handlers, the fast handlers would all get run, but the only
ithreads that would run would be for drivers without a fast handler and
for drivers that signaled for it to run from the fast handler.  Anyway,
John and I have discussed this quite a bit over the last year; we just
need time to implement it.
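Purely as a sketch of the idea (none of this API exists; every name here
is made up for illustration), the two halves and their registration might
look something like:

/* Hypothetical API: a fast filter plus a private ithread handler. */
static int
my_filter(void *arg)
{
    struct my_softc *sc = arg;

    if (!MY_INTR_IS_OURS(sc))
        return (MY_FILTER_STRAY);       /* let other handlers run */

    my_disable_intr(sc);
    return (MY_FILTER_SCHEDULE_THREAD); /* wake only this driver's ithread */
}

static void
my_ithread_handler(void *arg)
{
    struct my_softc *sc = arg;

    my_rx_and_tx_cleanup(sc);
    my_enable_intr(sc);
}

/*
 * Registration would pass both handlers; drivers without a filter would
 * keep today's behavior (hypothetical signature):
 *
 *   my_bus_setup_intr(dev, res, INTR_TYPE_NET, my_filter,
 *       my_ithread_handler, sc, &cookie);
 */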

Do you have a feel for how much of the increase was due to the other
changes (rx lock, avoiding register reads)?

Both of those do make a difference, but I didn't introduce them into
testing until Andre had already done some tests that showed that the
taskqueue helped.  I don't recall what the difference was, but I think
it was in the low 10% range.  Another thing that I want to do is to get
the tx-complete path to run without a lock.  For if_em, this means killing
the shortcut in em_encap that calls into the tx-complete path to clean up
the tx ring.  It also means being careful with updating and checking the
tx ring counters between the two sides of the driver.  But if it can be
made to work, then almost all top/bottom contention in the driver can be
eliminated.
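The careful part is the counter discipline between the two sides.  One
common way to do it (not necessarily what if_em will end up with; the
field and function names below are hypothetical) is to let the encap side
own the producer index and the cleanup side own the consumer index, with
acquire/release semantics when one side reads the other's index:

#include <sys/types.h>
#include <machine/atomic.h>

struct my_txring {
    volatile uint32_t   prod;   /* written only by my_encap() */
    volatile uint32_t   cons;   /* written only by my_txeof() */
    uint32_t            size;   /* number of descriptors */
    /* ... descriptors, mbuf pointers ... */
};

/* prod/cons are free-running; unsigned subtraction handles the wrap. */
static u_int
my_tx_avail(struct my_txring *tr)
{
    return (tr->size - (tr->prod - atomic_load_acq_32(&tr->cons)));
}

static void
my_txeof(struct my_txring *tr)
{
    uint32_t cons = tr->cons;

    /* ... walk completed descriptors from cons, freeing mbufs ... */

    /*
     * Publish the new consumer index with a release store so the encap
     * side only sees the freed slots after the cleanup is finished.
     */
    atomic_store_rel_32(&tr->cons, cons);
}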


Scott