[ ... why you want something more than interrupt coalescing ... ]

Don Bowman wrote:
> Actually I have pushed it to the livelock case. I'm shocked at how
> easy this is to do with BSD (I'm used to systems like VxWorks with
> much lower-overhead interrupt processing).
> I found that for a 2x XEON @ 2GHz that I can achieve this @ ~100Mbps
> of minimal size UDP fragments. Tuning the driver dramatically improved
> the situation. Reducing the size of its receive ring to the proper
> amount also helps since it will then run out of buffers and
> drop packets.  This isn't extreme load, it isn't really particularly
> heavy load, its only like ~200Kpps. I suspect the defragmenting
> is the issue, so I tried it again with ARP's. This helped a lot.
> 
> I'm still not clear on how the receiver polling helps me, it also
> makes a constant rate consumption of packets. If I set the bds
> to the max, then I will only be interrupted @ constant rate by
> the device.

OK.  This really has nothing to do with interrupt processing
latency, except that such latency increases pool retention time,
and reduces the overall load-bearing capacity of a single system.
In other words, latency affects the number of connections and
the total amount of data in transit, but not whether or not that
data gets through.  So it's not a direct cause of livelock, even
if it can be an indirect cause.

There are a couple of livelock points.  The best paper describing
this is:

        Eliminating Receiver Livelock in an Interrupt-Driven Kernel
        Jeffrey C. Mogul, K.K. Ramakrishnan
        http://citeseer.nj.nec.com/mogul96eliminating.html

This isn't the earliest paper on the topic but it's authoritative.

All of these boil down to packets not getting all the way to the
application that has the socket open, for whatever reason, or to
the system being unable to send responses to the packets which
it has received.

Basically, a receiver livelock can occur any place that there is
not a negative feedback loop to control packet sources, once you
achieve resource saturation.

In BSD network processing, there are a number of livelock points,
where there is no negative feedback loop.  Following the packet in
from the wire, these are:

o       On some cards, received packets are copied across the bus
        to a main memory ring buffer, even when the packets are not
        being processed.  The correct thing to do is to not copy
        the packets in this case, reserving the bus for other data
        transfers (e.g. an NFS server cannot do disk transfers if
        it is spending all bus cycles doing packet transfers).
        This is a network controller firmware issue, and has to be
        addressed there (many, but not all, network adapters "do
        the right thing").

o       Hardware interrupts occur when packets are transferred
        successfully from network adapter memory to main memory,
        which causes the host system to run the interrupt handler
        code, instead of running other code, like the protocol
        processing or the application that owns the connection.
        This is the highest priority, so you need to be able to
        squelch interrupt processing.

o       Protocol processing is the next highest priority item; if
        you successfully squelch hardware interrupts when you
        hit load capacity, you can still deny applications time
        to run by, instead of spending all of your time handling
        interrupts, spending all your time doing protocol
        processing.  The application does not have an opportunity
        to run, to deal with the packets which have been received.

o       Application processing is the lowest priority item.  When
        you hit a resource limit, such as number of mbufs available
        in the system as a whole, such that you cannot allocate send
        chains for responding to the traffic requests you are getting,
        then you back up input requests, and, again, you are in
        trouble.

All of these lead to receiver livelock.

There are several ways to attack this issue, but given an "infinite"
ability to generate load, the only one that's effective at handling
at least *some* load is to insert negative feedback loops between the
stall points in the network processing model, so that the next stage
can squelch the incoming packets.

The normal way to deal with this is to drop the packets as early
as possible, before investing a lot of resources in processing them.
The best way (so far) of doing this is LRP ("Lazy Receiver Processing"),
which is described in:

        Lazy Receiver Processing (LRP): A Network Subsystem
                Architecture for Server Systems
        Peter Druschel, Gaurav Banga
        http://citeseer.nj.nec.com/druschel96lazy.html

This paper describes more of the details of receiver livelock, and
the problems with the BSD processing model, and has a number of nice
pictures that I can't do justice to with just "ASCII Art".


What DEVICE_POLLING does is address the squelching of incoming
packets at the hardware interrupt level, so that they are dropped
by the network adapter.

Depending on the adapter design, they may still be copied to host
memory from adapter memory, particularly when there is not an
explicit acknowledgement, just a ring buffer in host memory.  You
don't want to buy these network adapters, though you will be unlikely
to see the problem in an HTTP server, since it will serve most of its
content from cache, rather than across the device bus, unless it is a
streaming media server or serves very large static files.  8-).


In addition, the DEVICE_POLLING code contains scheduler modifications,
which permit you to reserve a certain amount of time for the application
to run, with interrupts disabled, so that the applications can clear the
input queues, and send out responses.

In other words, DEVICE_POLLING addresses two of the livelock points
above, where there is not a negative feedback loop, but does it in a
suboptimal way: by weighted round-robin timesharing of the main CPU.

In point of fact, you are more likely to hit resource limits than
you are to hit livelock at the other points, unless you tune your
kernel for higher processing rates, at which point latency will fill
the pool, and you may hit one of the other hard livelock conditions
(there is also the potential for interaction livelock; for example,
your application may be a streamcast application that can't send to
any channel if one outbound channel is full, etc.).

Actually, you *want* to tune your system up, so that these become
issues, and then address them, as well.

For example, to avoid CPU starvation on a non-single-purpose server
host, where applications are reserved more time than they need on
average, you can only approximate a best fit with DEVICE_POLLING,
by adjusting the scheduler modifications (they are roughly equal to
the SVR4 "fixed" scheduling class, where work is time division
multiplexed onto the CPU).  To address this, you really want to
deal with pushing data to the application on the basis of the queue
depths from kernel to user space: a WFQ ("Weighted Fair-Share Queueing")
mechanism, to ensure application runtime proportional to the time needed
to handle a given load.


All that can come later.  For right now, since you don't have access
to my source tree (8-)), you should content yourself with DEVICE_POLLING.


-- Terry

