Robert Watson wrote:
On Sun, 30 Jul 2006, Sam Leffler wrote:
I have a fair amount of experience with the linux model and it works
ok. The main complication I've seen is that when a driver needs to process
multiple queues of packets, things get more involved. This is seen in
802.11 drivers where there are two q's, one for data frames and one
for management frames. With the current scheme you have two separate
queues and the start method handles prioritization by polling the mgt
q before the data q. If instead the packet is passed to the start
method then it needs to be tagged in some way so that it's prioritized
properly. Otherwise you end up with multiple start methods; one per
type of packet. I suspect this will be ok but the end result will be
that we'll need to add a priority field to mbufs (unless we pass it as
an argument to the start method).
All this is certainly doable but I think just replacing one mechanism
with the other (as you specified) is insufficient.
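For concreteness, the two-queue pattern described above looks roughly like
the sketch below; the queue and function names (sc_mgtq, foo_tx, foo_softc)
are invented for illustration and don't correspond to any particular driver.

static void
foo_start(struct ifnet *ifp)
{
    struct foo_softc *sc = ifp->if_softc;
    struct mbuf *m;

    for (;;) {
        /* Poll the mgt q first so management frames win. */
        IF_DEQUEUE(&sc->sc_mgtq, m);
        if (m == NULL)
            IFQ_DEQUEUE(&ifp->if_snd, m);
        if (m == NULL)
            break;
        /* Hand the frame to the hardware; foo_tx consumes the mbuf. */
        foo_tx(sc, m);
    }
}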
Hmm. This is something that I had overlooked. I was loosely aware that
the if_sl code made use of multiple queues, but was under the impression
that the classification to queues occurred purely in the SLIP code.
Indeed, it does, but structurally, SLIP is split over the link layer
(if_output) and driver layer (if_start), which I had forgotten. I take
it from your comments that 802.11 also does this, which I was not aware of.
There are several issues here but the basic one is, I believe, that we
need to provide a per-packet notion of priority or TOS handling. The
distinction between mgt frame and data in 802.11 drivers is a kludge;
the right thing is to just use priority to get the desired effect. But
separately 802.11 is aware of priority for WME, so independent of mgt
frame priority we still need a way to pass down an AC (access category).
For 802.11 I was able to do this by encoding the value in the mbuf
flags. If there were a field in the mbuf header this kludge could be
removed. For other devices we still want a way to pass around the
DiffServ bits or similar so things like vlan priority can be set w/o
resorting to tagging each frame. Ideally, prioritization work like
what's done inside SLIP should be pulled out.
Note that just slapping a field in the mbuf is a start, but we also need
to think about how to handle it up and down the stack so layers can honor
existing priority and/or fill in priority for packets that aren't already
classified.
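As a rough illustration of the header-field idea: the field name
ph_priority, FOO_NTXQ, and the DSCP mapping below are all invented for this
sketch, not part of any patch.

/* Fill in priority only for packets that aren't already classified. */
static void
foo_classify(struct mbuf *m, uint8_t dscp)
{
    if (m->m_pkthdr.ph_priority == 0)
        m->m_pkthdr.ph_priority = dscp >> 3;  /* collapse DSCP to 0..7 */
}

/* A driver with multiple hardware queues can then honor the field. */
static struct foo_txq *
foo_select_txq(struct foo_softc *sc, struct mbuf *m)
{
    return (&sc->sc_txq[m->m_pkthdr.ph_priority % FOO_NTXQ]);
}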
I'm a little uncomfortable with our current m_tag model, as it requires
significant numbers of additional allocations and frees for each packet,
as well as walking linked lists. It's fine for occasional discretionary
use (i.e., MAC labels), but I worry about cases where it is used with
every packet, and we start seeing moderately non-zero numbers of tags on
every packet. I think I would be more comfortable with an explicit
queue identifier argument to if_start, where the link layer and driver
layer agree on how to identify queues.
As a straw man, how would the following strike you:
int if_startmbuf(struct ifnet *ifp, struct mbuf *m, int ifqid);
where for most link layers, the value would be zero, but for some link
layer/driver combinations, it would identify a specific queue which the
link layer believes the mbuf should be assigned, if implemented?
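Sketch only: how a link layer might call the straw-man interface. The
classification step and the error convention (driver frees the mbuf on
failure) are assumptions, not part of the proposal above.

static int
link_output_sketch(struct ifnet *ifp, struct mbuf *m)
{
    int ifqid = 0;  /* default queue: this link layer has no QoS notion */

    /*
     * A QoS-aware link layer (e.g. 802.11 WME) would instead set ifqid
     * to the access category / driver queue it wants, if implemented.
     */
    return (if_startmbuf(ifp, m, ifqid));
}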
mbuf tags are not a solution; too expensive. I think we need something
in the mbuf header.
Attached is a patch that maintains the current if_start, but adds
if_startmbuf. If a device driver implements if_startmbuf and the
global sysctl net.startmbuf_enabled is set to 1, then the
if_startmbuf path in the driver will be used. Otherwise, if_start is
used. I have modified the if_em driver to implement if_startmbuf
also. If there is no packet backlog in the if_snd queue, it directly
places the packet in the transmit descriptor ring. If there is a
backlog, it uses the if_snd queue protected by driver mutex, rather
than a separate ifq mutex.
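Schematically, the dispatch amounts to something like the sketch below;
this is a paraphrase rather than the literal patch, startmbuf_enabled
stands in for the sysctl variable, and the function name is invented.

static int
if_dispatch_sketch(struct ifnet *ifp, struct mbuf *m)
{
    int error;

    /* New path: driver provides if_startmbuf and the sysctl allows it. */
    if (startmbuf_enabled && ifp->if_startmbuf != NULL)
        return (ifp->if_startmbuf(ifp, m, 0));

    /* Classic path: enqueue on if_snd and kick if_start. */
    IFQ_HANDOFF(ifp, m, error);
    return (error);
}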
In some basic local micro-benchmarks, I saw a 5% improvement in UDP
0-byte payload PPS on UP, and a 10% improvement on SMP. I saw a 1.7%
performance improvement in the bulk serving of 1k files over HTTP.
These are only micro-benchmarks, and reflect a configuration in which
the CPU is unable to keep up with the output rate of the 1gbps
ethernet card in the device, so reductions in host CPU usage are
immediately visible in increased output as the CPU is able to better
keep up with the network hardware. Other configurations are also of
interest, especially ones in which the network device
is unable to keep up with the CPU, resulting in more queueing.
Conceptual review, as well as benchmarking, etc., would be most welcome.
Why is the startmbuf knob global and not per-interface? Seems like
you want to convert drivers one at a time?
I may have under-described what I have implemented. The decision is
currently made based on two factors: a global frob, and whether the
interface defines if_startmbuf (a non-NULL pointer). The global frob is intended
to make it easy to benchmark the difference. I should modify the patch
so that the global frob doesn't override the driver back to if_start in
the event that if_startmbuf is defined and if_start isn't. The global
frob is intended to be removed in the long run, and I intend for us to
continue to support both the old and new start methods for the
foreseeable future, since I don't intend to update every device driver we
have to the new method, at least not personally :-).
FWIW the original model was driven by the expectation that you could
raise the spl so the tx path was entirely synchronized from above.
With the SMPng work we're synchronizing transfer through each control
layer. If the driver softc lock (or similar) were exposed to upper
layers we could possibly return to the "lock the tx path" model we had
before and eliminate all the locking your changes target. But that
would be a big layering violation and would add significant contention
in the SMP case.
In some ways, what I propose comes to much the same thing: the change I
propose basically delegates the queueing and synchronization decisions
to the device driver, which might choose either to use the lock already
in the ifq, to use its own lock, or to use some other synchronization
strategy. In the case of if_em, I've implemented bypass of software
queueing entirely in the common case, but in the event that the hardware
ring backs up, then we still fall back to the if_snd queue, only we lock
it using the device driver's transmit path mutex. Delegating the
synchronization down the stack comes with risks, as device driver
writers will inevitably take liberties; on the other hand, it appears
that devices are quite diverse, and those liberties have advantages.
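A rough sketch of that fast-path/fallback split follows; all driver-side
names here (em_startmbuf_sketch, em_softc_sketch, em_ring_space,
em_encap_sketch, tx_mtx) are invented, and the actual if_em changes in the
patch differ in detail.

static int
em_startmbuf_sketch(struct ifnet *ifp, struct mbuf *m, int ifqid)
{
    struct em_softc_sketch *sc = ifp->if_softc;

    mtx_lock(&sc->tx_mtx);
    if (IFQ_IS_EMPTY(&ifp->if_snd) && em_ring_space(sc) > 0) {
        /* Common case: no backlog, go straight to the tx ring. */
        em_encap_sketch(sc, m);
    } else {
        /*
         * Backlog: fall back to the if_snd queue, but protected by the
         * driver's transmit lock rather than a separate ifq mutex.
         */
        _IF_ENQUEUE(&ifp->if_snd, m);
    }
    mtx_unlock(&sc->tx_mtx);
    return (0);
}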
I think the key observation is that most network hardware today takes
packets directly from private queues so the fast path needs to push
things down to those queues w/ minimal overhead. This includes
devices that implement QoS in h/w w/ multiple queues.
Yes -- however, you're right that the link layer needs to be able to
pass more information down. I'd like it to be able to do so without an
m_tag allocation, though, which suggests (as you point out) an explicit
argument to if_startmbuf.
Or an addition to the mbuf header.
Sam