On 18.03.2013 17:41, Andre Oppermann wrote:
On 18.03.2013 13:20, Alexander V. Chernikov wrote:
On 17.03.2013, at 23:54, Andre Oppermann <an...@freebsd.org> wrote:
On 17.03.2013 19:57, Alexander V. Chernikov wrote:
On 17.03.2013 13:20, Sami Halabi wrote:
OTOH OpenBSD has a complete implementation of MPLS out of the box, maybe
Their control plane code is mostly useless to us due to its design approach (routing daemons talk via the kernel).
What's your approach?
It is actually not mine. We have discussed this a bit in the radix-related thread. Generally quagga/bird (and other high-performance hardware-accelerated and software routers) have a feature-rich RIB from which best routes (possibly multipath) are installed into the kernel/FIB. The kernel's main task should be to do efficient lookups, while every other advanced feature should be implemented in userland.
Yes, we have started discussing it but haven't reached a conclusion among the two philosophies.  We have also agreed that the current radix code is horrible in terms of cache misses per lookup.  That however doesn't preclude an agnostic FIB+RIB approach.  It's mostly a matter of structure layout to keep it efficient.
Yes. Additionally, we have problems with misuse of the rtalloc API (rte grabbing). My point of view is to use a separate FIB for the 'data plane', i.e. forwarding, while keeping some kind of higher-level kernel RIB used for the route socket, multipath, other subsystem interaction and so on.
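To make the split concrete, here is a conceptual sketch in plain C (hypothetical field names, not the actual kernel structures): the RIB entry carries the policy-level attributes the daemons and route socket need, while the FIB entry is stripped down to what a lookup actually touches.

/*
 * Conceptual sketch only: hypothetical structures illustrating a RIB/FIB
 * split, not the real FreeBSD kernel layout.
 */
#include <stdint.h>

struct rib_entry {                 /* feature-rich control-plane view        */
    uint8_t   prefix[16];          /* address bytes (v4 or v6)               */
    uint8_t   plen;                /* prefix length                          */
    uint32_t  flags;               /* origin protocol, admin data            */
    uint32_t  nhop_group;          /* set of candidate next hops (multipath) */
    void     *daemon_attrs;        /* opaque attributes (metrics, tags)      */
};

struct fib_entry {                 /* lean data-plane view                   */
    uint32_t  nhop_idx;            /* resolved next hop index                */
    uint16_t  ifidx;               /* egress interface                       */
    uint16_t  mtu;                 /* path MTU for the forwarding path       */
};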
Their data plane code, well.. Yes, we can use some defines from their headers, but that's all :)
porting it would be short and more straightforward than porting the Linux LDP implementation from BIRD.
It is not a 'Linux' implementation; LDP itself is cross-platform.
The trickiest part here is the control plane.
However, making _fast_ MPLS switching is tricky too, since it requires changes in our netisr/ethernet handling code.
Can you explain what changes you think are necessary and why?
We definitely need the ability to dispatch a chain of mbufs; this was already discussed in the Intel RX ring lock thread on -net.
Actually I'm not so convinced of that.  Packet handling is a tradeoff between doing process-to-completion on each packet and doing context switches on batches of packets.
Yes. But I'm talking about a mixed approach (one part as batches, to eliminate contention, and then process-to-completion).
Context switches?

Batches are efficient; this is noted explicitly in:
1) Luigi's VALE paper http://info.iet.unipi.it/~luigi/papers/20121026-vale.pdf (Section 5.2)
2) Intel/6WIND, who use batches to move packets to their 'netisr' rings in their DPDK
3) PacketShader ( http://shader.kaist.edu/packetshader/ ), which uses batches too.

Every few years the balance tilts back and forth between process-to-completion and batch processing.  DragonFly went with a batch-lite token-passing approach throughout their kernel.  It seems it didn't work out to the extent they expected.
There are other, more successful solutions with _much_ better results (100x faster than our code, for example).
Now many parts are moving back to the more traditional locking approach.

Currently a significant number of drivers support interrupt moderation, permitting several/tens/hundreds of packets to be received per interrupt.
But they've also started to provide multiple queues.
Yes, but the hashing function is pre-defined, and bursty flows can still land on a single CPU.
For each packet we have to run some basic checks, PFIL hooks, netisr code and L3 code, resulting in many locks being acquired/released per packet.
Right, on the other hand you'll likely run into serious interlock and latency issues when large batches of packets monopolize certain locks, preventing other interfaces from sending their batches up.

Typically we rely on the NIC to put a packet into a given queue (direct ISR), which works badly for non-hashable types of traffic like GRE, PPPoE, MPLS. Additionally, the hashing function is either standard (from Microsoft NDIS) or documented, permitting someone malicious to generate 'special' traffic that all lands on a single queue.
Malicious traffic is always a problem, no matter how many queues you have.
Currently, even if we add an m2flowid/m2cpu function able to hash, say, GRE or MPLS, it is inefficient since we have to lock/unlock netisr queues for every packet.
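For illustration only, a minimal userspace-style sketch of what such an m2flowid-style helper could do for MPLS (hypothetical name, simple FNV-1a hash over the label stack; real code would parse the mbuf data and probably use a stronger hash such as Toeplitz or Jenkins):

/* Minimal sketch: derive a flow id from an MPLS label stack.
 * Hypothetical helper, not an existing kernel function; nqueues > 0. */
#include <stdint.h>
#include <stddef.h>

static uint32_t
mpls_flowid(const uint8_t *pkt, size_t len, uint32_t nqueues)
{
    uint32_t hash = 2166136261u;            /* FNV-1a offset basis */

    /* Walk the label stack: each entry is 4 bytes; the S (bottom-of-stack)
     * bit is the lowest bit of the third byte. */
    while (len >= 4) {
        uint32_t label = ((uint32_t)pkt[0] << 12) |
                         ((uint32_t)pkt[1] << 4)  |
                         ((uint32_t)pkt[2] >> 4);
        hash = (hash ^ label) * 16777619u;  /* FNV-1a step */
        if (pkt[2] & 0x01)                  /* bottom of stack reached */
            break;
        pkt += 4;
        len -= 4;
    }
    return (hash % nqueues);
}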
Yes, however I'm arguing that our locking strategy may be broken or sub-optimal.
Various solutions report, say, 50 MPPS (or a scalable 10-15 MPPS per core) of IPv4 forwarding. Currently the stock kernel can do ~1 MPPS. There are, of course, other reasons (structure alignment, radix, ARP code), but the difference is _too_ huge.
I'm thinking of:
* utilizing the m_nextpkt field in the mbuf header
OK.  That's what it is there for.

* adding some nh_chain flag to netisr; if a given netisr does not support the flag and nextpkt is not NULL, we simply call that netisr in a loop
* making the netisr hash function accept an mbuf 'chain' and a pointer to an array (sizeof N * ptr), sort the mbufs into N netisr queues and save the list heads to the supplied array; after that we put those lists on the appropriate queues (a rough sketch of this chain-sorting step follows below)
* teaching the ethersubr RX code to deal with mbuf chains (not an easy one)
* adding some partial support for handling chains to the fastfwd code
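The chain-sorting step could look roughly like this (simplified stand-in for struct mbuf and hypothetical names, not the real netisr API); the point is that each queue lock is then taken once per non-empty list instead of once per packet:

/* Simplified model of sorting an m_nextpkt chain into N queue lists.
 * 'struct mbuf' here is a stand-in; real code would use the kernel mbuf
 * and the netisr queueing primitives. */
#include <stddef.h>
#include <stdint.h>

struct mbuf {
    struct mbuf *m_nextpkt;     /* next packet in the chain          */
    uint32_t     m_flowid;      /* flow hash (from NIC or software)  */
};

static void
chain_sort(struct mbuf *chain, struct mbuf **heads, struct mbuf **tails,
    uint32_t nqueues)
{
    struct mbuf *m, *next;
    uint32_t q;

    for (q = 0; q < nqueues; q++)
        heads[q] = tails[q] = NULL;

    for (m = chain; m != NULL; m = next) {
        next = m->m_nextpkt;
        m->m_nextpkt = NULL;
        q = m->m_flowid % nqueues;
        if (tails[q] == NULL)
            heads[q] = m;
        else
            tails[q]->m_nextpkt = m;
        tails[q] = m;
    }
    /* The caller then appends heads[q] to queue q under a single lock
     * acquisition per non-empty list. */
}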
I really don't think this is going to help much.  You're just adding a lot of latency and context switches to the whole packet path.  Also you're making it much more complicated.

The interface drivers and how they manage the boundary between the RX ring and the stack are not optimal yet. I think there's a lot of potential there. In my tcp_workqueue branch I started to experiment with a couple of approaches.
It's not complete yet though.

The big advantage of having the interface RX thread push the packets is that it provides a natural feedback loop regarding system load. Once you have more packets coming in than you can process, the RX DMA ring gets naturally starved and the load is stabilized on the input side, preventing a live-lock that can easily happen in batch mode.
I see no difference here.
Only a well-adjusted driver works properly though, and we don't have any yet in that regard.
That's true.

Before we start to invent complicated mbuf batching methods, let's make sure that the single-packet path is at its maximal possible efficiency, and only then evaluate more complicated approaches on whether they deliver additional gains.

From that it follows that we should:

 1. fix longest prefix match radix to minimize cache misses.
Personally I think that 'fix' here means rewriting it entirely.
For example, a common solution for IPv6 lookup is to use the fact that you have either /64-or-wider routes or host routes, so the radix lookup is done on the first 64 bits (and there can be another tree in a given element if there are several more-specific routes). That's why I'm talking about a RIB/FIB approach (one common 'academic' implementation for control, and family-dependent, efficient lookup code).
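A toy sketch of that two-stage idea (hypothetical types, linear scan only for illustration; a real FIB would use a hash table or compressed trie keyed on the top 64 bits):

/* Sketch of a two-stage IPv6 lookup: exact match on the first 64 bits,
 * with an optional secondary structure for the rare more-specific routes.
 * Hypothetical layout, toy table populated elsewhere. */
#include <stdint.h>
#include <stddef.h>

struct v6_entry {
    uint64_t  prefix64;      /* top 64 bits of the prefix                   */
    uint32_t  nhop_idx;      /* next hop for the /64 match                  */
    void     *more;          /* optional tree of >/64 routes (usually NULL) */
};

static struct v6_entry v6_table[4];   /* toy table for illustration */
static size_t v6_count;

static uint32_t
v6_lookup(const uint8_t dst[16])
{
    uint64_t key = 0;

    for (int i = 0; i < 8; i++)        /* first 64 bits as the key */
        key = (key << 8) | dst[i];

    for (size_t i = 0; i < v6_count; i++) {
        if (v6_table[i].prefix64 == key) {
            /* If 'more' is set, a secondary lookup on the low 64 bits
             * would run here; omitted in this sketch. */
            return (v6_table[i].nhop_idx);
        }
    }
    return (0);                        /* no match: default / drop */
}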
We also need to fix the fundamental rte usage paradigm; currently it is unsafe for both the ingress and egress interface.
 2. fix drivers to optimize RX dequeuing and TX enqueuing.

 3. have a critical look at other parts of the packet path to avoid
For IPv4 fastforwarding we have:
1) RX ring mtx lock, (BPF rlock), (L3 PFIL in, FW lock), ifaddr rlock, radix rlock, rte mtx_lock (twice by default), (L3 PFIL out, FW lock), ARP rlock, ARP entry rlock, TX ring lock?
(And +2  rlocks for VLAN, and another 2 for LAGG)

There was an RX ring lock/unlock thread for the 82599 that ended with nothing.
I've changed BPF's two mtx_locks into one rlock.
There are patches permitting IPFW to use the PFIL lock (however they are not committed due to the possibility that we can make PFIL lockless).
I've removed 1 rte mtx_lock (dynamic routes, turned off if forwarding is ON)
I'm working on an ARP stack rewrite: the first stage is to enable L2 multipath (lagg-aware) and remove the ARP entry lock from the forwarding path; the second stage is to store a pointer to the full L2 prepend header in the rte (yes, that was removed in 7.x).
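Roughly what the second stage could cache per route (hypothetical structure, not the actual rtentry): a prebuilt L2 header so the hot path only does a memcpy instead of touching the ARP table.

/* Sketch: a precomputed L2 prepend cached with the route, so forwarding
 * copies the header without an ARP-table lookup per packet.
 * Hypothetical structure and names. */
#include <stdint.h>
#include <string.h>

#define ETHER_HDR_LEN 14

struct l2_prepend {
    uint8_t   hdr[ETHER_HDR_LEN];   /* dst MAC, src MAC, ethertype          */
    uint16_t  len;                  /* valid length of hdr[]                */
    uint32_t  gen;                  /* generation; bumped when ARP changes  */
};

/* Hot path: copy the cached header in front of the payload. */
static inline void
prepend_l2(uint8_t *frame, const struct l2_prepend *p)
{
    memcpy(frame, p->hdr, p->len);
}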
I'm a bit stuck for other ideas to eliminate the remaining locks (other than simply moving the forwarding code to a userland netmap-based solution like others do).

