On 18.03.2013 17:41, Andre Oppermann wrote:
On 18.03.2013 13:20, Alexander V. Chernikov wrote:
On 17.03.2013, at 23:54, Andre Oppermann <an...@freebsd.org> wrote:
On 17.03.2013 19:57, Alexander V. Chernikov wrote:
On 17.03.2013 13:20, Sami Halabi wrote:
OTOH OpenBSD has a complete implementation of MPLS out of the box, maybe
Their control plane code is mostly useless due to its design approach
(routing daemons talk via the kernel).
What's your approach?
It is actually not mine. We have discussed this a bit in the
radix-related thread. Generally quagga/bird (and other high-performance
hardware-accelerated and software routers) have a feature-rich RIB from
which the best routes (possibly multipath) are installed into the kernel FIB.
The kernel's main task should be to do efficient lookups, while every
other advanced feature should be implemented in userland.
Yes, we have started discussing it but haven't reached a conclusion
between the
two philosophies. We have also agreed that the current radix code is
horrible
in terms of cache misses per lookup. That however doesn't preclude an
agnostic
FIB+RIB approach. It's mostly a matter of structure layout to keep it
efficient.
Yes. Additionally, we have problems with misuse of the rtalloc API (rte
grabbing).
My point of view is to use a separate FIB for the 'data plane', i.e.
forwarding, while keeping some kind of higher-level
kernel RIB used for the route socket, multipath, other subsystem
interaction and so on.
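Just to illustrate the kind of split I mean (a rough sketch only, field
names are made up and not taken from any existing patch):

    #include <sys/types.h>
    #include <sys/socket.h>

    /*
     * Sketch only: the FIB entry stays small and read-mostly so a
     * lookup touches as few cache lines as possible; everything the
     * control plane needs (rtsock, multipath selection, refcounting)
     * lives in the RIB entry and never shows up on the hot path.
     */
    struct fib_entry {
            uint32_t         fe_prefix;    /* IPv4 destination prefix */
            uint8_t          fe_plen;      /* prefix length */
            uint16_t         fe_nhop_idx;  /* index into next-hop table */
    };

    struct rib_entry {
            struct fib_entry *re_fib;      /* dataplane state, if installed */
            struct sockaddr  *re_dst;      /* full sockaddr for rtsock users */
            u_long            re_flags;    /* RTF_* flags, multipath marker */
            void             *re_nhops;    /* list of candidate next-hops */
            /* refcounts, timestamps, per-protocol metadata, ... */
    };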
Their data plane code, well.. Yes, we can use some defines from
their headers, but that's all :)
porting it would be shorter and more straightforward than porting the
Linux LDP
implementation of BIRD.
It is not a 'Linux' implementation. LDP itself is cross-platform.
The trickiest part here is the control plane.
However, making _fast_ MPLS switching is tricky too, since it
requires changes in our netisr/ethernet
handling code.
Can you explain what changes you think are necessary and why?
We definitely need the ability to dispatch a chain of mbufs - this was
already discussed in the Intel RX ring lock thread on -net.
Actually I'm not so convinced of that. Packet handling is a tradeoff
between doing process-to-completion on each packet and doing context
switches on batches of packets.
Yes. But I'm talking about a mixed approach (one part done in batches,
to eliminate contention, and then process-to-completion).
Context switches?
Batches are efficient; this is noted explicitly in:
1) Luigi's VALE paper
http://info.iet.unipi.it/~luigi/papers/20121026-vale.pdf (Section 5.2)
2) Intel/6WIND use batches to move packets to their 'netisr' rings in
their DPDK
3) PacketShader ( http://shader.kaist.edu/packetshader/ ) uses batches too.
Every few years the balance tilts back and forth between
process-to-completion
and batch processing. DragonFly went with a batch-lite token-passing
approach
throughout their kernel. It seems it didn't work out to the extent
they expected.
There are other, more successful solutions with _much_ better results
(100x faster than our code, for example).
Now many parts are moving back to the more traditional locking approach.
Currently a significant number of drivers support interrupt moderation,
permitting several/tens/hundreds of packets to be received per interrupt.
But they've also started to provide multiple queues.
Yes, but the hashing function is pre-defined, and bursty flows can still
fall on a single CPU.
For each packet we have to run some basic checks, PFIL hooks, netisr
code and L3 code, resulting in many locks being acquired/released per
packet.
Right, on the other hand you'll likely run into serious interlock and
latency
issues when large batches of packets monopolize certain locks,
preventing other
interfaces from sending their batches up.
Typically we rely on the NIC to put a packet into a given queue (direct
ISR dispatch), which works badly for non-hashable types of traffic like
GRE, PPPoE and MPLS. Additionally, the hashing function is either
standard (the M$ NDIS one) or documented, permitting someone malicious
to generate 'special' traffic that matches a single queue.
Malicious traffic is always a problem, no matter how many queues you
have.
Currently, even if we add an m2flowid/m2cpu function able to hash, say,
GRE or MPLS, it is inefficient since we have to lock/unlock the netisr
queues for every packet.
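To show what such a function would have to do for MPLS, something along
these lines (a made-up sketch; mpls_flowid() and the shim parsing are
illustrative only, not an existing API):

    #include <sys/types.h>
    #include <sys/endian.h>

    #define MPLS_LABEL(shim)        (((shim) >> 12) & 0xfffff)
    #define MPLS_BOS(shim)          (((shim) >> 8) & 0x1)

    /*
     * Illustrative software flow hash over an MPLS label stack (FNV-1a
     * over the labels until bottom-of-stack).  The netisr glue would
     * then pick a queue with (flowid % number of workers).  Real code
     * must also verify the mbuf actually contains 'nwords' shim words.
     */
    static uint32_t
    mpls_flowid(const uint32_t *shim, int nwords)
    {
            uint32_t w, hash = 2166136261u;
            int i;

            for (i = 0; i < nwords; i++) {
                    w = be32toh(shim[i]);
                    hash ^= MPLS_LABEL(w);
                    hash *= 16777619u;
                    if (MPLS_BOS(w))
                            break;          /* bottom of stack */
            }
            return (hash);
    }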
Yes, however I'm arguing that our locking strategy may be broken or
sub-optimal.
Various solutions report, say, 50 Mpps (or a scalable 10-15 Mpps per
core) of IPv4 forwarding. Currently the stock kernel can do ~1 Mpps.
There are, of course, other reasons (structure alignment, radix, ARP
code), but the difference is _too_ huge.
I'm thinking of:
* utilizing the m_nextpkt field in the mbuf header
OK. That's what it is there for.
* adding some nh_chain flag to netisr; if a given netisr does not
support the flag and nextpkt is not NULL we simply call that netisr in
a loop.
* having the netisr hash function accept an mbuf 'chain' plus a pointer
to an array (sizeof N * ptr), sort the mbufs into N netisr queues and
save the list heads to the supplied array; after that we put the
resulting lists on the appropriate queues (rough sketch below).
* teaching the ethersubr RX code to deal with mbuf chains (not an easy one)
* adding some partial support for handling chains to the fastfwd code
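As a very rough illustration of the sorting step (untested sketch;
struct mbuf and m_nextpkt are real, mbuf_queue_hash() and the list type
are made up for the example):

    #include <sys/param.h>
    #include <sys/mbuf.h>

    struct mbuf_list {
            struct mbuf *head;
            struct mbuf *tail;
    };

    /* made-up hook: whatever m2flowid/m2cpu-style hash we end up with */
    extern int mbuf_queue_hash(struct mbuf *m);

    /*
     * Split one incoming chain (linked via m_nextpkt) into per-netisr
     * lists; the caller then takes each queue lock once and appends a
     * whole list instead of locking per packet.
     */
    static void
    chain_demux(struct mbuf *chain, struct mbuf_list *q, int nqueues)
    {
            struct mbuf *m, *next;
            int i;

            for (i = 0; i < nqueues; i++)
                    q[i].head = q[i].tail = NULL;

            for (m = chain; m != NULL; m = next) {
                    next = m->m_nextpkt;
                    m->m_nextpkt = NULL;

                    i = mbuf_queue_hash(m) % nqueues;
                    if (q[i].tail == NULL)
                            q[i].head = m;
                    else
                            q[i].tail->m_nextpkt = m;
                    q[i].tail = m;
            }
    }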
I really don't think this is going to help much. You're just adding a
lot of
latency and context switches to the whole packet path. Also you're making it
much more complicated.
The interface drivers, and how they manage the boundary between the RX
ring and
the stack, are not optimal yet. I think there's a lot of potential
there. In
my tcp_workqueue branch I started to experiment with a couple of
approaches.
It's not complete yet though.
The big advantage of having the interface RX thread pushing the
packets is
that it provides a natural feedback loop regarding system load. Once you
have more packets coming in than you can process, the RX DMA ring gets
naturally starved and the load is stabilized on the input side, preventing
a live-lock that can easily happen in batch mode.
I see no difference here.
Only a well-adjusted driver works properly though, and we don't have any
yet in that regard.
That's true.
Before we start to invent complicated mbuf batching methods let's make
sure
that the single-packet path is at its maximal possible efficiency. And only
then evaluate more complicated approaches on whether they deliver
additional
gains.
From that it follows that we should:
1. fix the longest prefix match radix to minimize cache misses.
Personally I think that 'fix' means rewriting it entirely...
For example, a common solution for IPv6 lookup is to use the fact that
you have either /64-or-wider routes or host routes, so the radix lookup
is done on the first 64 bits (and a given element can contain another
tree if there are several more-specific routes).
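In pseudo-C the lookup looks roughly like this (sketch only;
lpm64_lookup(), subtree_lookup() and the structures are invented for
illustration):

    #include <sys/types.h>
    #include <string.h>
    #include <netinet/in.h>

    struct ip6_fib;                         /* invented container type */
    struct subtree;                         /* tree of more-specific routes */

    struct ip6_fib_ent {
            void            *nexthop;        /* result for the plain /64 case */
            struct subtree  *more_specifics; /* NULL in the common case */
    };

    /* invented helpers for the sketch */
    struct ip6_fib_ent *lpm64_lookup(struct ip6_fib *, uint64_t);
    void *subtree_lookup(struct subtree *, const struct in6_addr *);

    static void *
    ip6_lookup(struct ip6_fib *fib, const struct in6_addr *dst)
    {
            struct ip6_fib_ent *e;
            uint64_t hi;

            memcpy(&hi, dst, sizeof(hi));   /* first 64 bits of the address */
            e = lpm64_lookup(fib, hi);      /* LPM on the /0../64 part */
            if (e == NULL)
                    return (NULL);
            if (e->more_specifics != NULL)  /* rare: /65../128 routes present */
                    return (subtree_lookup(e->more_specifics, dst));
            return (e->nexthop);
    }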
That's why I'm talking about the RIB/FIB approach (one common 'academic'
implementation for control, and family-dependent, efficient lookup code).
We also need to fix the fundamental rte usage paradigm; currently it is
unsafe for both the ingress and the egress interface...
2. fix the drivers to optimize RX dequeuing and TX enqueuing.
3. have a critical look at other parts of the packet path to avoid
For IPv4 fastforwarding we have:
1) RX ring mtx lock, (BPF rlock), (L3 PFIL in, FW lock), ifaddr rlock,
radix rlock, rte mtx_lock (twice by default), (L3 PFIL out, FW lock),
ARP rlock, ARP entry rlock, TX ring lock?
(And +2 rlocks for VLAN, and another 2 for LAGG.)
There was an RX ring lock/unlock thread for the 8299 that ended with nothing.
I've changed BPF's 2 mtx_locks to 1 rlock.
There are patches permitting IPFW to use the PFIL lock (however they are
not committed due to the possibility that we can make PFIL lockless).
I've removed 1 rte mtx_lock (dynamic routes, turned off if forwarding is ON).
I'm working on an ARP stack rewrite: the first stage is to enable L2
multipath (lagg-aware) and remove the ARP entry lock from the forwarding
path; the second stage is to store a pointer to the full L2 prepend
header in the rte (yes, that was removed in 7.x).
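What I mean by the cached L2 prepend, roughly (field and function names
are made up; M_PREPEND/mtod are the real mbuf macros):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    struct rte_l2_prepend {
            uint8_t len;            /* valid bytes in hdr[] */
            uint8_t hdr[32];        /* precomputed ether(+vlan) header */
    };

    /*
     * Hot path becomes a single copy of the cached header instead of
     * an ARP table lookup plus rlock per packet.  The ARP code would
     * be the one updating hdr[] when the neighbour entry changes.
     */
    static inline struct mbuf *
    rte_prepend_l2(struct mbuf *m, const struct rte_l2_prepend *p)
    {
            M_PREPEND(m, p->len, M_NOWAIT);
            if (m == NULL)
                    return (NULL);  /* allocation failed, chain freed */
            memcpy(mtod(m, char *), p->hdr, p->len);
            return (m);
    }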
I'm a bit stuck for other ideas to eliminate the remaining locks (except
simply moving the forwarding code to a userland netmap-based solution
like others do).
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"