On 18.03.2013 17:41, Andre Oppermann wrote:
On 18.03.2013 13:20, Alexander V. Chernikov wrote:
On 17.03.2013, at 23:54, Andre Oppermann <an...@freebsd.org> wrote:
On 17.03.2013 19:57, Alexander V. Chernikov wrote:
On 17.03.2013 13:20, Sami Halabi wrote:
OTOH OpenBSD has a complete implementation of MPLS out of the box, maybe
Their control plane code is mostly useless due to its design approach
(routing daemons talk via the kernel).
What's your approach?
It is actually not mine. We have discussed this a bit in the
radix-related thread. Generally quagga/bird (and other high-performance
hardware-accelerated and software routers) have a feature-rich RIB from
which the best routes (possibly multipath) are installed into the kernel FIB.
The kernel's main task should be to do efficient lookups, while every
other advanced feature should be implemented in userland.
Yes, we have started discussing it but haven't reached a conclusion
between the
two philosophies. We have also agreed that the current radix code is
horrible
in terms of cache misses per lookup. That however doesn't preclude an
agnostic
FIB+RIB approach. It's mostly a matter of structure layout to keep it
efficient.
Yes. Additionally, we have problems with misuse of the rtalloc API (rte
grabbing).
My point of view is to use a separate FIB for the 'data plane', i.e.
forwarding, while keeping some kind of higher-level
kernel RIB used for the route socket, multipath, other subsystem
interaction and so on.
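Just to illustrate the kind of split I mean (a rough sketch only, field
names are made up and not taken from any existing patch):

    #include <sys/types.h>
    #include <sys/socket.h>

    /*
     * Sketch only: the FIB entry stays small and read-mostly so a
     * lookup touches as few cache lines as possible; everything the
     * control plane needs (rtsock, multipath selection, refcounting)
     * lives in the RIB entry and never shows up on the hot path.
     */
    struct fib_entry {
            uint32_t         fe_prefix;    /* IPv4 destination prefix */
            uint8_t          fe_plen;      /* prefix length */
            uint16_t         fe_nhop_idx;  /* index into next-hop table */
    };

    struct rib_entry {
            struct fib_entry *re_fib;      /* dataplane state, if installed */
            struct sockaddr  *re_dst;      /* full sockaddr for rtsock users */
            u_long            re_flags;    /* RTF_* flags, multipath marker */
            void             *re_nhops;    /* list of candidate next-hops */
            /* refcounts, timestamps, per-protocol metadata, ... */
    };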
Their data plane code, well.. Yes, we can use some defines from
their headers, but that's all :)
porting it would be shorter and more straightforward than porting the
Linux LDP
implementation of BIRD.
It is not a 'Linux' implementation. LDP itself is cross-platform.
The trickiest part here is the control plane.
However, making _fast_ MPLS switching is tricky too, since it
requires changes in our netisr/ethernet
handling code.
Can you explain what changes you think are necessary and why?
We definitely need the ability to dispatch a chain of mbufs - this was
already discussed in the Intel RX ring lock thread on -net.
Actually I'm not so convinced of that. Packet handling is a tradeoff
between doing process-to-completion on each packet and doing context
switches on batches of packets.
Yes. But I'm talking about a mixed approach (one part done in batches,
to eliminate contention, and then process-to-completion).
Context switches?
Batches are efficient; this is noted explicitly in:
1) Luigi's VALE paper
http://info.iet.unipi.it/~luigi/papers/20121026-vale.pdf (Section 5.2)
2) Intel/6WIND use batches to move packets to their 'netisr' rings in
their DPDK
3) PacketShader ( http://shader.kaist.edu/packetshader/ ) uses batches too.
Every few years the balance tilts back and forth between
process-to-completion
and batch processing. DragonFly went with a batch-lite token-passing
approach
throughout their kernel. It seems it didn't work out to the extent
they expected.
There are other, more successful solutions with _much_ better results
(100x faster than our code, for example).
Now many parts are moving back to the more traditional locking approach.
Currently a significant number of drivers support interrupt moderation,
permitting several/tens/hundreds of packets to be received per interrupt.
But they've also started to provide multiple queues.
Yes, but the hashing function is pre-defined, and bursty flows can still
fall on a single CPU.
For each packet we have to run some basic checks, PFIL hooks, netisr
code and L3 code, resulting in many locks being acquired/released per
packet.
Right, on the other hand you'll likely run into serious interlock and
latency
issues when large batches of packets monopolize certain locks,
preventing other
interfaces from sending their batches up.
Typically we rely on the NIC to put a packet into a given queue (direct
ISR dispatch), which works badly for non-hashable types of traffic like
GRE, PPPoE and MPLS. Additionally, the hashing function is either
standard (the M$ NDIS one) or documented, permitting someone malicious
to generate 'special' traffic that matches a single queue.
Malicious traffic is always a problem, no matter how many queues you
have.
Currently, even if we add an m2flowid/m2cpu function able to hash, say,
GRE or MPLS, it is inefficient since we have to lock/unlock the netisr
queues for every packet.
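To show what such a function would have to do for MPLS, something along
these lines (a made-up sketch; mpls_flowid() and the shim parsing are
illustrative only, not an existing API):

    #include <sys/types.h>
    #include <sys/endian.h>

    #define MPLS_LABEL(shim)        (((shim) >> 12) & 0xfffff)
    #define MPLS_BOS(shim)          (((shim) >> 8) & 0x1)

    /*
     * Illustrative software flow hash over an MPLS label stack (FNV-1a
     * over the labels until bottom-of-stack).  The netisr glue would
     * then pick a queue with (flowid % number of workers).  Real code
     * must also verify the mbuf actually contains 'nwords' shim words.
     */
    static uint32_t
    mpls_flowid(const uint32_t *shim, int nwords)
    {
            uint32_t w, hash = 2166136261u;
            int i;

            for (i = 0; i < nwords; i++) {
                    w = be32toh(shim[i]);
                    hash ^= MPLS_LABEL(w);
                    hash *= 16777619u;
                    if (MPLS_BOS(w))
                            break;          /* bottom of stack */
            }
            return (hash);
    }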
Yes, however I'm arguing that our locking strategy may be broken or
sub-optimal.
Various solutions report, say, 50 Mpps (or a scalable 10-15 Mpps per
core) of IPv4 forwarding. Currently the stock kernel can do ~1 Mpps.
There are, of course, other reasons (structure alignment, radix, ARP
code), but the difference is _too_ huge.
I'm thinking of:
* utilizing the m_nextpkt field in the mbuf header
OK. That's what it is there for.
* adding some nh_chain flag to netisr; if a given netisr does not
support the flag and nextpkt is not NULL we simply call that netisr in
a loop.
* having the netisr hash function accept an mbuf 'chain' plus a pointer
to an array (sizeof N * ptr), sort the mbufs into N netisr queues and
save the list heads to the supplied array; after that we put the
resulting lists on the appropriate queues (rough sketch below).
* teaching the ethersubr RX code to deal with mbuf chains (not an easy one)
* adding some partial support for handling chains to the fastfwd code
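As a very rough illustration of the sorting step (untested sketch;
struct mbuf and m_nextpkt are real, mbuf_queue_hash() and the list type
are made up for the example):

    #include <sys/param.h>
    #include <sys/mbuf.h>

    struct mbuf_list {
            struct mbuf *head;
            struct mbuf *tail;
    };

    /* made-up hook: whatever m2flowid/m2cpu-style hash we end up with */
    extern int mbuf_queue_hash(struct mbuf *m);

    /*
     * Split one incoming chain (linked via m_nextpkt) into per-netisr
     * lists; the caller then takes each queue lock once and appends a
     * whole list instead of locking per packet.
     */
    static void
    chain_demux(struct mbuf *chain, struct mbuf_list *q, int nqueues)
    {
            struct mbuf *m, *next;
            int i;

            for (i = 0; i < nqueues; i++)
                    q[i].head = q[i].tail = NULL;

            for (m = chain; m != NULL; m = next) {
                    next = m->m_nextpkt;
                    m->m_nextpkt = NULL;

                    i = mbuf_queue_hash(m) % nqueues;
                    if (q[i].tail == NULL)
                            q[i].head = m;
                    else
                            q[i].tail->m_nextpkt = m;
                    q[i].tail = m;
            }
    }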
I really don't think this is going to help much. You're just adding a
lot of
latency and context switches to the whole packet path. Also you're making it
much more complicated.
The interface drivers, and how they manage the boundary between the RX
ring and
the stack, are not optimal yet. I think there's a lot of potential
there. In
my tcp_workqueue branch I started to experiment with a couple of
approaches.
It's not complete yet though.
The big advantage of having the interface RX thread pushing the
packets is
that it provides a natural feedback loop regarding system load. Once you
have more packets coming in than you can process, the RX DMA ring gets
naturally starved and the load is stabilized on the input side, preventing
a live-lock that can easily happen in batch mode.
I see no difference here.
Only a well-adjusted driver works properly though, and we don't have any
yet in that regard.
That's true.
Before we start to invent complicated mbuf batching methods let's make
sure
that the single-packet path is at its maximal possible efficiency. And only
then evaluate more complicated approaches on whether they deliver
additional
gains.
From that it follows that we should:
1. fix the longest prefix match radix to minimize cache misses.
Personally I think that 'fix' means rewriting it entirely...
For example, a common solution for IPv6 lookup is to use the fact that
you have either /64-or-wider routes or host routes, so the radix lookup
is done on the first 64 bits (and a given element can contain another
tree if there are several more-specific routes).
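In pseudo-C the lookup looks roughly like this (sketch only;
lpm64_lookup(), subtree_lookup() and the structures are invented for
illustration):

    #include <sys/types.h>
    #include <string.h>
    #include <netinet/in.h>

    struct ip6_fib;                         /* invented container type */
    struct subtree;                         /* tree of more-specific routes */

    struct ip6_fib_ent {
            void            *nexthop;        /* result for the plain /64 case */
            struct subtree  *more_specifics; /* NULL in the common case */
    };

    /* invented helpers for the sketch */
    struct ip6_fib_ent *lpm64_lookup(struct ip6_fib *, uint64_t);
    void *subtree_lookup(struct subtree *, const struct in6_addr *);

    static void *
    ip6_lookup(struct ip6_fib *fib, const struct in6_addr *dst)
    {
            struct ip6_fib_ent *e;
            uint64_t hi;

            memcpy(&hi, dst, sizeof(hi));   /* first 64 bits of the address */
            e = lpm64_lookup(fib, hi);      /* LPM on the /0../64 part */
            if (e == NULL)
                    return (NULL);
            if (e->more_specifics != NULL)  /* rare: /65../128 routes present */
                    return (subtree_lookup(e->more_specifics, dst));
            return (e->nexthop);
    }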
That's why I'm talking about the RIB/FIB approach (one common 'academic'
implementation for control, and family-dependent, efficient lookup code).
We also need to fix the fundamental rte usage paradigm; currently it is
unsafe for both the ingress and the egress interface...
2. fix the drivers to optimize RX dequeuing and TX enqueuing.
3. have a critical look at other parts of the packet path to avoid
For IPv4 fastforwarding we have:
1) RX ring mtx lock, (BPF rlock), (L3 PFIL in, FW lock), ifaddr rlock,
radix rlock, rte mtx_lock (twice by default), (L3 PFIL out, FW lock),
ARP rlock, ARP entry rlock, TX ring lock?
(And +2 rlocks for VLAN, and another 2 for LAGG.)
There was an RX ring lock/unlock thread for the 8299 that ended with nothing.
I've changed BPF's 2 mtx_locks to 1 rlock.
There are patches permitting IPFW to use the PFIL lock (however they are
not committed due to the possibility that we can make PFIL lockless).
I've removed 1 rte mtx_lock (dynamic routes, turned off if forwarding is ON).
I'm working on an ARP stack rewrite: the first stage is to enable L2
multipath (lagg-aware) and remove the ARP entry lock from the forwarding
path; the second stage is to store a pointer to the full L2 prepend
header in the rte (yes, that was removed in 7.x).
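What I mean by the cached L2 prepend, roughly (field and function names
are made up; M_PREPEND/mtod are the real mbuf macros):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/mbuf.h>

    struct rte_l2_prepend {
            uint8_t len;            /* valid bytes in hdr[] */
            uint8_t hdr[32];        /* precomputed ether(+vlan) header */
    };

    /*
     * Hot path becomes a single copy of the cached header instead of
     * an ARP table lookup plus rlock per packet.  The ARP code would
     * be the one updating hdr[] when the neighbour entry changes.
     */
    static inline struct mbuf *
    rte_prepend_l2(struct mbuf *m, const struct rte_l2_prepend *p)
    {
            M_PREPEND(m, p->len, M_NOWAIT);
            if (m == NULL)
                    return (NULL);  /* allocation failed, chain freed */
            memcpy(mtod(m, char *), p->hdr, p->len);
            return (m);
    }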
I'm a bit stuck for other ideas to eliminate the remaining locks (except
simply moving the forwarding code to a userland netmap-based solution
like others do).
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"