On 29.08.2013 15:49, Adrian Chadd wrote:
Hi,
Hello Adrian!
I'm very sorry for taking so long to reply.
There's a lot of good stuff to review here, thanks!
Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to
keep locking things like that on a per-packet basis. We should be able
to do this in a cleaner way - we can defer RX into a CPU pinned
taskqueue and convert the interrupt handler to a fast handler that
just schedules that taskqueue. We can ignore the ithread entirely here.
What do you think?
Well, it sounds good :) But performance numbers and Jack's opinion are more
important :)
Are you going to Malta?
Totally pie in the sky handwaving at this point:
* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().
For vlan handling, it may end up populating its own list of mbufs to
push up to ether_demux(). So maybe we should extend the API to have a
bitmap of packets to actually handle from the array, so we can pass up
a larger array of mbufs, note which ones are for the destination and
then the upcall can mark which frames it has consumed.
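A minimal userland sketch of the "array plus bitmap" idea described above. Everything here is an assumption made for illustration: `struct pkt` stands in for an mbuf, and `pkt_batch`, `batch_fill()` and `batch_consume_vlan()` are hypothetical names, not an existing or proposed kernel API. It only shows how a consumer could clear bits for the frames it takes, leaving the rest of the batch for the next upcall.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 64            /* one bit per slot in a 64-bit mask */

struct pkt {                    /* hypothetical stand-in for struct mbuf */
    uint16_t vlan_tag;          /* 0 means untagged */
};

struct pkt_batch {
    struct pkt *slots[BATCH_MAX];
    size_t      count;          /* number of valid slots */
    uint64_t    pending;        /* bit i set: slots[i] still needs handling */
};

/* Fill a batch from an array of completed packets; all start as pending. */
static void
batch_fill(struct pkt_batch *b, struct pkt **pkts, size_t n)
{
    assert(b != NULL && (pkts != NULL || n == 0));
    if (n > BATCH_MAX)
        n = BATCH_MAX;          /* excess packets are left to the caller */
    b->count = n;
    b->pending = 0;
    for (size_t i = 0; i < n; i++) {
        b->slots[i] = pkts[i];
        b->pending |= (uint64_t)1 << i;
    }
}

/*
 * Hypothetical upcall: take every pending packet carrying `tag` and
 * clear its bit. Returns the number of packets consumed.
 */
static size_t
batch_consume_vlan(struct pkt_batch *b, uint16_t tag)
{
    size_t consumed = 0;

    assert(b != NULL);
    for (size_t i = 0; i < b->count; i++) {
        uint64_t bit = (uint64_t)1 << i;

        if ((b->pending & bit) != 0 && b->slots[i]->vlan_tag == tag) {
            b->pending &= ~bit;
            consumed++;
        }
    }
    return consumed;
}
```

With this layout, a vlan consumer and the plain demux path can walk the same array in turn, and whatever is still set in `pending` afterwards was not claimed by anyone.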
I specifically wonder how much work/benefit we may see by doing:
* batching packets into lists so various steps can batch process
things rather than run to completion;
* batching the processing of a list of frames under a single lock
instance - eg, if the forwarding code could do the forwarding lookup
for 'n' packets under a single lock, then pass that list of frames up
to inet_pfil_hook() to do the work under one lock, etc, etc.
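To make the "one lock for 'n' packets" point concrete, here is a small userland sketch using a pthread mutex. The toy table (exact match on the top octet of the destination), `struct fwd_table` and the `fwd_lookup_*` functions are all hypothetical and have nothing to do with the real routing code; the `lock_acquisitions` counter exists only to show the difference between the two styles.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NROUTES 4

/* Hypothetical toy forwarding table: exact match on the top octet. */
struct fwd_table {
    pthread_mutex_t lock;
    unsigned long   lock_acquisitions;  /* instrumentation only */
    uint8_t         prefix[NROUTES];
    int             ifindex[NROUTES];
};

static void
fwd_table_init(struct fwd_table *t, const uint8_t *prefix, const int *ifindex)
{
    int rc = pthread_mutex_init(&t->lock, NULL);

    assert(rc == 0);
    (void)rc;
    t->lock_acquisitions = 0;
    for (size_t i = 0; i < NROUTES; i++) {
        t->prefix[i] = prefix[i];
        t->ifindex[i] = ifindex[i];
    }
}

/* Caller must hold t->lock. Returns the egress ifindex, or -1 on a miss. */
static int
fwd_lookup_locked(const struct fwd_table *t, uint32_t dst)
{
    for (size_t i = 0; i < NROUTES; i++)
        if (t->prefix[i] == (uint8_t)(dst >> 24))
            return t->ifindex[i];
    return -1;
}

/* Per-packet style: one lock round trip per destination. */
static void
fwd_lookup_each(struct fwd_table *t, const uint32_t *dst, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pthread_mutex_lock(&t->lock);
        t->lock_acquisitions++;
        out[i] = fwd_lookup_locked(t, dst[i]);
        pthread_mutex_unlock(&t->lock);
    }
}

/* Batched style: a single lock round trip covers the whole list. */
static void
fwd_lookup_batch(struct fwd_table *t, const uint32_t *dst, int *out, size_t n)
{
    pthread_mutex_lock(&t->lock);
    t->lock_acquisitions++;
    for (size_t i = 0; i < n; i++)
        out[i] = fwd_lookup_locked(t, dst[i]);
    pthread_mutex_unlock(&t->lock);
}
```

Both functions produce the same results; the batched one takes the lock once instead of n times, at the cost of holding it for longer.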
I'm thinking along the same lines, but we're stuck on the 'forwarding lookup' due
to the problem with the egress interface pointer, as I mentioned earlier. However,
it would be interesting to see how much it helps, regardless of locking.
Currently I'm thinking that we should try replacing radix with something
different (it seems that this can be checked quickly) and see what happens.
Luigi's performance numbers for our radix are awful, and there is a
patch implementing an alternative trie:
http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
Here, the processing would look less like "grab lock and process to
completion" and more like "mark and sweep" - ie, we have a list of
frames that we mark as needing processing and mark as having been
processed at each layer, so we know where to next dispatch them.
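A rough userland sketch of the "mark and sweep" flow described above, as opposed to run-to-completion. The stage names, `struct frame` fields and layer functions are invented for illustration and do not correspond to real stack entry points. Each frame carries a mark saying which layer should see it next; each layer sweeps the whole list once and re-marks the frames it handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical processing stages; a frame's mark names its next stop. */
enum stage { ST_ETHER, ST_IP, ST_FORWARD, ST_DONE, ST_DROP };

struct frame {
    int        ethertype_ok;    /* stand-in for a real header check */
    int        ttl;
    enum stage next;            /* the "mark" */
};

typedef enum stage (*layer_fn)(struct frame *);

/*
 * Sweep the list once on behalf of `layer`: handle every frame marked
 * for it and re-mark the frame with whatever stage the handler returns.
 * Returns the number of frames handled in this sweep.
 */
static size_t
sweep(struct frame *frames, size_t n, enum stage layer, layer_fn fn)
{
    size_t handled = 0;

    assert(frames != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (frames[i].next == layer) {
            frames[i].next = fn(&frames[i]);
            handled++;
        }
    }
    return handled;
}

static enum stage
layer_ether(struct frame *f)
{
    return f->ethertype_ok ? ST_IP : ST_DROP;
}

static enum stage
layer_ip(struct frame *f)
{
    if (f->ttl <= 1)
        return ST_DROP;         /* expired: never reaches forwarding */
    f->ttl--;
    return ST_FORWARD;
}

static enum stage
layer_forward(struct frame *f)
{
    (void)f;                    /* transmit would happen here */
    return ST_DONE;
}

/* One pass per layer over the whole batch. */
static void
process_batch(struct frame *frames, size_t n)
{
    sweep(frames, n, ST_ETHER, layer_ether);
    sweep(frames, n, ST_IP, layer_ip);
    sweep(frames, n, ST_FORWARD, layer_forward);
}
```

Because each sweep touches only frames marked for its layer, a layer could take its lock once around the sweep rather than once per frame.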
I still have some tool coding to do with PMC before I even think about
tinkering with this as I'd like to measure stuff like per-packet
latency as well as top-level processing overhead (ie,
CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
interrupts on that core, etc.)
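The normalisation mentioned above (cycles divided by packet counts) could be computed from two counter snapshots roughly like this. It is only a sketch: `struct sample` and `cycles_per_packet()` are hypothetical, and how the cycle and packet counters are actually collected (PMC, interface statistics) is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of monotonically increasing counters. */
struct sample {
    uint64_t cycles;            /* e.g. an unhalted-cycles counter */
    uint64_t rx_pkts;
    uint64_t tx_pkts;
};

/*
 * Cycles spent per packet (RX + TX) between two snapshots.
 * Returns 0.0 if no packets were seen in the interval.
 */
static double
cycles_per_packet(const struct sample *before, const struct sample *after)
{
    uint64_t pkts, cycles;

    assert(before != NULL && after != NULL);
    pkts = (after->rx_pkts - before->rx_pkts) +
        (after->tx_pkts - before->tx_pkts);
    cycles = after->cycles - before->cycles;
    if (pkts == 0)
        return 0.0;
    return (double)cycles / (double)pkts;
}
```

The same shape works for cycles per byte or cycles per interrupt by swapping the denominator.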
That will be great to see!
Thanks,
-adrian
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"