On Sun, 24 Dec 2006, Scott Long wrote:
I try this experiment every few years, and generally don't measure much
improvement. I'll try it again with 10Gbps early next year once back in
the office again. The more interesting transition is between the link
layer and the network layer, which is high on my list of topics to look
into in the next few weeks. In particular, reworking the ifqueue handoff.
The tricky bit is balancing latency, overhead, and concurrency...
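As a rough illustration of the per-packet pattern being discussed, taking the
transmit side as one example (this is a sketch of the conventional path, not
any particular patch, using the net/if_var.h macros of the era with the
details simplified):

/*
 * Rough sketch, not actual stack code: the classic per-packet handoff
 * through the ifnet send queue.  The network layer enqueues one mbuf and
 * pokes the driver's start routine; the driver dequeues one mbuf at a
 * time under its own lock.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

static int
example_handoff(struct ifnet *ifp, struct mbuf *m)
{
	int error;

	/* Enqueue the packet and kick if_start if appropriate. */
	IFQ_HANDOFF(ifp, m, error);
	return (error);
}

static void
example_start(struct ifnet *ifp)
{
	struct mbuf *m;

	for (;;) {
		IFQ_DRV_DEQUEUE(&ifp->if_snd, m);
		if (m == NULL)
			break;
		/* ... hand the mbuf chain to the hardware here ... */
		m_freem(m);	/* placeholder for the actual transmit */
	}
}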
FYI, there are several sets of patches floating around to modify if_em to
hand off queues of packets to the link layer, etc. They probably need
updating, of course, since if_em has changed quite a bit in the last year.
In my implementation, I add a new input routine that accepts mbuf packet
queues.
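To make that concrete, an input routine taking a chain of packets might look
roughly like the sketch below; the name if_input_chain and the use of
m_nextpkt linkage are illustrative assumptions on my part, not the actual
patch:

/*
 * Illustrative sketch only: accept a batch of packets linked through
 * m_nextpkt so the driver pays the handoff cost once per batch rather
 * than once per packet.  A real implementation would push the whole
 * chain further up the stack instead of unrolling it into the existing
 * per-packet if_input path as done here.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

static void
if_input_chain(struct ifnet *ifp, struct mbuf *mhead)
{
	struct mbuf *m, *next;

	for (m = mhead; m != NULL; m = next) {
		next = m->m_nextpkt;
		m->m_nextpkt = NULL;
		(*ifp->if_input)(ifp, m);	/* existing per-packet path */
	}
}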
Have you tested this with more than just your simple netblast and netperf
tests? Have you measured CPU usage during your tests? With 10Gb coming,
pipelined processing of RX packets is becoming an interesting topic for all
OSes from a number of companies. I understand your feeling about the
bottleneck being higher up than at just if_input. We'll see how this holds
up.
In my previous test runs, I was generally testing three scenarios:
(1) Local sink - sinking small and large packet sizes to a single socket at a
high rate.
(2) Local source - sourcing small and large packet sizes via a single socket
at a high rate (sketched below).
(3) IP forwarding - both unidirectional and bidirectional packet streams
across an IP forwarding host with small and large packet sizes.
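For reference, the local source case boils down to a userland loop along
these lines (a stand-in for netblast/netperf; the address, port, payload
size, and duration are arbitrary example values):

/*
 * Minimal stand-in for a small-packet source test: send fixed-size UDP
 * datagrams through one socket as fast as possible for a fixed duration.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int
main(void)
{
	char payload[18];		/* small UDP payload */
	struct sockaddr_in sin;
	time_t stop;
	int s;

	memset(payload, 0, sizeof(payload));
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(9);	/* discard port, for example */
	sin.sin_addr.s_addr = inet_addr("10.0.0.2");

	s = socket(PF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return (1);

	stop = time(NULL) + 30;		/* 30-second run */
	while (time(NULL) < stop)
		(void)sendto(s, payload, sizeof(payload), 0,
		    (struct sockaddr *)&sin, sizeof(sin));
	close(s);
	return (0);
}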
From the perspective of optimizing these particular paths, small packet sizes
best reveal processing overhead up to about the TCP/socket buffer layer on
modern hardware (DMA, etc). The uni/bidirectional axis is interesting because
it helps reveal the impact of the direct dispatch vs. netisr dispatch choice
for the IP layer with respect to exercising parallelism. I didn't explicitly
measure CPU usage, but since these configurations max out the CPUs in my test
bed, any significant CPU reduction typically shows up as a measurable
improvement in throughput. For example, I was easily able to measure the CPU
reduction in
switching from using the socket reference to the file descriptor reference in
sosend() on small packet transmit, which was a relatively minor functional
change in locking and reference counting.
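For readers following along, the direct vs. netisr dispatch choice mentioned
above comes down to whether the link layer processes the packet synchronously
in the calling thread or defers it to a netisr thread. Very roughly (the
entry point names follow the netisr code of the time, the tunable is
hypothetical, and the details are simplified):

/*
 * Simplified sketch of the dispatch choice at the link/network layer
 * boundary; not actual stack code.  netisr_dispatch() processes the
 * packet synchronously in the caller's context (direct dispatch), while
 * netisr_queue() defers it to the netisr software interrupt thread.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

static int ip_direct_dispatch = 1;	/* hypothetical tunable */

static void
example_ip_handoff(struct mbuf *m)
{
	if (ip_direct_dispatch) {
		/* Lower latency; work done in the caller's context. */
		netisr_dispatch(NETISR_IP, m);
	} else {
		/* More concurrency/batching; extra queue and wakeup cost. */
		netisr_queue(NETISR_IP, m);
	}
}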
I have tentative plans to explicitly measure cycle counts between context
switches and during dispatches, but have not yet implemented that in the new
setup. I expect to have a chance to set up these new test runs and get back
into experimenting with the dispatch model between the device driver, link
layer, and network layer sometime in mid-January. As the test runs are very
time-consuming, I'd welcome suggestions on the testing before, rather than
after, I run them. :-)
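On the cycle-count idea, one low-overhead approach on x86 is to read the TSC
around the dispatch of interest and accumulate the delta. A rough sketch is
below; the counters and the wrapper are hypothetical, and it assumes the
thread isn't migrated between CPUs mid-measurement:

/*
 * Rough sketch of per-dispatch cycle accounting using the x86 TSC via
 * the rdtsc() helper from machine/cpufunc.h.  The counter names and the
 * sampling point are hypothetical, and the accounting is not atomic.
 */
#include <sys/param.h>
#include <machine/cpufunc.h>	/* rdtsc() */

static volatile uint64_t dispatch_cycles;	/* total cycles in dispatch */
static volatile uint64_t dispatch_count;	/* number of dispatches */

static void
measured_dispatch(void (*handler)(void *), void *arg)
{
	uint64_t start, end;

	start = rdtsc();
	handler(arg);
	end = rdtsc();

	dispatch_cycles += end - start;
	dispatch_count++;
}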
Robert N M Watson
Computer Laboratory
University of Cambridge