On Sun, 24 Dec 2006, Scott Long wrote:

I try this experiment every few years, and generally don't measure much improvement. I'll try it again with 10Gbps early next year once I'm back in the office. The more interesting transition is between the link layer and the network layer, which is high on my list of topics to look into in the next few weeks. In particular, reworking the ifqueue handoff. The tricky bit is balancing latency, overhead, and concurrency...

FYI, there are several sets of patches floating around to modify if_em to hand off queues of packets to the link layer, etc. They probably need updating, of course, since if_em has changed quite a bit in the last year. In my implementation, I add a new input routine that accepts mbuf packet queues.
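
(Purely as an illustration, not the actual patches: a queue-aware input routine along these lines might look something like the sketch below, with packets chained via m_nextpkt. The function name and details are invented for the example; the idea is that the driver makes one call per chain, with per-packet dispatch kept inside the link layer.)

/*
 * Sketch only; not the real patch.  A link-layer input routine that
 * accepts a chain of packets linked through m_nextpkt, rather than a
 * single mbuf per call, so the driver hands up a whole queue at once.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

static void
ether_input_chain(struct ifnet *ifp, struct mbuf *mhead)
{
        struct mbuf *m, *next;

        for (m = mhead; m != NULL; m = next) {
                /* Unlink each packet, then hand it to the existing per-packet path. */
                next = m->m_nextpkt;
                m->m_nextpkt = NULL;
                (*ifp->if_input)(ifp, m);
        }
}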

Have you tested this with more than just your simple netblast and netperf tests? Have you measured CPU usage during your tests? With 10Gb coming, pipelined processing of RX packets is becoming an interesting topic for all OSes and for a number of companies. I understand your feeling that the bottleneck is higher up than just if_input. We'll see how this holds up.

In my previous test runs, I generally tested three scenarios:

(1) Local sink - sinking small and large packet sizes to a single socket at a
    high rate.

(2) Local source - sourcing small and large packet sizes via a single socket
    at a high rate.

(3) IP forwarding - both unidirectional and bidirectional packet streams
    across an IP forwarding host with small and large packet sizes.

From the perspective of optimizing these particular paths, small packet sizes
best reveal processing overhead up to about the TCP/socket buffer layer on modern hardware (DMA, etc.). The uni/bidirectional axis is interesting because it helps reveal the impact of the direct dispatch vs. netisr dispatch choice for the IP layer with respect to exercising parallelism. I didn't explicitly measure CPU, but as the configurations max out the CPUs in my test bed, any significant CPU reduction is typically measurable as an improvement in throughput. For example, I was easily able to measure the CPU reduction from switching from the socket reference to the file descriptor reference in sosend() on small-packet transmit, which was a relatively minor functional change in locking and reference counting.
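
(To make the dispatch-model distinction above concrete, a minimal sketch, assuming the netisr KPI of this era: at the end of link-layer input the packet is either delivered to the IP layer in the calling thread or queued for the netisr thread. The wrapper function and the explicit flag are invented for the sketch; the real netisr_dispatch() already makes a similar decision internally based on the netisr configuration.)

/*
 * Illustration only: the direct dispatch vs. netisr queueing choice.
 * The wrapper and the "direct" argument are made up for this sketch.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>

static void
ip_handoff(struct mbuf *m, int direct)
{
        if (direct) {
                /* Deliver to the IP layer in the calling thread. */
                netisr_dispatch(NETISR_IP, m);
        } else {
                /* Queue for deferred processing in the netisr thread. */
                (void)netisr_queue(NETISR_IP, m);
        }
}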

I have tentative plans to explicitly measuring cycle counts between context switches and during dispatches, but have not yet implemented that in the new setup. I expect to have a chance to set up these new test runs and get back into experimenting with the dispatch model between the device driver, link layer, and network layer sometime in mid-January. As the test runs are very time-consuming, I'd welcome suggestions on the testing before, rather than after, I run them. :-)
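
(Roughly the sort of thing meant here, as an illustration only: accumulate TSC deltas around a dispatch point. The counter names and placement are invented, and real instrumentation would want per-CPU counters to avoid cache-line contention.)

/*
 * Sketch only: count cycles spent in a dispatch path using the TSC.
 * Global counters are used here for brevity; per-CPU counters would
 * be preferable in practice.
 */
#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/netisr.h>
#include <machine/cpufunc.h>    /* rdtsc() */

static uint64_t dispatch_cycles;
static uint64_t dispatch_count;

static void
measured_dispatch(struct mbuf *m)
{
        uint64_t start;

        start = rdtsc();
        netisr_dispatch(NETISR_IP, m);
        dispatch_cycles += rdtsc() - start;
        dispatch_count++;
}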

Robert N M Watson
Computer Laboratory
University of Cambridge
