On Thu, May 11, 2006 at 12:30:32PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote:
> On Thu, May 11, 2006 at 12:07:21AM -0700, David S. Miller ([EMAIL PROTECTED]) wrote:
> > You can test with single stream, but then you are only testing
> > in-cache case. Try several thousand sockets and real load from many
> > unique source systems, it becomes interesting then.
>
> I can test system with large number of streams, but unfortunately only
> from small number of different src/dst ip addresses, so I can not
> benchmark route lookup performance in layered design.
I've run it with 200 UDP sockets in the receive path. There were two load
generator machines with 100 clients each. There are no copies of skb->data
in recvmsg(). Since I only have a 1Gb link I'm unable to provide each client
with high bandwidth, so they send 4k chunks. Performance dropped by half,
down to 55 MB/sec, and CPU usage increased noticeably (slowly drifting
between 12 and 8%, compared to 2% with one socket), but I believe this is
not a cache effect, rather the much higher number of syscalls per second.

Here is the profile result:

samples  %        symbol name
1463625  78.0003  poll_idle
  19171   1.0217  _spin_lock_irqsave
  15887   0.8467  _read_lock
  14712   0.7840  kfree
  13370   0.7125  ip_frag_queue
  11896   0.6340  delay_pmtmr
  11811   0.6294  _spin_lock
  11723   0.6247  csum_partial
  11399   0.6075  ip_frag_destroy
  11063   0.5896  serial_in
  10533   0.5613  skb_release_data
  10524   0.5609  ip_route_input
  10319   0.5499  __alloc_skb
   9903   0.5278  ip_defrag
   9889   0.5270  _read_unlock
   9536   0.5082  _write_unlock
   8639   0.4604  _write_lock
   7557   0.4027  netif_receive_skb
   6748   0.3596  ip_frag_intern
   6534   0.3482  preempt_schedule
   6220   0.3315  __kmalloc
   6005   0.3200  schedule
   5924   0.3157  irq_entries_start
   5823   0.3103  _spin_unlock_irqrestore
   5678   0.3026  ip_rcv
   5410   0.2883  __kfree_skb
   5056   0.2694  kmem_cache_alloc
   5014   0.2672  kfree_skb
   4900   0.2611  eth_type_trans
   4067   0.2167  kmem_cache_free
   3532   0.1882  udp_recvmsg
   3531   0.1882  ip_frag_reasm
   3331   0.1775  _read_lock_irqsave
   3327   0.1773  ipq_kill
   3304   0.1761  udp_v4_lookup_longway

I'm going to resurrect the zero-copy sniffer project [1] and create a
special socket option which would allow inserting the pages containing
skb->data into the process VMA using VM remapping tricks. Unfortunately
this requires TLB flushing, and there will probably be no significant
performance/CPU gain, if any, but I think it is the only way to provide
zero-copy receive access with hardware which does not support header
split. A rough sketch of the remapping step is appended at the end of
this mail.

The other idea, which I will try if I understood you correctly, is to
create a unified cache. I think some interesting results can be obtained
from the following approach: in the softirq we do not process skb->data
at all, but only fetch the src/dst/sport/dport/protocol values (that
should touch at most two cache lines; if it is not a fast-path packet,
e.g. something like IPsec, it can be processed as usual) and build some
"initial" cache keyed on that data. The skb is then queued into the
matching "initial" cache entry, and recvmsg() later processes that entry
in process context. A sketch of this is appended below as well.

Back to the drawing board... Thanks for the discussion.

1. zero-copy sniffer
   http://tservice.net.ru/~s0mbre/old/?section=projects&item=af_tlb

--
	Evgeniy Polyakov
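
P.S. To make the remapping idea a bit more concrete, here is a very rough
sketch of the per-skb step. Only vm_insert_page() is an existing kernel
helper; rx_map_skb(), the way uaddr is chosen and any locking around it
are invented for illustration, and whether the page behind skb->data may
legally be mapped into userspace like this is exactly the open question.

#include <linux/mm.h>
#include <linux/skbuff.h>

/*
 * Hypothetical helper: map the page backing skb->data into a user VMA
 * previously set up by mmap() on the socket.  uaddr is assumed to be
 * page aligned and owned by that mapping.
 */
static int rx_map_skb(struct vm_area_struct *vma, struct sk_buff *skb,
                      unsigned long uaddr)
{
        struct page *page = virt_to_page(skb->data);

        /*
         * vm_insert_page() takes its own reference on the page and
         * installs the pte; tearing such mappings down again is where
         * the TLB flushing cost mentioned above comes from.
         */
        return vm_insert_page(vma, uaddr, page);
}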
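
And a sketch of the "initial" cache idea, again with heavy assumptions:
struct flow_entry, the hash table and flow_enqueue() are invented for
illustration (entry creation and locking are not shown); only the
skb/iphdr/jhash helpers are real kernel interfaces. The point is simply
that the softirq touches nothing beyond the addresses, ports and protocol
before queueing the skb for recvmsg().

#include <linux/ip.h>
#include <linux/jhash.h>
#include <linux/list.h>
#include <linux/skbuff.h>

#define FLOW_HASH_BITS  10
#define FLOW_HASH_SIZE  (1 << FLOW_HASH_BITS)

struct flow_entry {
        __be32                  saddr, daddr;
        __be16                  sport, dport;
        u8                      protocol;
        struct sk_buff_head     queue;  /* drained by recvmsg() later */
        struct hlist_node       node;
};

static struct hlist_head flow_hash[FLOW_HASH_SIZE];

static u32 flow_hashfn(const struct iphdr *iph, __be16 sport, __be16 dport)
{
        return jhash_3words((__force u32)iph->saddr, (__force u32)iph->daddr,
                            ((__force u32)sport << 16) | (__force u32)dport,
                            iph->protocol) & (FLOW_HASH_SIZE - 1);
}

/*
 * Called from the softirq path instead of full protocol processing.
 * Returns 0 if the skb was queued on a matching "initial" cache entry,
 * -ENOENT if it is not a fast-path packet and the caller should fall
 * back to the normal receive path.
 */
static int flow_enqueue(struct sk_buff *skb)
{
        const struct iphdr *iph = ip_hdr(skb);
        /* for UDP/TCP the port pair follows the IP header directly */
        const __be16 *ports = (const __be16 *)((const u8 *)iph + iph->ihl * 4);
        struct flow_entry *e;
        u32 h = flow_hashfn(iph, ports[0], ports[1]);

        hlist_for_each_entry(e, &flow_hash[h], node) {
                if (e->saddr == iph->saddr && e->daddr == iph->daddr &&
                    e->sport == ports[0] && e->dport == ports[1] &&
                    e->protocol == iph->protocol) {
                        skb_queue_tail(&e->queue, skb);
                        return 0;
                }
        }
        return -ENOENT;
}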