While testing my inet defrag changes, I found that senders could spend
~20% of cpu cycles in skb_set_owner_w() updating sk->sk_wmem_alloc for
every fragment they cook, competing with TX completion of prior skbs,
possibly happening on other cpus.
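For context, the per-fragment cost comes from skb_set_owner_w(); a
simplified version (not the exact net/core/sock.c code, which also
handles non-full sockets and skb hashing) looks like:

#include <linux/skbuff.h>
#include <net/sock.h>

void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	skb->destructor = sock_wfree;
	/* one atomic RMW per fragment, on the same field that
	 * TX completion decrements via sock_wfree() */
	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
}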
One solution to this problem is to use alloc_skb() instead of
sock_wmalloc() and manually perform a single sk_wmem_alloc change,
as sketched below. This greatly increases speed for applications
sending big UDP datagrams.

Eric Dumazet (2):
  ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
  ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()

 net/ipv4/ip_output.c  | 17 ++++++++++++-----
 net/ipv6/ip6_output.c | 17 ++++++++++++-----
 2 files changed, 24 insertions(+), 10 deletions(-)

-- 
2.17.0.rc1.321.gba9d0f2565-goog
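A rough sketch of the batching pattern, not the actual
__ip_append_data() diff: build_datagram_frags() and its queue handling
are made up for the example, while alloc_skb(), sock_wfree() and
refcount_add() are the real kernel APIs.

#include <linux/skbuff.h>
#include <net/sock.h>

static int build_datagram_frags(struct sock *sk,
				struct sk_buff_head *queue,
				int nfrags, unsigned int frag_len)
{
	unsigned int wmem_alloc_delta = 0;
	struct sk_buff *skb;
	int i;

	for (i = 0; i < nfrags; i++) {
		/* alloc_skb() leaves sk->sk_wmem_alloc untouched */
		skb = alloc_skb(frag_len, sk->sk_allocation);
		if (!skb)
			break;
		skb->sk = sk;
		skb->destructor = sock_wfree;
		wmem_alloc_delta += skb->truesize;
		__skb_queue_tail(queue, skb);
	}

	/* Single atomic update for the whole datagram; charging the
	 * delta even on failure keeps accounting balanced when
	 * sock_wfree() later runs for the skbs already queued. */
	if (wmem_alloc_delta)
		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
	return i == nfrags ? 0 : -ENOBUFS;
}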