From: Willem de Bruijn <will...@google.com>

Segmentation offload reduces cycles/byte for large packets by
amortizing the cost of protocol stack traversal.

This patchset implements GSO for UDP. A process can concatenate and
submit multiple datagrams to the same destination in one send call
by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
or passing an analogous cmsg at send time.

The stack will send the entire large (up to network layer max size)
datagram through the protocol layer. At the GSO layer, it is broken
up into individual segments. All receive the same network layer header
and UDP src and dst port. All but the last segment have the same UDP
header, but the last may differ in length and checksum.

This initial patchset is RFC. A few open items:

* MSG_MORE

  The feature requires UDP checksum offload, as without it the
  checksum + copy operation at send() time is likely cheaper than
  checksumming each segment in the GSO layer.

  UDP checksum offload is disabled with MSG_MORE. As a result, GSO
  only works in the lockless fast path.

  The patchset can be simplified if MSG_MORE is explicitly excluded.
  For one, patch 1 can be dropped by passing ipcm to udp_send_skb
  instead of inet_cork.

* MSG_ZEROCOPY

  UDP zerocopy has been sent for review before. Completion
  notification cost exceeds the savings from copy avoidance for
  datagrams of regular MSS (< 1500B).

  UDP GSO enables building larger packets, at which point zerocopy
  becomes effective. Results with the current benchmark are not as
  great as from GSO itself, though that may say more about the
  benchmark.

  Either way, I do not intend to submit this separate feature as
  part of a final UDP GSO patchset.

* GSO_BY_FRAGS

  An alternative implementation that would allow non-uniform
  segment length is to use GSO_BY_FRAGS like SCTP. This would
  likely require MSG_MORE to build the list using multiple
  send calls (or one sendmmsg). The two approaches are not
  mutually exclusive, so that could be a follow-up.

Initial results show a significant reduction in UDP cycles/byte.
See the main patch for more details and benchmark results.

    udp          876 MB/s  14873 msg/s  624666 calls/s
                11,205,777,429 cycles

    udp gso     2139 MB/s  36282 msg/s   36282 calls/s
                11,204,374,561 cycles

The patch set is broken down as follows:
- patch 1 is a prerequisite: code rearrangement, noop otherwise
- patch 2 is the core feature
- patch 3,4,6 are refinements
- patch 5 adds the cmsg interface
- patch 7 adds udp zerocopy
- patch 8..11 are tests

This idea was presented previously at netconf 2017-2
http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

Known limitations:

- The feature requires pacing and possibly a lower threshold on
  segment size to limit the number of segments that may be passed
  to the NIC at once.

- Even when only accepting datagrams with CHECKSUM_PARTIAL, the
  segmentation layer must drop or fall back to software checksumming
  if the device cannot checksum the packet. This can happen if a
  device advertises checksum offload in general, but removes it for
  this skb in .ndo_features_check.

Willem de Bruijn (11):
  udp: expose inet cork to udp
  udp: add gso
  udp: better wmem accounting on gso
  udp: paged allocation with gso
  udp: add gso segment cmsg
  udp: add gso support to virtual devices
  udp: zerocopy
  selftests: udp gso
  selftests: udp gso with connected sockets
  selftests: udp gso with corking
  selftests: udp gso benchmark

 include/linux/netdev_features.h               |   3 +
 include/linux/skbuff.h                        |  10 +
 include/linux/udp.h                           |   1 +
 include/net/inet_sock.h                       |   1 +
 include/net/ip.h                              |   3 +-
 include/net/ipv6.h                            |   2 +
 include/net/udp.h                             |   5 +
 include/uapi/linux/udp.h                      |   1 +
 net/core/skbuff.c                             |  14 +-
 net/core/sock.c                               |   5 +-
 net/ipv4/af_inet.c                            |   2 +-
 net/ipv4/ip_output.c                          |  63 +-
 net/ipv4/udp.c                                |  78 ++-
 net/ipv4/udp_offload.c                        |  63 ++
 net/ipv6/ip6_offload.c                        |   5 +-
 net/ipv6/ip6_output.c                         |  66 +-
 net/ipv6/udp.c                                |  29 +-
 net/ipv6/udp_offload.c                        |  14 +
 tools/testing/selftests/net/.gitignore        |   3 +
 tools/testing/selftests/net/Makefile          |   3 +-
 tools/testing/selftests/net/udpgso.c          | 621 ++++++++++++++++++
 tools/testing/selftests/net/udpgso.sh         |  31 +
 tools/testing/selftests/net/udpgso_bench.sh   |  74 +++
 tools/testing/selftests/net/udpgso_bench_rx.c | 265 ++++++++
 tools/testing/selftests/net/udpgso_bench_tx.c | 379 +++++++++++
 25 files changed, 1689 insertions(+), 52 deletions(-)
 create mode 100644 tools/testing/selftests/net/udpgso.c
 create mode 100755 tools/testing/selftests/net/udpgso.sh
 create mode 100755 tools/testing/selftests/net/udpgso_bench.sh
 create mode 100644 tools/testing/selftests/net/udpgso_bench_rx.c
 create mode 100644 tools/testing/selftests/net/udpgso_bench_tx.c

-- 
2.17.0.484.g0c8726318c-goog
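
For illustration only, and not part of the series: a minimal userspace
sketch of the SOL_UDP/UDP_SEGMENT socket option and of the segment
arithmetic described in the cover text (uniform segments, with only the
last one allowed to be shorter). The helper names are hypothetical. The
fallback value 103 for UDP_SEGMENT and the int-sized option argument
are assumptions taken from the uapi header as later merged upstream;
they may not match this RFC exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* SOL_UDP, and UDP_SEGMENT on newer libcs */
#include <sys/socket.h>    /* setsockopt */

/* Assumption: fallback values for older libc headers. */
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif

/*
 * Hypothetical helper: ask the stack to split every large send on this
 * socket into datagrams carrying gso_size bytes of payload each.
 * Returns 0 on success, -1 with errno set otherwise (for example on a
 * kernel without the feature).
 */
int udp_gso_enable(int fd, int gso_size)
{
	return setsockopt(fd, SOL_UDP, UDP_SEGMENT,
			  &gso_size, sizeof(gso_size));
}

/* Hypothetical helper: number of datagrams one send would produce. */
size_t udp_gso_segment_count(size_t payload_len, size_t gso_size)
{
	assert(gso_size > 0);
	return (payload_len + gso_size - 1) / gso_size;	/* round up */
}

/*
 * Hypothetical helper: payload length of the last datagram. Every
 * earlier datagram carries exactly gso_size bytes.
 */
size_t udp_gso_last_segment_len(size_t payload_len, size_t gso_size)
{
	size_t rem;

	assert(gso_size > 0);
	rem = payload_len % gso_size;
	if (payload_len == 0)
		return 0;
	return rem ? rem : gso_size;
}
```

Under these assumptions, a single send of 4000 bytes on a socket
configured with a segment size of 1400 would leave the GSO layer as
three datagrams carrying 1400, 1400 and 1200 bytes of payload.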