On Wed, Apr 18, 2018 at 7:17 AM, Paolo Abeni <pab...@redhat.com> wrote:
> On Tue, 2018-04-17 at 16:00 -0400, Willem de Bruijn wrote:
>> From: Willem de Bruijn <will...@google.com>
>>
>> Segmentation offload reduces cycles/byte for large packets by
>> amortizing the cost of protocol stack traversal.
>>
>> This patchset implements GSO for UDP. A process can concatenate and
>> submit multiple datagrams to the same destination in one send call
>> by setting socket option SOL_UDP/UDP_SEGMENT with the segment size,
>> or passing an analogous cmsg at send time.
>>
>> The stack will send the entire large (up to network layer max size)
>> datagram through the protocol layer. At the GSO layer, it is broken
>> up in individual segments. All receive the same network layer header
>> and UDP src and dst port. All but the last segment have the same UDP
>> header, but the last may differ in length and checksum.
>
> This is interesting, thanks for sharing!
>
> I have some local patches somewhere implementing UDP GRO, but I never
> tried to upstream them, since I lacked the associated GSO and I thought
> that the use-case was not too relevant.
>
> Given that your use-case is a connected socket - no per packet route
> lookup - how does GSO performs compared to plain sendmmsg()? Have you
> considered using and/or improving the latter?
>
> When testing with Spectre/Meltdown mitigation in places, I expect that
> the most relevant part of the gain is due to the single syscall per
> burst.

Somewhat to my surprise, the main benefit is actually not route lookup
avoidance.

The benchmark can be run in both connected and unconnected ('-u') mode.
Both saturate the CPU, so I am only showing throughput:

  [connected]   udp tx:    825 MB/s   588336 calls/s  14008 msg/s
  [unconnected] udp tx:    711 MB/s   506646 calls/s  12063 msg/s

This corresponds to the difference of about 15% previously seen with
other applications.

When looking at a perf report, there is no clear hot spot, which
indicates that the savings accrue across the protocol stack traversal.

I just hacked up a sendmmsg extension to the benchmark to verify.
Indeed, that does not have nearly the same benefit as GSO:

  udp tx:    976 MB/s   695394 calls/s  16557 msg/s

This matches the numbers seen from TCP without TSO and GSO. That also
has few system calls, but still incurs a per-MTU stack traversal.

I pushed the branch to my github at

  https://github.com/wdebruij/linux/tree/udpgso-20180418

and also the version I sent for RFC yesterday at

  https://github.com/wdebruij/linux/tree/udpgso-rfc-v1
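
For readers who want to see what the interface in the quoted cover
letter could look like from userspace, here is a minimal sketch. It is
not taken from the patchset. The helper names (udp_enable_gso,
gso_layout_for) are hypothetical, the fallback value 103 for
UDP_SEGMENT and the int-sized option value are assumptions based on the
upstream <linux/udp.h> and may not match the RFC branch, and the
per-call cmsg variant is not shown.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>

/* Assumption: fallback values taken from upstream Linux headers,
 * used only when the libc headers do not define them. */
#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif

/* Hypothetical helper: ask the stack to split every large send on
 * this socket into datagrams of gso_size payload bytes each.
 * Returns 0 on success, -1 with errno set on failure (for example
 * on kernels without UDP GSO support). */
int udp_enable_gso(int fd, int gso_size)
{
    return setsockopt(fd, SOL_UDP, UDP_SEGMENT,
                      &gso_size, sizeof(gso_size));
}

/* How one large send is expected to be cut up at the GSO layer. */
struct gso_layout {
    size_t nsegs;     /* number of datagrams put on the wire */
    size_t last_len;  /* payload bytes in the final datagram */
};

/* Hypothetical helper: all segments carry gso_size payload bytes,
 * except the last one, which carries whatever remains. */
struct gso_layout gso_layout_for(size_t payload_len, size_t gso_size)
{
    struct gso_layout l = { 0, 0 };

    if (payload_len == 0 || gso_size == 0)
        return l;

    l.nsegs = (payload_len + gso_size - 1) / gso_size;
    l.last_len = payload_len - (l.nsegs - 1) * gso_size;
    return l;
}
```

With a segment size of 1472 bytes, a single 4000 byte send would leave
as three datagrams of 1472, 1472 and 1056 bytes. That is why only the
last UDP header can differ in length and checksum, while the stack is
traversed once for the whole buffer instead of once per datagram.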