On Fri, Feb 19, 2016 at 4:08 PM, Jesse Gross <je...@kernel.org> wrote:
> On Fri, Feb 19, 2016 at 3:10 PM, Alex Duyck <adu...@mirantis.com> wrote:
>> On Fri, Feb 19, 2016 at 1:53 PM, Jesse Gross <je...@kernel.org> wrote:
>>> On Fri, Feb 19, 2016 at 11:26 AM, Alexander Duyck
>>> <adu...@mirantis.com> wrote:
>>>> This patch series makes it so that we enable the outer Tx checksum
>>>> for IPv4 tunnels by default.  This makes the behavior consistent
>>>> with how we were handling this for IPv6.  In addition I have
>>>> updated the internal flags for these tunnels so that we use a
>>>> ZERO_CSUM_TX flag for IPv4, which should match up well with the
>>>> ZERO_CSUM6_TX flag which was already in use for IPv6.
>>>>
>>>> For most network devices this should be a net gain in terms of
>>>> performance, as having the outer header checksum present allows
>>>> devices to report CHECKSUM_UNNECESSARY, which we can then convert
>>>> to CHECKSUM_COMPLETE in order to determine if the inner header
>>>> checksum is valid.
>>>>
>>>> Below is some data I collected with ixgbe on an X540 that
>>>> demonstrates this.  I placed two PFs, connected back to back, in
>>>> two different namespaces and then set up a pair of tunnels on
>>>> each, one with checksum enabled and one without.
>>>>
>>>> Recv   Send    Send                          Utilization
>>>> Socket Socket  Message  Elapsed              Send
>>>> Size   Size    Size     Time     Throughput  local
>>>> bytes  bytes   bytes    secs.    10^6bits/s  % S
>>>>
>>>> noudpcsum:
>>>> 87380  16384   16384    30.00    8898.67     12.80
>>>> udpcsum:
>>>> 87380  16384   16384    30.00    9088.47      5.69
>>>>
>>>> The one spot where this may cause a performance regression is if
>>>> the environment contains devices that can parse the inner headers
>>>> and a device supports NETIF_F_GSO_UDP_TUNNEL but not
>>>> NETIF_F_GSO_UDP_TUNNEL_CSUM.  In the case of such a device we have
>>>> to fall back to using GSO to segment the tunnel instead of TSO,
>>>> and as a result we may take a performance hit as seen below with
>>>> i40e.
>>>
>>> Do you have any numbers from 40G links?  Obviously, at 10G the
>>> links are basically saturated, and while I can see a difference in
>>> the utilization rate, I suspect that the change will be much more
>>> apparent at higher speeds.
>>
>> Unfortunately I don't have any true 40G links to test with.  The
>> closest I can get is to run PF to VF on an i40e.  Running that I
>> have seen the numbers go from about 20Gb/s to 15Gb/s, with almost
>> all the difference being related to the fact that we are having to
>> allocate/free more skbs and make more trips through the
>> i40e_lan_xmit_frame function, resulting in more descriptors.
>
> OK, I guess that is more or less in line with what I would expect
> off the top of my head.  There is a reasonably significant drop in
> the worst case.
>
>>> I'm concerned about the drop in performance for devices that
>>> currently support offloads (almost none of which expose
>>> NETIF_F_GSO_UDP_TUNNEL_CSUM as a feature).  Presumably the people
>>> that care most about tunnel performance are the ones that already
>>> have these NICs and will be the most impacted by the drop.
>>
>> The problem is that being able to transmit fast is kind of pointless
>> if the receiving end cannot handle it.  We hadn't gotten around to
>> really getting the Rx checksum bits working until the 3.18 kernel,
>> which I don't suspect many people are running, so at this point
>> messing with the TSO bits isn't really making much of a difference.
>> Then on top of that, most devices have certain limitations on how
>> many ports they can handle and such.
>> I know the i40e is supposed to support something like 10 port
>> numbers, but the fm10k and ixgbe are limited to one port as I
>> recall.  So this whole thing is already really brittle as it is.
>> My goal with this change is to make the behavior more consistent
>> across the board.
>
> That's true to some degree, but there are certainly plenty of cases
> where TSO makes a difference - lower CPU usage, transmitting to
> multiple receivers, people will upgrade their kernels, etc.  It's
> clearly good to make things more consistent, but hopefully not by
> reducing existing performance. :)
>
>>> My hope is that we can continue to use TSO on devices that only
>>> support NETIF_F_GSO_UDP_TUNNEL.  The main problem is that the UDP
>>> length field may vary across segments.  However, in practice this
>>> only happens on the final segment, and only in cases where the
>>> total length is not a multiple of the MSS.  If we could detect
>>> cases where those conditions are met, we could continue to use TSO
>>> with the UDP checksum field pre-populated.  A possible step even
>>> further would be to break off the final segment into a separate
>>> packet to make things conform if necessary.  This would avoid a
>>> performance regression and I think make this more palatable to a
>>> lot of people.
>>
>> I think Tom and I had discussed this possibility a bit at netconf.
>> The GSO logic is something I planned on looking at over the next
>> several weeks, as I suspect there is probably room for improvement
>> there.
>
> That sounds great.
>
>>>> I also haven't investigated the effect this will have on OVS.
>>>> However, I suspect the impact should be minimal, as the worst
>>>> case scenario should be that Tx checksumming will become enabled
>>>> by default, which should be consistent with the existing behavior
>>>> for IPv6.
>>>
>>> I don't think that it should cause any problems.
>>
>> Good to hear.
>>
>> Do you know if OVS has some way to control the VXLAN configuration
>> so that it could disable Tx checksums?  If so, that would probably
>> be a good way to address the 40G issues, assuming someone is
>> running an environment that had nothing but NICs that can support
>> TSO and Rx checksums on inner headers.
>
> Yes - OVS can control tx checksums on a per-endpoint basis
> (actually, rx checksum presence requirements as well, though that's
> not exposed to the user at the moment).  If you had the information
> then you could optimize what to use in an environment of, say,
> hypervisors and hardware switches.
>
> However, it's certainly possible that you have a mixed set of NICs,
> such as an encap-aware NIC on the transmit side and a non-aware one
> on the receive side.  In that case, both possible checksum settings
> penalize somebody: off (lose GRO on the receiver), on (lose TSO on
> the sender, assuming no support for NETIF_F_GSO_UDP_TUNNEL_CSUM).
> That's why I think it's important to be able to use encap TSO with
> local checksum to avoid these bad tradeoffs, not to mention it being
> cleaner.
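To make the tradeoff above concrete: whether the stack can hand one
large tunnel frame to the NIC for TSO, or has to segment it in
software first, comes down to a device-feature check roughly like the
sketch below.  The flag and field names are the mainline ones already
mentioned in this thread, but the function itself is illustrative; the
real decision in the kernel's GSO path (net_gso_ok() and friends)
involves more conditions than shown here.

```c
/*
 * Simplified sketch, not the kernel's actual helper: decide whether a
 * UDP-encapsulated GSO skb can be offloaded to the device as-is.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static bool tunnel_tso_ok(const struct net_device *dev,
			  const struct sk_buff *skb)
{
	/* Outer UDP checksum requested: the hardware has to be able to
	 * fill it in for every segment it produces.
	 */
	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL_CSUM)
		return !!(dev->features & NETIF_F_GSO_UDP_TUNNEL_CSUM);

	/* No outer checksum: plain UDP tunnel segmentation suffices. */
	return !!(dev->features & NETIF_F_GSO_UDP_TUNNEL);
}
```

When a check of this kind fails for a frame marked
SKB_GSO_UDP_TUNNEL_CSUM, the stack falls back to software GSO, which is
where the extra skb allocations and the additional passes through
i40e_lan_xmit_frame described earlier in the thread come from.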
By "local checksum" do you mean LCO?  Seems like we should be able to
get that to work with NETIF_F_GSO_UDP_TUNNEL_CSUM.

Tom
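For readers following along: LCO ("local checksum offload") means
deriving the outer UDP checksum from the inner packet's pre-filled
pseudo-header checksum instead of summing the payload.  The sketch
below shows the arithmetic; it is modeled loosely on the mainline
lco_csum()/udp_set_csum() helpers but simplified, and the example_
function name and explicit outer saddr/daddr parameters are
illustrative rather than the kernel's actual API.

```c
/*
 * Rough sketch of LCO for a tunnel's outer UDP header.  Assumption: the
 * inner transport checksum is being offloaded (CHECKSUM_PARTIAL), so
 * its checksum field already holds the inner pseudo-header sum, and the
 * device will replace it such that the region from csum_start to the
 * end of the packet sums to the complement of that pre-filled value.
 * The outer checksum can therefore be computed without reading the
 * payload at all.
 */
#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/udp.h>
#include <net/checksum.h>

static void example_lco_outer_udp_csum(struct sk_buff *skb,
				       __be32 saddr, __be32 daddr)
{
	struct udphdr *uh = udp_hdr(skb);	/* outer UDP header */
	unsigned char *csum_start = skb->head + skb->csum_start;
	__sum16 inner = *(__sum16 *)(csum_start + skb->csum_offset);
	__wsum sum;

	/* Everything from the inner csum_start onward will end up
	 * summing to the complement of the pre-filled inner checksum.
	 */
	sum = ~csum_unfold(inner);

	/* Add the outer UDP header plus the inner headers that sit
	 * before csum_start (outer check field zeroed for the sum).
	 */
	uh->check = 0;
	sum = csum_partial(uh, csum_start - (unsigned char *)uh, sum);

	/* Fold in the outer pseudo-header and complement the result. */
	uh->check = csum_tcpudp_magic(saddr, daddr, ntohs(uh->len),
				      IPPROTO_UDP, sum);
	if (uh->check == 0)
		uh->check = CSUM_MANGLED_0;	/* 0 means "no checksum" */
}
```

Because the outer checksum never depends on the payload bytes, it can
be pre-computed for equal-sized TSO segments as well, which is what
makes the idea discussed earlier in the thread (pre-populating the
outer UDP checksum on devices that only advertise
NETIF_F_GSO_UDP_TUNNEL) feasible.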