On Thu, Nov 15, 2018 at 5:52 PM Herbert Xu <herb...@gondor.apana.org.au> wrote: > > On Thu, Nov 15, 2018 at 03:16:02PM -0800, Cong Wang wrote: > > The following evidences indicate this check is likely wrong: > > > > 1. In the assignment "skb->csum_valid = !sum", sum==0 indicates a valid > > checksum. > > > > 2. __skb_checksum_complete() always returns sum, and TCP packets are dropped > > only when it returns non-zero. So non-zero indicates a failure. > > > > 3. In __skb_checksum_validate_complete(), we have a nearly same check, where > > zero is considered as success. > > > > 4. csum_fold() already does the one’s complement, this indicates 0 should > > be considered as a successful validation. > > > > 5. We have triggered this fault for many times, but InCsumErrors field in > > /proc/net/snmp remains 0. > > > > Base on the above, non-zero should be used as a checksum mismatch. > > > > I tested this with mlx5 driver, no warning or InCsumErrors after 1 hour. > > > > Fixes: fb286bb2990a ("[NET]: Detect hardware rx checksum faults correctly") > > Cc: Herbert Xu <herb...@gondor.apana.org.au> > > Cc: Tom Herbert <t...@herbertland.com> > > Cc: Eric Dumazet <eduma...@google.com> > > Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com> > > --- > > net/core/datagram.c | 4 ++-- > > net/core/dev.c | 2 +- > > 2 files changed, 3 insertions(+), 3 deletions(-) > > > > diff --git a/net/core/datagram.c b/net/core/datagram.c > > index 57f3a6fcfc1e..e542a9a212a7 100644 > > --- a/net/core/datagram.c > > +++ b/net/core/datagram.c > > @@ -733,7 +733,7 @@ __sum16 __skb_checksum_complete_head(struct sk_buff > > *skb, int len) > > __sum16 sum; > > > > sum = csum_fold(skb_checksum(skb, 0, len, skb->csum)); > > - if (likely(!sum)) { > > + if (unlikely(sum)) { > > if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) && > > !skb->csum_complete_sw) > > netdev_rx_csum_fault(skb->dev); > > Normally if the hardware's partial checksum is valid then we just > trust it and send the packet along. However, if the partial > checksum is invalid we don't trust it and we will compute the > whole checksum manually which is what ends up in sum.
Not sure if I understand partial checksum here, but it is the CHECKSUM_COMPLETE case which I am trying to fix, not CHECKSUM_PARTIAL. Or you mean the checksum returned by skb_checksum(), that is, checksum from skb->data to skb->data+skb->len. If neither, I am confused. > > netdev_rx_csum_fault is meant to warn about the situation where > a packet with a valid checksum (i.e., sum == 0) was given to us > by the hardware with a partial checksum that was invalid. > > So changing it to sum here is wrong. > So, in other word, a checksum *match* is the intended to detect this HW RX checksum fault? What has been changed in between skb_checksum_init() and tcp_checksum_complete() so that the logic is inverted? Looks like I miss something too obvious to understand the logic. :-/ > Can you give more information as to how you got the warnings with > mlx5? It sounds like there may be a real bug there because if you > are getting the warning then it means that a packet with an invalid > hardware-computed partial checksum passed the manual check and > was actually valid. This implies that either the hardware or the > driver is broken. Sure, my case is nearly same with Pawel's, except I have no vlan: https://marc.info/?l=linux-netdev&m=154086647601721&w=2 None of us has RXFCS, if you are curious whether Eric's fix works for us. There are also a few other reports with conntrack involved: https://marc.info/?l=linux-netdev&m=154134983130200&w=2 https://marc.info/?l=linux-netdev&m=154070099731902&w=2 Thanks.