[Kernel-packages] [Bug 1840854] Re: mlx5_core reports hardware checksum error for padded packets on Mellanox NICs

Jens Elkner Mon, 16 Mar 2020 07:56:59 -0700

We run 'Linux kino6 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux' (Ubuntu 18.04.3 LTS) and
constantly see 'hw csum failure' messages:


[ +28.297139] kino6_0: hw csum failure
[  +0.003607] CPU: 12 PID: 0 Comm: swapper/12 Tainted: P           O     4.15.0-58-generic #64-Ubuntu
[  +0.000003] Hardware name: GIGABYTE G291-281-00/MG51-G21-00, BIOS R06 11/19/2019
[  +0.000001] Call Trace:
[  +0.000002]  <IRQ>
[  +0.000011]  dump_stack+0x63/0x8b
[  +0.000008]  netdev_rx_csum_fault+0x38/0x40
[  +0.000003]  __skb_checksum_complete+0xbc/0xd0
[  +0.000005]  nf_ip_checksum+0xc3/0xf0
[  +0.000018]  tcp_error+0x162/0x1c0 [nf_conntrack]
[  +0.000006]  ? ttwu_do_wakeup+0x1e/0x140
[  +0.000011]  nf_conntrack_in+0x14f/0x500 [nf_conntrack]
[  +0.000007]  ? csum_partial_ext+0x9/0x10
[  +0.000007]  ? __skb_checksum+0x6b/0x300
[  +0.000006]  ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
[  +0.000005]  nf_hook_slow+0x48/0xc0
[  +0.000004]  ? skb_send_sock+0x50/0x50
[  +0.000005]  ip_rcv+0x2fa/0x360
[  +0.000003]  ? inet_del_offload+0x40/0x40
[  +0.000004]  __netif_receive_skb_core+0x432/0xb40
[  +0.000003]  ? update_curr+0xf2/0x1d0
[  +0.000004]  ? tcp4_gro_receive+0x137/0x1a0
[  +0.000003]  __netif_receive_skb+0x18/0x60
[  +0.000003]  ? __netif_receive_skb+0x18/0x60
[  +0.000003]  netif_receive_skb_internal+0x45/0xe0
[  +0.000004]  napi_gro_receive+0xc5/0xf0
[  +0.000036]  mlx5e_handle_rx_cqe_mpwrq+0x465/0x860 [mlx5_core]
[  +0.000028]  mlx5e_poll_rx_cq+0xd1/0x8b0 [mlx5_core]
[  +0.000025]  mlx5e_napi_poll+0x9d/0x290 [mlx5_core]
[  +0.000004]  net_rx_action+0x140/0x3a0
[  +0.000005]  __do_softirq+0xe4/0x2d4
[  +0.000006]  irq_exit+0xc5/0xd0
[  +0.000003]  do_IRQ+0x8a/0xe0
[  +0.000003]  common_interrupt+0x8c/0x8c
[  +0.000002]  </IRQ>
[  +0.000005] RIP: 0010:cpuidle_enter_state+0xa7/0x2f0
[  +0.000002] RSP: 0018:ffffad7a00283e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
[  +0.000004] RAX: ffff89f13fd22840 RBX: 000002273da74e91 RCX: 000000000000001f
[  +0.000002] RDX: 000002273da74e91 RSI: fffffeba65558937 RDI: 0000000000000000
[  +0.000002] RBP: ffffad7a00283ea8 R08: 0000000000000004 R09: 0000000000022080
[  +0.000001] R10: ffffad7a00283e38 R11: 000007e0dcda6658 R12: ffffcd5a00503298
[  +0.000002] R13: 0000000000000003 R14: ffffffffb3f72e78 R15: 0000000000000000
[  +0.000004]  ? cpuidle_enter_state+0x97/0x2f0
[  +0.000003]  cpuidle_enter+0x17/0x20
[  +0.000005]  call_cpuidle+0x23/0x40
[  +0.000004]  do_idle+0x18c/0x1f0
[  +0.000004]  cpu_startup_entry+0x73/0x80
[  +0.000005]  start_secondary+0x1ab/0x200
[  +0.000005]  secondary_startup_64+0xa5/0xb0

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840854

Title:
  mlx5_core reports hardware checksum error for padded packets on
  Mellanox NICs

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1840854

  [Impact]

  On machines equipped with Mellanox NICs (in this particular case,
  Mellanox 5 series NICs using the mlx5_core driver), installing
  4.15.0-56 or later produces the following kernel splat:

  bond0: hw csum failure
  CPU: 63 PID: 2473 Comm: in:imklog Tainted: P OE 4.15.0-58-generic #64~16.04.1-Ubuntu
  Call Trace:
  <IRQ>
  dump_stack+0x63/0x8b
  netdev_rx_csum_fault+0x38/0x40
  __skb_checksum_complete+0xc0/0xd0
  nf_ip_checksum+0xca/0xf0
  tcp_error+0xe0/0x1a0 [nf_conntrack]
  ? tcp_v4_rcv+0x7c6/0xa70
  nf_conntrack_in+0xde/0x520 [nf_conntrack]
  ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
  nf_hook_slow+0x48/0xd0
  ? skb_send_sock+0x50/0x50
  ip_rcv+0x30f/0x370
  ? inet_del_offload+0x40/0x40
  __netif_receive_skb_core+0x879/0xba0
  ? tcp4_gro_receive+0x117/0x1b0
  __netif_receive_skb+0x18/0x60
  ? __netif_receive_skb+0x18/0x60
  netif_receive_skb_internal+0x45/0xf0
  napi_gro_receive+0xd0/0xf0
  mlx5e_handle_rx_cqe_mpwrq+0x4a1/0x8a0 [mlx5_core]
  mlx5e_poll_rx_cq+0xc3/0x880 [mlx5_core]
  mlx5e_napi_poll+0x9b/0x280 [mlx5_core]
  net_rx_action+0x265/0x3b0
  __do_softirq+0xf5/0x2a8
  irq_exit+0xca/0xd0
  do_IRQ+0x57/0xe0
  common_interrupt+0x8c/0x8c
  </IRQ>

  In 4.15.0-56, a commit was added from upstream -stable that introduced
  an optimisation for checksumming packets which have had zero bytes
  padded to the end of the packet.

  commit 88078d98d1bb085d72af8437707279e203524fa5
  Author: Eric Dumazet <eduma...@google.com>
  Date: Wed Apr 18 11:43:15 2018 -0700
  subject: net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends

  You can read it here:
  https://github.com/torvalds/linux/commit/88078d98d1bb085d72af8437707279e203524fa5

  It was discussed in this bugzilla link:
  https://bugzilla.kernel.org/show_bug.cgi?id=201849

  This commit causes problems with a number of NIC devices, including Mellanox.
  This is best described by the maintainer, Dimitris Michailidis:

  > > > MLNX devices have an issue with packets that are padded past the end of
  > > > the L3 payload with bytes that aren't all 0s. They use a mode of checksum
  > > > reporting which should be including the padding bytes but MLNX devices
  > > > leave those out. When the padding bytes aren't all 0 this omission causes
  > > > a checksum error. This device behavior has existed for a long time but it
  > > > has begun causing errors only this year. Before a padded packet had its HW
  > > > checksum ignored so it wasn't material what HW had reported. More recently
  > > > padded packet checksums started using the HW value and now it is
  > > > noticeable when that value isn't right.

  Some routers occasionally place additional information in the padding
  bytes, so the padding is not all zeros and the checksum reported by
  the hardware no longer covers what the kernel expects. Since the
  hardware checksum was ignored for padded packets until
  88078d98d1bb085d72af8437707279e203524fa5 landed in 4.15.0-56, this
  wasn't an issue. With the optimisation, the kernel uses the hardware
  checksum for these packets and reports a failure when it does not
  match.
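
The effect of non-zero padding can be illustrated with a small sketch.
It assumes a simplified model in which the device reports a 16-bit
ones' complement sum over the L3 packet only, while the kernel expects
the sum to cover the padding as well. The helper name csum16 and the
sample bytes are made up for illustration; this is not the kernel's
code path.

```python
def csum16(data: bytes) -> int:
    """16-bit ones' complement sum of data (RFC 1071 style, not inverted)."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with one zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


# Hypothetical 40-byte L3 packet followed by 6 bytes of Ethernet padding.
packet = bytes(range(1, 41))
zero_pad = bytes(6)
dirty_pad = b"\x00\x00\x12\x34\x00\x00"

# Assumed device behaviour: the sum covers the packet, not the padding.
hw_sum = csum16(packet)

# All-zero padding does not change the sum, so the omission is harmless.
print(csum16(packet + zero_pad) == hw_sum)
# Non-zero padding changes the sum, so the reported value looks wrong.
print(csum16(packet + dirty_pad) == hw_sum)
```

Under this model the mismatch appears only when the padding carries
non-zero bytes, which is consistent with the failures being
intermittent.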

  [Fix]

  This was fixed for Mellanox 4 and 5 series drivers recently.

  Mellanox 4: 74abc07dee613086f9c0ded9e263ddc959a6de04
  https://github.com/torvalds/linux/commit/74abc07dee613086f9c0ded9e263ddc959a6de04

  Mellanox 5: e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
  https://github.com/torvalds/linux/commit/e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a

  This customer hit the issue with the mlx5_core driver, so the fix is:

  commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
  Author: Cong Wang <xiyou.wangc...@gmail.com>
  Date: Mon Dec 3 22:14:04 2018 -0800
  subject: net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
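
Going by the commit subject, the idea is to stop trusting the device's
full-packet checksum for frames short enough to have been padded, and
to rely on the device's per-protocol validation instead. A rough
sketch of that decision follows. The function name, the boolean input
and the exact length threshold are assumptions for illustration, not
taken from the driver source.

```python
ETH_ZLEN = 60      # minimum Ethernet frame length, excluding the FCS
ETH_FCS_LEN = 4    # frame check sequence length


def rx_checksum_mode(frame_len: int, hw_validated_l4: bool) -> str:
    """Pick a checksum mode for a received frame (illustrative only)."""
    # Assumed threshold: frames this short may carry padding bytes.
    short = frame_len <= ETH_ZLEN + ETH_FCS_LEN
    if short:
        # Do not use the full-packet sum; trust per-protocol validation
        # if the device performed it, otherwise let software verify.
        return "CHECKSUM_UNNECESSARY" if hw_validated_l4 else "CHECKSUM_NONE"
    return "CHECKSUM_COMPLETE"


print(rx_checksum_mode(60, True))
print(rx_checksum_mode(1500, True))
```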

  This is actually present in 4.15.0-59, which is currently sitting in
  -proposed.

  The commits are part of the 4.9.156, 4.14.99 and 4.19.21 upstream
  -stable releases, and have been pulled into bionic as part of
  LP #1837664.

  [Testcase]

  Bring an interface up on a machine with Mellanox 5 series NICs.

  When a packet arrives that is shorter than the minimum Ethernet frame
  size and has had padding added, the problem will be triggered.
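
As a rough guide to which packets qualify, the arithmetic below
assumes untagged Ethernet II framing with a 14-byte header and a
60-byte minimum frame (excluding the FCS). The function name is made
up for illustration.

```python
ETH_HLEN = 14   # untagged Ethernet II header length
ETH_ZLEN = 60   # minimum Ethernet frame length, excluding the FCS


def padding_bytes(l3_len: int) -> int:
    """Padding needed to bring a frame up to the Ethernet minimum."""
    return max(0, ETH_ZLEN - (ETH_HLEN + l3_len))


# Bare TCP ACK over IPv4: 20-byte IP header + 20-byte TCP header.
print(padding_bytes(40))    # prints 6
# Full-size packet: no padding required.
print(padding_bytes(1500))  # prints 0
```

Small control packets such as bare ACKs are therefore the likely
candidates, which fits the tcp_error frames in the traces.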

  The 4.15.0-59 from -proposed has been tested by the customer, and
  resolves the issue.

  [Regression Potential]

  This patch has a low chance of regression since it fixes a regression
  introduced by 88078d98d1bb085d72af8437707279e203524fa5 in 4.15.0-56.

  The changes are limited to the mlx5_core driver, and have been well
  tested and accepted by the community, as shown by their selection for
  upstream -stable.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840854/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
