Hi,

First, thanks for taking a look at this.


On 01/23/2018 01:53 AM, Antoine Tenart wrote:
Hi Jeremy,

On Mon, Jan 22, 2018 at 05:14:27PM -0600, Jeremy Linton wrote:

I'm running 4.15rc7 and hitting the following crash on the MACCHIATObin.
This is 100% reproducible once the adapter is given any load. Within a few
seconds of starting an scp or NFS copy inbound to the machine it dies like
this:


[12544.192436] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing
[12548.513734] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing
[12548.623574] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing

I believe this is the root cause of this issue: txq_done() is scheduled
on the wrong CPU and we know it can't run on 2 CPUs at the same time. We
had a similar issue (same stack trace, different root cause):
082297e61480c4d72ed75b31077e74aca0e7c799

I'm pretty sure I already had that patch. I've rebased to 4.15rc9 and the crash continues. I also cherry-picked "net: mvpp2: only free the TSO header buffers when it was allocated" from net-next, which didn't appear to fix it either.

Thanks,



Thanks for reporting this!

Antoine

[12548.630943] Unable to handle kernel paging request at virtual address 97ffd6fdd28000e8
[12548.638897] Mem abort info:
[12548.641703]   ESR = 0x96000004
[12548.644775]   Exception class = DABT (current EL), IL = 32 bits
[12548.650720]   SET = 0, FnV = 0
[12548.653795]   EA = 0, S1PTW = 0
[12548.656952] Data abort info:
[12548.659846]   ISV = 0, ISS = 0x00000004
[12548.663700]   CM = 0, WnR = 0
[12548.666684] [97ffd6fdd28000e8] address between user and kernel address ranges
[12548.673855] Internal error: Oops: 96000004 [#1] SMP
[12548.678757] Modules linked in: ax88179_178a usbnet ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
ebtable_brox
[12548.749992]  xhci_plat_hcd ahci_platform [last unloaded: usbnet]
[12548.756034] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.15.0-0.rc7.git0.1.fc28.aarch64 #1
[12548.764249] Hardware name: Marvell Armada 8040 MacchiatoBin/Armada 8040 MacchiatoBin, BIOS EDK II Oct  2 2017
[12548.774210] pstate: 40400005 (nZcv daif +PAN -UAO)
[12548.779033] pc : consume_skb+0x1c/0xd8
[12548.782802] lr : __dev_kfree_skb_any+0x58/0x68
[12548.787264] sp : ffff00000801bc30
[12548.790594] x29: ffff00000801bc30 x28: ffff831bed412a40
[12548.795934] x27: ffff831bf7ce8000 x26: 0000000000000001
[12548.801273] x25: ffff27e28d746120 x24: ffff831bed412948
[12548.806612] x23: 0000000000000018 x22: ffff27e28d746120
[12548.811950] x21: 0000000000000007 x20: 0000000000000001
[12548.817289] x19: 97ffd6fdd2800004 x18: 0000000000000010
[12548.822627] x17: 0000000000000000 x16: ffff27e28d5bb4a0
[12548.827966] x15: ffffffffffffffff x14: 737365636f727020
[12548.833305] x13: 785420666f20646e x12: 6520656874206e6f
[12548.838643] x11: ffff27e28e07b448 x10: ffff27e28d35eb00
[12548.843981] x9 : 2074656e72656874 x8 : 0000000000000005
[12548.849319] x7 : 00000000b26f0000 x6 : 00000000b66f0000
[12548.854658] x5 : 0000000000000001 x4 : 0000000000000000
[12548.859995] x3 : 0000000000000001 x2 : 97ffd6fdd2800004
[12548.865333] x1 : 0000000000000001 x0 : ffff27e28d5bb4f8
[12548.870673] Process swapper/3 (pid: 0, stack limit = 0x0000000071feb006)
[12548.877404] Call trace:
[12548.879863]  consume_skb+0x1c/0xd8
[12548.883281]  __dev_kfree_skb_any+0x58/0x68
[12548.887411]  mvpp2_txq_bufs_free.isra.53+0xd0/0x118 [mvpp2]
[12548.893017]  mvpp2_txq_done.isra.68+0xb0/0xf8 [mvpp2]
[12548.898100]  mvpp2_tx_done+0xb4/0x118 [mvpp2]
[12548.902484]  mvpp2_poll+0x5c4/0x658 [mvpp2]
[12548.906688]  net_rx_action+0x160/0x3f8
[12548.910456]  __do_softirq+0x138/0x344
[12548.914137]  irq_exit+0xd0/0x100
[12548.917381]  __handle_domain_irq+0x6c/0xc0
[12548.921497]  gic_handle_irq+0x60/0xb0
[12548.925175]  el1_irq+0xd8/0x180
[12548.928331]  arch_cpu_idle+0x30/0x188
[12548.932011]  do_idle+0x138/0x1f8
[12548.935255]  cpu_startup_entry+0x2c/0x30
[12548.939197]  secondary_start_kernel+0x11c/0x130
[12548.943750] Code: aa0003f3 aa1e03e0 d503201f b4000153 (b940e660)
[12548.949876] ---[ end trace c9cfd11479961f0c ]---
[12548.954515] Kernel panic - not syncing: Fatal exception in interrupt
[12548.960900] SMP: stopping secondary CPUs
[12548.964845] Kernel Offset: 0x27e284d50000 from 0xffff000008000000
[12548.970967] CPU features: 0x002000
[12548.974384] Memory Limit: none

It's interesting that the wrong CPU messages are still appearing despite the
irqbalance change from MarkZ. I disabled irqbalance and tried starting it in
single queue mode, and it did the same thing.
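
As a side note on the oops itself, the numbers in the log can be cross-checked with a few lines of Python. This is only a sketch for reading the trace, not a diagnosis: the two constants are copied from the log above, while VA_BITS = 48 and the classify() helper are assumptions made for illustration (the kernel addresses in the dump all start with ffff, which is consistent with a 48-bit layout, but the log does not state it).

```python
# Values copied verbatim from the oops above.
FAULT_ADDR = 0x97ffd6fdd28000e8  # "Unable to handle kernel paging request at ..."
X19 = 0x97ffd6fdd2800004         # register x19 in the dump

# Assumption: 48-bit virtual addresses (not stated in the log).
VA_BITS = 48


def classify(addr, va_bits=VA_BITS):
    """Hypothetical helper: classify a 64-bit address by its upper bits.

    With va_bits of addressing, user addresses have the upper bits all
    zero and kernel addresses have them all one; anything else falls in
    the hole between the two ranges.
    """
    top = addr >> va_bits
    all_ones = (1 << (64 - va_bits)) - 1
    if top == 0:
        return "user"
    if top == all_ones:
        return "kernel"
    return "invalid"


# Distance between the faulting address and the pointer held in x19.
offset = FAULT_ADDR - X19

print(hex(offset))                     # offset of the faulting access from x19
print(classify(FAULT_ADDR))            # where the faulting address falls
print(classify(0xffff831bed412a40))    # x28 from the dump, for comparison
```

Under those assumptions the faulting address is x19 + 0xe4 and lies in neither the user nor the kernel range, which matches the "address between user and kernel address ranges" line: the pointer in x19 was already bad when consume_skb() dereferenced it.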


