RE: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

Bryan.Whitehead Tue, 20 Nov 2018 13:40:17 -0800

> -----Original Message-----
> From: Andrew Lunn <[email protected]>
> Sent: Tuesday, November 20, 2018 2:31 PM
> To: Bryan Whitehead - C21958 <[email protected]>
> Cc: [email protected]; [email protected]; UNGLinuxDriver
> <[email protected]>
> Subject: Re: [PATCH v1 net] lan743x: fix return value for
> lan743x_tx_napi_poll
> 
> On Tue, Nov 20, 2018 at 01:26:43PM -0500, Bryan Whitehead wrote:
> > It has been noticed that under stress the lan743x driver will
> > sometimes hang or cause a kernel panic. It has been noticed that
> > returning '0' instead of 'weight' fixes this issue.
> >
> > fixes: rare kernel panic under heavy traffic load.
> > Signed-off-by: Bryan Whitehead <[email protected]>
> 
> Hi Bryan
> 
> This sounds like a band aid over something which is broken, not a real fix.
> 
> Can you show us the stack trace from the panic?
> 
>     Andrew


Andrew,

Admittedly, my knowledge of what the kernel is doing behind the scenes is 
limited.

But according to documentation found on 
https://wiki.linuxfoundation.org/networking/napi

It states the following
"The poll() function may also process TX completions, in which case if it 
processes
the entire TX ring then it should count that work as the rest of the budget.
Otherwise, TX completions are not counted."

So based on that, the original driver was returning the full budget. But I was 
having
Issues with it. And the above documentation seems to suggest that I could 
return 0
As in "not counted" from above.

I tried it, and my lock up issues disappeared.

Regarding the kernel panic stack trace. So far its very hard to replicate that 
on the 
latest kernel. I've seen it more frequently when back porting to older kernels 
such
as 4.14, and 4.9. This same fix caused those kernel panics to disappear.
Are you interested in seeing a stack dump from older kernels?

In the latest kernel the issue manifests as a kernel message which states
"[  945.021101] enp48s0: Budget exhausted after napi rescheduled"

I'm not sure what that means. But it does not lock up immediately after seeing 
that
Message. But it usually locks up with in a minute of seeing that message.

And the sometimes I get the following warning
[ 1240.425020] ------------[ cut here ]------------
[ 1240.426014] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
[ 1240.430027] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
dev_watchdog+0x1ef/0x200
[ 1240.430027] Modules linked in: lan743x
[ 1240.430027] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I       4.19.2 
#1
[ 1240.430027] Hardware name: Hewlett-Packard HP Compaq dc7900 Convertible 
Minitower/3032h, BIOS 786G1 v01.16 03/05/2009
[ 1240.430027] RIP: 0010:dev_watchdog+0x1ef/0x200
[ 1240.430027] Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 68 30 b3 00 01 e8 25 
3d fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 98 92 48 ab e8 f1 28 87 ff <0f> 0b eb 
c0 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 08 00
[ 1240.430027] RSP: 0018:ffff98490be03e90 EFLAGS: 00010282
[ 1240.430027] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1240.497168] RDX: 0000000000040400 RSI: 00000000000000f6 RDI: 0000000000000300
[ 1240.497168] RBP: ffff984908574440 R08: 0000000000000000 R09: 00000000000003a4
[ 1240.497168] R10: 0000000000000020 R11: ffffffffabc928ed R12: ffff984908574000
[ 1240.497168] R13: 0000000000000000 R14: 0000000000000000 R15: ffff98490be195b0
[ 1240.497168] FS:  0000000000000000(0000) GS:ffff98490be00000(0000) 
knlGS:0000000000000000
[ 1240.497168] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1240.497168] CR2: 00007f31cd4c0000 CR3: 0000000109bca000 CR4: 00000000000406f0
[ 1240.497168] Call Trace:
[ 1240.497168]  <IRQ>
[ 1240.497168]  ? qdisc_reset+0xe0/0xe0
[ 1240.497168]  call_timer_fn+0x26/0x130
[ 1240.497168]  run_timer_softirq+0x1cd/0x400
[ 1240.497168]  ? hpet_interrupt_handler+0x10/0x30
[ 1240.497168]  __do_softirq+0xed/0x2aa
[ 1240.497168]  irq_exit+0xb7/0xc0
[ 1240.497168]  do_IRQ+0x45/0xd0
[ 1240.497168]  common_interrupt+0xf/0xf
[ 1240.497168]  </IRQ>
[ 1240.497168] RIP: 0010:cpuidle_enter_state+0xa6/0x330
[ 1240.497168] Code: 65 8b 3d 1d b0 4d 55 e8 58 6a 95 ff 48 89 c3 66 66 66 66 
90 31 ff e8 59 73 95 ff 80 7c 24 0b 00 0f 85 25 02 00 00 fb 4c 29 eb <48> ba cf 
f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
[ 1240.497168] RSP: 0018:ffffffffab603e60 EFLAGS: 00000216 ORIG_RAX: 
ffffffffffffffde
[ 1240.497168] RAX: ffff98490be20a80 RBX: 000000000081035c RCX: 00000120cf178c49
[ 1240.497168] RDX: 00000120cf178ca0 RSI: 00000120cf178ca0 RDI: 0000000000000000
[ 1240.497168] RBP: ffff984908fbd000 R08: fffffffb58ea5f9e R09: 000001208e0b48df
[ 1240.497168] R10: 00000000000018c4 R11: 0000000000002468 R12: 0000000000000002
[ 1240.497168] R13: 00000120ce968944 R14: ffffffffab6a68a0 R15: ffffffffab611740
[ 1240.497168]  do_idle+0x1da/0x230
[ 1240.497168]  cpu_startup_entry+0x6a/0x70
[ 1240.497168]  start_kernel+0x4a2/0x4c2
[ 1240.497168]  secondary_startup_64+0xa4/0xb0
[ 1240.497168] ---[ end trace c6f3be34c214db4e ]---

Notice the warning is referring to a different adapter. So I suspect that 
whatever happened it froze
All network adapters.

If you have suggestions let me know.

Or if you would like to see the kernel panics from older kernels let me know.

Regards,
Bryan

RE: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

Reply via email to