On Sun, Feb 18, 2018 at 12:01:02PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-16 20:48, Guillaume Nault wrote:
> > On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> > > As far as i can see there is only KASAN triggered again(and server rebooted
> > > shortly after that), but nothing else:
> > >
> > Ok, so no refcount failure detected. Not what I expected... but that's
> > still an information. It's getting even harder to find a ppp scenario
> > that could lead to such symptoms.
> > If that's acceptable for you, you can try reverting the few commits
> > that entered after 4.14.
> >
> > 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom of lower device into account on xmit
> > 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex before registering device
> > e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks added
> > f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when cleanup
> > 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> > 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
> > d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t
> > 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
> >
> > Sorry, but I have nothing better to propose for now. At least that
> > should help narrowing the problem space.
> > I'm going to stress test ppp_generic and pppoe on my side.
> >
> Quick update.
> Testing 5 first patches didn't changed anything.
> But revering more, with last 4 patches also (i did all together) is changing
> things, probably i need to repeat one night more reverting just all
> refcount_t patches.
>
So you got the following trace with all 8 patches reverted, right?
I prefer to concentrate on the other traces for now.
If this one tends to be reproducible, you can try to activate lockdep
(for lack of a better suggestion).

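A minimal sketch of what activating lockdep could look like, assuming the
kernel is rebuilt from a .config file. The options below are the usual
lock-debugging ones; the exact selection is only a suggested starting
point and has not been tested on this setup:

```
# Assumed .config fragment for a lockdep-enabled debug build (illustrative).
# PROVE_LOCKING turns on the lock dependency validator itself.
CONFIG_PROVE_LOCKING=y
# Detect locks that are freed or reinitialized while still held.
CONFIG_DEBUG_LOCK_ALLOC=y
# Basic spinlock sanity checks (uninitialized or corrupted locks).
CONFIG_DEBUG_SPINLOCK=y
```

These checks add runtime overhead, so the timing of the problem may shift
once they are enabled.
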
> [25222.173840] ------------[ cut here ]------------
> [25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed out
> [25222.174618] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:323 dev_watchdog+0x44a/0x555
> [25222.175212] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
> [25222.177133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G B W 4.15.3-build-0134 #6
> [25222.184121] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
> [25222.184457] RIP: 0010:dev_watchdog+0x44a/0x555
> [25222.184791] RSP: 0018:ffff8803f22c7d98 EFLAGS: 00010292
> [25222.185127] RAX: 0000000000000000 RBX: ffff8803ded00438 RCX: 0000000000000000
> [25222.185463] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffffed007e458fa8
> [25222.185797] RBP: ffff8803ded00000 R08: 0000000000000001 R09: 0000000000000000
> [25222.186133] R10: ffff8803f22c7e30 R11: 0000000000000001 R12: ffff8803ded28450
> [25222.186471] R13: 0000000000000003 R14: dffffc0000000000 R15: ffff8803ded283c0
> [25222.186804] FS: 0000000000000000(0000) GS:ffff8803f22c0000(0000) knlGS:0000000000000000
> [25222.187401] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [25222.187739] CR2: 0000561f5bffc128 CR3: 0000000445a0d003 CR4: 00000000001606e0
> [25222.188077] Call Trace:
> [25222.188410] <IRQ>
> [25222.188740] ? dev_graft_qdisc+0xfa/0xfa
> [25222.189072] call_timer_fn+0x15/0x72
> [25222.189407] ? dev_graft_qdisc+0xfa/0xfa
> [25222.189741] expire_timers+0x1b9/0x1d5
> [25222.190072] run_timer_softirq+0x184/0x361
> [25222.190400] ? expire_timers+0x1d5/0x1d5
> [25222.190723] ? enqueue_hrtimer+0xce/0xd8
> [25222.191048] ? __hrtimer_run_queues+0x1ec/0x24d
> [25222.191373] __do_softirq+0x17f/0x34a
> [25222.191702] irq_exit+0x8f/0xf9
> [25222.192034] smp_apic_timer_interrupt+0xcb/0xd6
> [25222.192365] apic_timer_interrupt+0x92/0xa0
> [25222.192695] </IRQ>
> [25222.193023] RIP: 0010:mwait_idle+0x99/0xac
> [25222.193355] RSP: 0018:ffff8803f030fef8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
> [25222.193956] RAX: 0000000000000000 RBX: ffff8803f02e3500 RCX: 0000000000000000
> [25222.194290] RDX: 1ffff1007e05c6a0 RSI: 0000000000000000 RDI: 0000000000000000
> [25222.194626] RBP: ffff8803f02e3500 R08: ffffed007ccc8eef R09: ffff8803e6647728
> [25222.194958] R10: ffff8803f030fdd0 R11: 0000000000000001 R12: 0000000000000000
> [25222.195292] R13: dffffc0000000000 R14: ffffed007e05c6a0 R15: ffff8803f02e3500
> [25222.195627] do_idle+0xe6/0x19a
> [25222.195963] cpu_startup_entry+0x18/0x1a
> [25222.196295] secondary_startup_64+0xa5/0xb0
> [25222.196625] Code: 68 87 40 01 00 75 3f 48 89 ef c6 05 5c 87 40 01 01 e8 64 93 fa ff 44 89 e9 48 89 c2 48 89 ee 48 c7 c7 80 28 68 83 e8 25 69 6d fe <0f> ff eb 17 41 ff c5 49 81 c4 40 01 00 00 44 3b 6c 24 04 0f 85
> [25222.197511] ---[ end trace 4b04e9c6754a1cd5 ]---
>
> and then
>
> [25222.197853] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25222.198194] ixgbe 0000:04:00.1 eth1: Reset adapter
> [25227.805896] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25232.925944] ixgbe 0000:04:00.1 eth1: initiating reset due to tx timeout
> [25236.084968] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [accel-pppd:12627]
> [25236.085562] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
> [25236.087496] CPU: 0 PID: 12627 Comm: accel-pppd Tainted: G B W 4.15.3-build-0134 #6
> [25236.088095] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015
> [25236.088430] RIP: 0010:queued_spin_lock_slowpath+0xb1/0x418
> [25236.088759] RSP: 0018:ffff8803e6457a98 EFLAGS: 00000213 ORIG_RAX: ffffffffffffff11
> [25236.089353] RAX: 00000000000001fb RBX: ffff880345e75fe0 RCX: ffffffff811aeca3
> [25236.089685] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff880345e75fe0
> [25236.090026] RBP: ffffed0068bcebfc R08: 06030a0001012180 R09: ffffed006cc9beb2
> [25236.090369] R10: ffffed006cc9beb3 R11: 0000000000000001 R12: 0000000000000003
> [25236.090705] R13: 0000000000008021 R14: 0000000000008021 R15: 00000000034e4b06
> [25236.091043] FS: 00007f94bd26c700(0000) GS:ffff8803f2200000(0000) knlGS:0000000000000000
> [25236.091636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [25236.091966] CR2: 00007ffc0935eff8 CR3: 00000003d709b003 CR4: 00000000001606f0
> [25236.092304] Call Trace:
> [25236.092638] ppp_push+0x112/0xdda [ppp_generic]
> [25236.092975] ? enqueue_hrtimer+0xce/0xd8
> [25236.093304] ? hrtimer_start_range_ns+0x827/0x854
> [25236.093635] __ppp_xmit_process+0xc6a/0xdd5 [ppp_generic]
> [25236.093969] ? __kmalloc_reserve.isra.5+0x29/0x96
> [25236.094302] ? memset+0x1f/0x31
> [25236.094631] ? ppp_receive_nonmp_frame+0x138c/0x138c [ppp_generic]
> [25236.094962] ? __alloc_skb+0x2ec/0x431
> [25236.095292] ? __kmalloc_reserve.isra.5+0x96/0x96
> [25236.095620] ? timerfd_release+0x1d3/0x1d3
> [25236.095950] ppp_xmit_process+0xc3/0x194 [ppp_generic]
> [25236.096284] ppp_write+0x1b7/0x1c3 [ppp_generic]
> [25236.096617] __vfs_write+0xd9/0x4ad
> [25236.096953] ? kernel_read+0xed/0xed
> [25236.097283] ? vfs_copy_file_range+0x6a8/0x6a8
> [25236.097614] ? bit_waitqueue+0x2a/0x2a
> [25236.097946] ? __fsnotify_inode_delete+0xc/0xc
> [25236.098276] ? __fsnotify_inode_delete+0xc/0xc
> [25236.098610] ? SyS_sendmmsg+0x13/0x13
> [25236.098936] vfs_write+0x18c/0x378
> [25236.099258] SyS_write+0xc4/0x13b
> [25236.099579] ? SyS_read+0x13b/0x13b
> [25236.099902] ? exit_to_usermode_loop+0x7c/0xaf
> [25236.100225] ? SyS_read+0x13b/0x13b
> [25236.100550] do_syscall_64+0x1b1/0x31f
> [25236.100879] entry_SYSCALL_64_after_hwframe+0x21/0x86
> [25236.101210] RIP: 0033:0x7f94bca53b2d
> [25236.101536] RSP: 002b:00007f94bd26bb80 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [25236.102127] RAX: ffffffffffffffda RBX: 00007f94bb59f1e3 RCX: 00007f94bca53b2d
> [25236.102461] RDX: 000000000000000c RSI: 00007f94b78895d0 RDI: 0000000000002f92
> [25236.102793] RBP: 00007f94bd26bbb0 R08: 0000000000000030 R09: 0000000000000027
> [25236.103127] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f94b6450eb8
> [25236.103460] R13: 00007ffc8c047a6f R14: 0000000000000000 R15: 00007f94bd26c700
> [25236.103790] Code: 83 03 00 00 48 89 dd 49 89 dc 48 b8 00 00 00 00 00 fc ff df 48 c1 ed 03 41 83 e4 07 48 01 c5 41 83 c4 03 8a 45 00 41 38 c4 7c 0c <84> c0 74 08 48 89 df e8 31 54 17 00 8b 03 84 c0 74 04 f3 90 eb
>
> Then system autorebooted.
> Maybe i am hitting some qdisc bug now...