On Wed, Oct 25, 2017 at 7:07 PM, Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: > > On Thu, Sep 28, 2017 at 04:36:58PM -0700, Yuchung Cheng wrote: > > On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko > > <oleksa...@natalenko.name> wrote: > > > Hi. > > > > > > Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it > > > intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* > > > fix > > > warning in tcp_fastretrans_alert() for me. > > > > Hi Oleksandr: no retrans_collapse should not matter for that warning > > in tcp_fstretrans_alert(). the warning as I explained earlier is > > likely false. Neal and I are more concerned the panic in > > tcp_sacktag_walk. This is just a blind shot but thx for retrying. > > > > We can submit a one-liner to remove the fast retrans warning but want > > to nail the bigger issue first. > > we're still seeing the warnings followed by crashes and it's very concerning. > We hoped that most recent Neal's patches from Sep 18 around this area may > magically fix the issue, but no. The panics are still there. > It's confirmed that net.ipv4.tcp_retrans_collapse=0 does not help > whereas net.ipv4.tcp_recovery=0 works, but obviously undesirable. > We're out of ideas on how to debug this. Can you try Eric's latest SACK rb-tree patches? https://patchwork.ozlabs.org/cover/822218/
Roman's SNMP data suggests MTU probing is enabled. Another blind shot is to disable it. > > Warning + panic: > [ 7678.309043] Call Trace: > [ 7678.313925] <IRQ> > [ 7678.317955] dump_stack+0x4d/0x70 > [ 7678.324573] __warn+0xd3/0xf0 > [ 7678.330497] warn_slowpath_null+0x1e/0x20 > [ 7678.338510] tcp_fastretrans_alert+0xacf/0xbd0 > [ 7678.347380] tcp_ack+0xbce/0x1390 > [ 7678.354006] tcp_rcv_established+0x1ce/0x740 > [ 7678.362536] tcp_v6_do_rcv+0x15b/0x400 > [ 7678.370025] tcp_v6_rcv+0x94c/0x9f0 > [ 7678.376989] ? tcp_rcv_established+0x42b/0x740 > [ 7678.385866] ip6_input_finish+0xea/0x430 > [ 7678.393702] ip6_input+0x32/0xa0 > [ 7678.400155] ? ip6_rcv_finish+0xa0/0xa0 > [ 7678.407834] ip6_rcv_finish+0x4b/0xa0 > [ 7678.415164] ipv6_rcv+0x2ec/0x4f0 > [ 7678.421778] ? ip6_make_skb+0x1c0/0x1c0 > [ 7678.429447] __netif_receive_skb_core+0x2d5/0x9a0 > [ 7678.438845] ? tcp6_gro_receive+0x11d/0x1c0 > [ 7678.447211] ? ipv6_gro_receive+0x1d0/0x380 > [ 7678.455566] __netif_receive_skb+0x16/0x70 > [ 7678.463748] netif_receive_skb_internal+0x32/0x3b0 > [ 7678.473318] ? dev_gro_receive+0x27c/0x470 > [ 7678.481502] napi_gro_receive+0x113/0x1d0 > [ 7678.489510] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0 > [ 7678.499077] mlx5e_poll_rx_cq+0x79/0x830 > [ 7678.506911] mlx5e_napi_poll+0x8c/0x3d0 > [ 7678.514579] ? mlx5_cq_completion+0x54/0xb0 > [ 7678.522950] net_rx_action+0x22e/0x380 > [ 7678.530450] __do_softirq+0x106/0x2e8 > [ 7678.537768] irq_exit+0xb0/0xc0 > [ 7678.544042] do_IRQ+0x4f/0xd0 > [ 7678.549970] common_interrupt+0x86/0x86 > ... > [ 7685.690697] BUG: unable to handle kernel > [ 7685.698737] NULL pointer dereference > [ 7685.705889] at 0000000000000035 > [ 7685.712357] IP: tcp_sacktag_walk+0xfd/0x460 > [ 7685.720726] PGD 1002d75067 > [ 7685.726320] PUD 102c506067 > [ 7685.731902] PMD 0 > ... > [ 7686.070541] task: ffff880ff5959b00 task.stack: ffffc90024708000 > [ 7686.082363] RIP: 0010:tcp_sacktag_walk+0xfd/0x460 > [ 7686.091754] RSP: 0000:ffff88203eec38b0 EFLAGS: 00010217 > [ 7686.102188] RAX: ffff881953620a00 RBX: ffff881009273200 RCX: > 00000000fbfa50ec > [ 7686.116439] RDX: 00000000fbfa54a0 RSI: 0000000000000000 RDI: > ffff881009273200 > [ 7686.130688] RBP: ffff88203eec3908 R08: 0000000000000048 R09: > 00000000fbfa55fb > [ 7686.144936] R10: 0000000000000000 R11: 0000000000000000 R12: > ffff881ec9ff98c0 > [ 7686.159184] R13: ffff88203eec3a10 R14: 0000000000000000 R15: > ffff881ec9ff9a18 > [ 7686.173431] FS: 00007f2b235ff700(0000) GS:ffff88203eec0000(0000) > knlGS:0000000000000000 > [ 7686.189589] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 7686.201062] CR2: 0000000000000035 CR3: 0000000fd8229000 CR4: > 00000000003406e0 > [ 7686.215310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 7686.229560] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 7686.243807] Call Trace: > [ 7686.248696] <IRQ> > [ 7686.252727] tcp_sacktag_write_queue+0x5ff/0x9e0 > [ 7686.261943] tcp_ack+0x677/0x1390 > [ 7686.268569] ? select_task_rq_fair+0x9e3/0xf70 > [ 7686.277437] ? tcp_write_xmit+0x230/0x10a0 > [ 7686.285619] tcp_rcv_established+0x1ce/0x740 > [ 7686.294151] tcp_v6_do_rcv+0x15b/0x400 > [ 7686.301636] tcp_v6_rcv+0x94c/0x9f0 > [ 7686.308606] ? default_wake_function+0x12/0x20 > [ 7686.317483] ? __wake_up_common+0x53/0x90 > [ 7686.325492] ip6_input_finish+0xea/0x430 > [ 7686.333329] ip6_input+0x32/0xa0 > [ 7686.339791] ? ip6_rcv_finish+0xa0/0xa0 > [ 7686.347451] ip6_rcv_finish+0x4b/0xa0 > [ 7686.354767] ipv6_rcv+0x2ec/0x4f0 > [ 7686.361389] ? ip6_make_skb+0x1c0/0x1c0 > [ 7686.369050] __netif_receive_skb_core+0x2d5/0x9a0 > [ 7686.378449] ? tcp6_gro_receive+0x11d/0x1c0 > [ 7686.386797] ? ipv6_gro_receive+0x1d0/0x380 > [ 7686.395160] __netif_receive_skb+0x16/0x70 > [ 7686.403342] netif_receive_skb_internal+0x32/0x3b0 > [ 7686.412931] ? dev_gro_receive+0x27c/0x470 > [ 7686.421110] napi_gro_receive+0x113/0x1d0 > [ 7686.429136] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0 > [ 7686.438722] mlx5e_poll_rx_cq+0x79/0x830 > [ 7686.446572] mlx5e_napi_poll+0x8c/0x3d0 > [ 7686.454233] ? mlx5_cq_completion+0x54/0xb0 > [ 7686.462591] net_rx_action+0x22e/0x380 > [ 7686.470099] __do_softirq+0x106/0x2e8 > [ 7686.477431] irq_exit+0xb0/0xc0 > [ 7686.483700] do_IRQ+0x4f/0xd0 > [ 7686.489629] common_interrupt+0x86/0x86 > > Sometimes soft lockup is also seen: > > [2676627.345885] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! > [swapper/0:0] > [2676627.717771] RIP: 0010:queued_spin_lock_slowpath+0x15c/0x180 > [2676627.729239] RSP: 0018:ffff88103f603e78 EFLAGS: 00000202 > [2676627.740019] ORIG_RAX: ffffffffffffff10 > [2676627.748027] RAX: 0000000000000101 RBX: ffff8811181d8840 RCX: > 0000000000000101 > [2676627.762622] RDX: 0000000000000101 RSI: 0000000000000001 RDI: > ffff8811181d88c8 > [2676627.777217] RBP: ffff88103f603e88 R08: 0000000000000001 R09: > ffff88103f95a800 > [2676627.791811] R10: 0000000000037a13 R11: 000000001a37bb20 R12: > 0000000000000100 > [2676627.806405] R13: ffff8811181d8c18 R14: ffffffff81857270 R15: > ffff8811181d8840 > [2676627.821000] FS: 0000000000000000(0000) GS:ffff88103f600000(0000) > knlGS:0000000000000000 > [2676627.837500] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [2676627.849318] CR2: 00007f6e44c1f000 CR3: 0000000001e09000 CR4: > 00000000003406f0 > [2676627.863916] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [2676627.878508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [2676627.893103] Call Trace: > [2676627.898338] <IRQ> > [2676627.902711] ? _raw_spin_lock+0x21/0x30 > [2676627.910715] tcp_write_timer+0x19/0x70 > [2676627.918548] call_timer_fn+0x35/0x150 > [2676627.926214] run_timer_softirq+0x1be/0x3e0 > [2676627.934742] ? timerqueue_add+0x59/0xb0 > [2676627.942745] ? ktime_get+0x3c/0xb0 > [2676627.949889] __do_softirq+0x106/0x2e8 > [2676627.957551] irq_exit+0xb0/0xc0 > [2676627.964174] smp_apic_timer_interrupt+0x68/0x90 > [2676627.973563] apic_timer_interrupt+0x86/0x90 > [2676628.084861] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 21s! > [swapper/11:0] > [2676628.084896] RIP: 0010:tcp_match_skb_to_sack+0x2d/0xb0 > [2676628.084897] RSP: 0018:ffff88103f8c38a8 EFLAGS: 00000297 > [2676628.084897] ORIG_RAX: ffffffffffffff10 > [2676628.084898] RAX: 0000000000000000 RBX: ffff88010f9dbb18 RCX: > 000000002d5da350 > [2676628.084899] RDX: 000000002d5da141 RSI: ffff88010f9dbb18 RDI: > ffff8811181d8840 > [2676628.084899] RBP: ffff88103f8c3908 R08: 00000000ffffffff R09: > 00000000d2a25ebe > [2676628.084900] R10: 000000000000e323 R11: ffff8811181d8998 R12: > ffff8811181d8840 > [2676628.084900] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: > ffff8811181d8998 > [2676628.084901] FS: 0000000000000000(0000) GS:ffff88103f8c0000(0000) > knlGS:0000000000000000 > [2676628.084902] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [2676628.084903] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: > 00000000003406e0 > [2676628.084903] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [2676628.084904] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [2676628.084904] Call Trace: > [2676628.084905] <IRQ> > [2676628.084907] ? tcp_sacktag_walk+0x270/0x460 > [2676628.084909] tcp_sacktag_write_queue+0x5ff/0x9e0 > [2676628.084911] tcp_ack+0x677/0x1390 > [2676628.084913] ? tcp_write_xmit+0x230/0x10a0 > [2676628.084915] tcp_rcv_established+0x1ce/0x740 > [2676628.084919] tcp_v6_do_rcv+0x15b/0x400 > [2676628.084921] tcp_v6_rcv+0x94c/0x9f0 > [2676628.084924] ? __dev_queue_xmit+0x251/0x650 > [2676628.084928] ip6_input_finish+0xea/0x430 > [2676628.084930] ip6_input+0x32/0xa0 > [2676628.084932] ? ip6_rcv_finish+0xa0/0xa0 > [2676628.084933] ip6_rcv_finish+0x4b/0xa0 > [2676628.084935] ipv6_rcv+0x2ec/0x4f0 > [2676628.084937] ? ip6_make_skb+0x1c0/0x1c0 > [2676628.084938] __netif_receive_skb_core+0x2d5/0x9a0 > [2676628.084941] ? tcp6_gro_receive+0x11d/0x1c0 > [2676628.084943] ? ipv6_gro_receive+0x1d0/0x380 > [2676628.084944] __netif_receive_skb+0x16/0x70 > [2676628.084945] netif_receive_skb_internal+0x32/0x3b0 > [2676628.084947] ? dev_gro_receive+0x27c/0x470 > [2676628.084948] napi_gro_receive+0x113/0x1d0 > [2676628.084952] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0 > [2676628.084954] mlx5e_poll_rx_cq+0x79/0x830 > [2676628.084956] mlx5e_napi_poll+0x8c/0x3d0 > [2676628.084959] ? mlx5_cq_completion+0x54/0xb0 > [2676628.084960] net_rx_action+0x22e/0x380 > [2676628.084962] __do_softirq+0x106/0x2e8 > [2676628.084965] irq_exit+0xb0/0xc0 > [2676628.084967] do_IRQ+0x4f/0xd0 > [2676656.083964] NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! > [swapper/11:0] > [2676656.083992] RIP: 0010:tcp_sacktag_walk+0x25f/0x460 > [2676656.083993] RSP: 0018:ffff88103f8c38b0 EFLAGS: 00000207 > [2676656.083993] ORIG_RAX: ffffffffffffff10 > [2676656.083994] RAX: 0000000000021500 RBX: ffff8810d84c3600 RCX: > 000000002d5d9e41 > [2676656.083994] RDX: 00000000f15de229 RSI: ffff8810d84c0200 RDI: > ffff8811181d8840 > [2676656.083995] RBP: ffff88103f8c3908 R08: 000000000000ff21 R09: > 0000000000000000 > [2676656.083996] R10: 000000000000e323 R11: ffff8811181d8998 R12: > ffff8811181d8840 > [2676656.083996] R13: ffff88103f8c3a10 R14: 0000000000000000 R15: > ffff8811181d8998 > [2676656.083997] FS: 0000000000000000(0000) GS:ffff88103f8c0000(0000) > knlGS:0000000000000000 > [2676656.083998] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [2676656.083998] CR2: 00007fbc07410000 CR3: 0000000001e09000 CR4: > 00000000003406e0 > [2676656.083999] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [2676656.083999] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [2676656.084000] Call Trace: > [2676656.084001] <IRQ> > [2676656.084003] tcp_sacktag_write_queue+0x5ff/0x9e0 > [2676656.084005] tcp_ack+0x677/0x1390 > [2676656.084007] ? tcp_write_xmit+0x230/0x10a0 > [2676656.084009] tcp_rcv_established+0x1ce/0x740 > [2676656.084013] tcp_v6_do_rcv+0x15b/0x400 > [2676656.084015] tcp_v6_rcv+0x94c/0x9f0 > [2676656.084018] ? __dev_queue_xmit+0x251/0x650 > [2676656.084022] ip6_input_finish+0xea/0x430 > [2676656.084024] ip6_input+0x32/0xa0 > [2676656.084026] ? ip6_rcv_finish+0xa0/0xa0 > [2676656.084027] ip6_rcv_finish+0x4b/0xa0 > [2676656.084029] ipv6_rcv+0x2ec/0x4f0 > [2676656.084031] ? ip6_make_skb+0x1c0/0x1c0 > [2676656.084032] __netif_receive_skb_core+0x2d5/0x9a0 > [2676656.084035] ? tcp6_gro_receive+0x11d/0x1c0 > [2676656.084037] ? ipv6_gro_receive+0x1d0/0x380 > [2676656.084038] __netif_receive_skb+0x16/0x70 > [2676656.084039] netif_receive_skb_internal+0x32/0x3b0 > [2676656.084041] ? dev_gro_receive+0x27c/0x470 > [2676656.084042] napi_gro_receive+0x113/0x1d0 > [2676656.084046] mlx5e_handle_rx_cqe_mpwrq+0x5b6/0x8d0 > [2676656.084048] mlx5e_poll_rx_cq+0x79/0x830 > [2676656.084050] mlx5e_napi_poll+0x8c/0x3d0 > > Any help greatly appreciated. >