On Wed, Aug 10, 2016 at 09:13:14AM +0800, Ding Tianhong wrote: > On 2016/6/16 22:19, Paul E. McKenney wrote: > > On Thu, Jun 16, 2016 at 02:09:47PM +0800, Ding Tianhong wrote: > >> On 2016/6/15 23:49, Paul E. McKenney wrote: > >>> On Wed, Jun 15, 2016 at 03:27:36PM +0800, Ding Tianhong wrote: > >>>> I met this problem when using the Testgine to send package to ixgbevf nic > >>>> by this steps: > >>>> 1. Connect to ixgbevf, and set the speed to 10Gb/s, it could work fine. > >>>> 2. Then use ifconfig to down the nic and up again, loop for several > >>>> times. > >>>> 3. The system panic by soft lockup. > >>> > >>> Good catch, queued for review and testing. But what .config was your > >>> kernel built with? > >>> > >> > >> I use the redhat7.1 defconfig to build my kernel, and the RCU config is > >> this: > >> 120 # > >> 121 # RCU Subsystem > >> 122 # > >> 123 CONFIG_TREE_RCU=y > >> 124 # CONFIG_PREEMPT_RCU is not set > >> 125 CONFIG_RCU_STALL_COMMON=y > >> 126 CONFIG_CONTEXT_TRACKING=y > >> 127 CONFIG_RCU_USER_QS=y > >> 128 # CONFIG_CONTEXT_TRACKING_FORCE is not set > >> 129 CONFIG_RCU_FANOUT=64 > >> 130 CONFIG_RCU_FANOUT_LEAF=16 > >> 131 # CONFIG_RCU_FANOUT_EXACT is not set > >> 132 # CONFIG_RCU_FAST_NO_HZ is not set > >> 133 # CONFIG_TREE_RCU_TRACE is not set > >> 134 CONFIG_RCU_NOCB_CPU=y > >> 135 CONFIG_RCU_NOCB_CPU_ALL=y > >> 136 CONFIG_BUILD_BIN2C=y > > > > Thank you! You were running with preemption disabled, so your system > > would indeed be very susceptible to this problem. > > > >>> Also, I did tweak both the commit log and the patch. Your cond_resched() > >>> would prevent soft lockups, but not RCU stalls, so I substituted > >>> cond_resched_rcu_qs(). Please let me know if either of those changes > >>> causes problems at your end. > >> > >> Looks fine to me, I will apply this to my branch and test it, thanks. > > > > Please let me know how it goes! > > > > Thanx, Paul > > > > Hi Paul: > > It has been a long time after applying this patch, and didn't found any > problem, I believe this patch is fine, thanks.
Very good! I will push this one upstream during the next merge window. Thanx, Paul > Ding > > >> Ding > >> > >>> > >>> Thanx, Paul > >>> > >>> ------------------------------------------------------------------------ > >>> > >>> commit c317cf19b34c0d2787b787c38bd2c8fe433215da > >>> Author: Ding Tianhong <dingtianh...@huawei.com> > >>> Date: Wed Jun 15 15:27:36 2016 +0800 > >>> > >>> rcu: Fix soft lockup for rcu_nocb_kthread > >>> > >>> Carrying out the following steps results in a softlockup in the > >>> RCU callback-offload (rcuo) kthreads: > >>> > >>> 1. Connect to ixgbevf, and set the speed to 10Gb/s. > >>> 2. Use ifconfig to bring the nic up and down repeatedly. > >>> > >>> [ 317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready > >>> [ 368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15] > >>> [ 368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > >>> [ 368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: > >>> ffff88057dd9c000 > >>> [ 368.106005] RIP: 0010:[<ffffffff81579e04>] [<ffffffff81579e04>] > >>> fib_table_lookup+0x14/0x390 > >>> [ 368.106005] RSP: 0018:ffff88061fc83ce8 EFLAGS: 00000286 > >>> [ 368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: > >>> 0000000000000001 > >>> [ 368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: > >>> ffff880036d11a00 > >>> [ 368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: > >>> 0000000000000000 > >>> [ 368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: > >>> ffff88061fc83c58 > >>> [ 368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: > >>> 00000000020155c0 > >>> [ 368.106005] FS: 0000000000000000(0000) GS:ffff88061fc80000(0000) > >>> knlGS:0000000000000000 > >>> [ 368.106005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>> [ 368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: > >>> 00000000000407e0 > >>> [ 368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > >>> 0000000000000000 > >>> [ 368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: > >>> 0000000000000400 > >>> [ 368.106005] Stack: > >>> [ 368.106005] 00000000010000c0 ffff88057b766000 ffff8802e380b000 > >>> ffff88057af03e00 > >>> [ 368.106005] ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 > >>> ffffffff814ee146 > >>> [ 368.106005] ffff8802e380af00 00000000e380af00 ffffffff819e0900 > >>> 020155c0010000c0 > >>> [ 368.106005] Call Trace: > >>> [ 368.106005] <IRQ> > >>> [ 368.106005] > >>> [ 368.106005] [<ffffffff815349a6>] ip_route_input_noref+0x516/0xbd0 > >>> [ 368.106005] [<ffffffff814ee146>] ? skb_release_data+0xd6/0x110 > >>> [ 368.106005] [<ffffffff814ee20a>] ? kfree_skb+0x3a/0xa0 > >>> [ 368.106005] [<ffffffff8153698f>] ip_rcv_finish+0x29f/0x350 > >>> [ 368.106005] [<ffffffff81537034>] ip_rcv+0x234/0x380 > >>> [ 368.106005] [<ffffffff814fd656>] > >>> __netif_receive_skb_core+0x676/0x870 > >>> [ 368.106005] [<ffffffff814fd868>] __netif_receive_skb+0x18/0x60 > >>> [ 368.106005] [<ffffffff814fe4de>] process_backlog+0xae/0x180 > >>> [ 368.106005] [<ffffffff814fdcb2>] net_rx_action+0x152/0x240 > >>> [ 368.106005] [<ffffffff81077b3f>] __do_softirq+0xef/0x280 > >>> [ 368.106005] [<ffffffff8161619c>] call_softirq+0x1c/0x30 > >>> [ 368.106005] <EOI> > >>> [ 368.106005] > >>> [ 368.106005] [<ffffffff81015d95>] do_softirq+0x65/0xa0 > >>> [ 368.106005] [<ffffffff81077174>] local_bh_enable+0x94/0xa0 > >>> [ 368.106005] [<ffffffff81114922>] rcu_nocb_kthread+0x232/0x370 > >>> [ 368.106005] [<ffffffff81098250>] ? wake_up_bit+0x30/0x30 > >>> [ 368.106005] [<ffffffff811146f0>] ? rcu_start_gp+0x40/0x40 > >>> [ 368.106005] [<ffffffff8109728f>] kthread+0xcf/0xe0 > >>> [ 368.106005] [<ffffffff810971c0>] ? > >>> kthread_create_on_node+0x140/0x140 > >>> [ 368.106005] [<ffffffff816147d8>] ret_from_fork+0x58/0x90 > >>> [ 368.106005] [<ffffffff810971c0>] ? > >>> kthread_create_on_node+0x140/0x140 > >>> > >>> ==================================cut > >>> here============================== > >>> > >>> It turns out that the rcuos callback-offload kthread is busy > >>> processing > >>> a very large quantity of RCU callbacks, and it is not reliquishing the > >>> CPU while doing so. This commit therefore adds an > >>> cond_resched_rcu_qs() > >>> within the loop to allow other tasks to run. > >>> > >>> Signed-off-by: Ding Tianhong <dingtianh...@huawei.com> > >>> [ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ] > >>> Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com> > >>> > >>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h > >>> index 0082fce402a0..85c5a883c6e3 100644 > >>> --- a/kernel/rcu/tree_plugin.h > >>> +++ b/kernel/rcu/tree_plugin.h > >>> @@ -2173,6 +2173,7 @@ static int rcu_nocb_kthread(void *arg) > >>> cl++; > >>> c++; > >>> local_bh_enable(); > >>> + cond_resched_rcu_qs(); > >>> list = next; > >>> } > >>> trace_rcu_batch_end(rdp->rsp->name, c, !!list, 0, 0, 1); > >>> > >>> > >>> . > >>> > >> > >> > > > > > > . > > > >