On Tue, Aug 26, 2014 at 11:43 PM, Joe Stringer <joestrin...@nicira.com> wrote: > I've been having a bit of trouble with ovs after this patch was applied, > ovs-vswitchd locks up on Debian 7.6 (VBox VM). Kernel is 3.2.0-4-amd64 #1 > SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux. > > I have a fairly typical phy bridge + int bridge setup, with dhclient running > on br-int: > > # ovs-vsctl show > a9624cfa-2ae8-42ee-9275-3f5b022d09a5 > Bridge "breth0" > Port "breth0+" > Interface "breth0+" > type: patch > options: {peer="breth0-"} > Port "eth0" > Interface "eth0" > Port "breth0" > Interface "breth0" > type: internal > Bridge br-int > Port br-int > Interface br-int > type: internal > Port "breth0-" > Interface "breth0-" > type: patch > options: {peer="breth0+"} > ovs_version: "2.3.90" > > From one terminal, I send a bunch of traffic to a neighbour: > # ovs-appctl upcall/disable-megaflows && hping3 --flood 10.0.2.2 > > From another terminal, I observe the upcall output: > # watch -n 0.2 'ovs-appctl upcall/show' > > After 10-20 seconds, the output in the second terminal freezes. Around 2-5 > minutes later, dmesg starts complaining about blocked threads: > > [9957.804271] openvswitch: Open vSwitch switching datapath 2.3.90, built Aug > 27 2014 14:15:14 > [10088.874770] device ovs-system entered promiscuous mode > [10088.883911] openvswitch: netlink: Key attribute has unexpected length > (type=21, length=8, expected=4). > [10088.913387] device br-int entered promiscuous mode > [10088.917780] device eth0 entered promiscuous mode > [10088.946122] device breth0 entered promiscuous mode > [10117.536276] ADDRCONF(NETDEV_UP): eth0: link is not ready > [10117.576174] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow > Control: RX > [10117.576174] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready > [10127.808269] eth0: no IPv6 routers present > [10131.104185] br-int: no IPv6 routers present > [10133.976474] breth0: no IPv6 routers present > [10321.389028] INFO: task ovs-vswitchd:7073 blocked for more than 120 > seconds. > [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [10321.389028] ovs-vswitchd D ffff88011fc13780 0 7073 1 > 0x00000000 > [10321.389028] ffff8800c398e800 0000000000000082 0000000300000246 > ffff8801178610c0 > [10321.389028] 0000000000013780 ffff8800ca8bffd8 ffff8800ca8bffd8 > ffff8800c398e800 > [10321.389028] ffff8800c398e800 0000000000000000 00000000ffffffff > ffffffff8165c0f0 > [10321.389028] Call Trace: > [10321.389028] [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164 > [10321.389028] [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d > [10321.389028] [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a > [10321.389028] [<ffffffff812b1178>] ? genl_rcv+0xe/0x28 > [10321.389028] [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e > [10321.389028] [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac > [10321.389028] [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde > [10321.389028] [<ffffffff8127fef0>] ? sock_recvmsg+0xcd/0xec > [10321.389028] [<ffffffff8110cdbf>] ? file_update_time+0x30/0x105 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff812894a8>] ? copy_from_user+0x18/0x30 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9 > [10321.389028] [<ffffffff81037f8b>] ? __wake_up+0x35/0x46 > [10321.389028] [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf > [10321.389028] [<ffffffff812afe3d>] ? netlink_insert+0x109/0x130 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff8128171e>] ? sys_getsockname+0x8e/0xb7 > [10321.389028] [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b > [10321.389028] [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58 > [10321.389028] [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b > [10321.389028] INFO: task handler10:7112 blocked for more than 120 seconds. > [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [10321.389028] handler10 D ffff8800d2c59800 0 7112 1 > 0x00000000 > [10321.389028] ffff8800d2c59800 0000000000000082 0000000000000000 > ffff880118292800 > [10321.389028] 0000000000013780 ffff8800d0f73fd8 ffff8800d0f73fd8 > ffff8800d2c59800 > [10321.389028] ffffffff8103661f 0000000000000010 0000000000000246 > ffffffff8165c0f0 > [10321.389028] Call Trace: > [10321.389028] [<ffffffff8103661f>] ? need_resched+0x14/0x18 > [10321.389028] [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164 > [10321.389028] [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d > [10321.389028] [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a > [10321.389028] [<ffffffff812b1178>] ? genl_rcv+0xe/0x28 > [10321.389028] [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e > [10321.389028] [<ffffffff812b1192>] ? genl_rcv+0x28/0x28 > [10321.389028] [<ffffffff812b0e0f>] ? netlink_rcv_skb+0x53/0x7a > [10321.389028] [<ffffffff812b1189>] ? genl_rcv+0x1f/0x28 > [10321.389028] [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e > [10321.389028] [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac > [10321.389028] [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff810be188>] ? update_page_reclaim_stat+0x17/0x44 > [10321.389028] [<ffffffff812833dc>] ? sock_kmalloc+0x41/0x63 > [10321.389028] [<ffffffff812833dc>] ? sock_kmalloc+0x41/0x63 > [10321.389028] [<ffffffff810ecd8f>] ? __kmalloc+0x100/0x112 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff812894a8>] ? copy_from_user+0x18/0x30 > [10321.389028] [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9 > [10321.389028] [<ffffffff810ceb49>] ? pte_offset_kernel+0x16/0x35 > [10321.389028] [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345 > [10321.389028] [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf > [10321.389028] [<ffffffff8112a6db>] ? ep_scan_ready_list.isra.7+0x13b/0x15b > [10321.389028] [<ffffffff8100d750>] ? __switch_to+0x1e5/0x258 > [10321.389028] [<ffffffff8112b2b6>] ? sys_epoll_wait+0x29f/0x308 > [10321.389028] [<ffffffff81039a92>] ? finish_task_switch+0x4e/0xb9 > [10321.389028] [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b > [10321.389028] [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58 > [10321.389028] [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b > [10321.389028] INFO: task handler12:7113 blocked for more than 120 seconds. > [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [10321.389028] handler12 D ffff88011fc93780 0 7113 1 > 0x00000000 > [10321.389028] ffff8800c3a1d080 0000000000000082 dead000000000000 > ffff88011aeb0180 > [10321.389028] 0000000000013780 ffff8800dba07fd8 ffff8800dba07fd8 > ffff8800c3a1d080 > [10321.389028] ffff8800ca807b98 00000001ca807b48 0000000000000041 > ffffffff8165c0f0 > [10321.389028] Call Trace: > [10321.389028] [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164 > [10321.389028] [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d > [10321.389028] [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a > [10321.389028] [<ffffffff812b1178>] ? genl_rcv+0xe/0x28 > [10321.389028] [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e > [10321.389028] [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac > [10321.389028] [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde > [10321.389028] [<ffffffff81108b29>] ? __pollwait+0xce/0xce > [10321.389028] [<ffffffff8107116d>] ? arch_local_irq_save+0x11/0x17 > [10321.389028] [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf > [10321.389028] [<ffffffff8112a7c8>] ? ep_poll_callback+0xbf/0xdb > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff812894a8>] ? copy_from_user+0x18/0x30 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9 > [10321.389028] [<ffffffff810ceb49>] ? pte_offset_kernel+0x16/0x35 > [10321.389028] [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c > [10321.389028] [<ffffffff8128171e>] ? sys_getsockname+0x8e/0xb7 > [10321.389028] [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b > [10321.389028] [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58 > [10321.389028] [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b > [10321.389028] INFO: task revalidator13:7114 blocked for more than 120 > seconds. > [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [10321.389028] revalidator13 D ffff88011fd13780 0 7114 1 > 0x00000000 > [10321.389028] ffff8800cabf2040 0000000000000082 0000000000000000 > ffff88011aec4780 > [10321.389028] 0000000000013780 ffff8800db83ffd8 ffff8800db83ffd8 > ffff8800cabf2040 > [10321.389028] ffffffff8165c0f0 00000001ffffff10 ffffffff81040480 > ffffffff8165c0f0 > [10321.389028] Call Trace: > [10321.389028] [<ffffffff81040480>] ? mutex_spin_on_owner+0x24/0x39 > [10321.389028] [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164 > [10321.389028] [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d > [10321.389028] [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a > [10321.389028] [<ffffffff812b1178>] ? genl_rcv+0xe/0x28 > [10321.389028] [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e > [10321.389028] [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac > [10321.389028] [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde > [10321.389028] [<ffffffff8103f6d2>] ? try_to_wake_up+0x187/0x197 > [10321.389028] [<ffffffff8106d63a>] ? wake_futex+0x5d/0x69 > [10321.389028] [<ffffffff8106d5ce>] ? put_task_struct+0xd/0x1c > [10321.389028] [<ffffffff8106e397>] ? futex_wake+0xe9/0xfb > [10321.389028] [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b > [10321.389028] [<ffffffff812818f8>] ? sys_sendto+0xf7/0x137 > [10321.389028] [<ffffffff8106fcef>] ? sys_futex+0x120/0x151 > [10321.389028] [<ffffffff81066311>] ? timekeeping_get_ns+0xd/0x2a > [10321.389028] [<ffffffff810667ff>] ? ktime_get_ts+0x5c/0x82 > [10321.389028] [<ffffffff81036628>] ? should_resched+0x5/0x23 > [10321.389028] [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b > [10327.256206] br-int: no IPv6 routers present > > These repeat every 120 seconds, and I can no longer kill ovs-vswitchd or > unload the ovs kmod. > > Reverting this patch seems to fix the issue. I tried some basic things (like > shifting the sf_acts dereference within the disable_bh() section), but I > can't figure out what's wrong. Perhaps some odd interaction with this > particular kernel? > This is weird. I could not reproduce this on newer kernel. But I found related bug. Can you try attached patch? It may be related. I will submit formal patch later.
--Pravin. _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev