On Tue, Aug 26, 2014 at 11:43 PM, Joe Stringer <joestrin...@nicira.com> wrote:
> I've been having a bit of trouble with ovs after this patch was applied,
> ovs-vswitchd locks up on Debian 7.6 (VBox VM). Kernel is 3.2.0-4-amd64 #1
> SMP Debian 3.2.60-1+deb7u3 x86_64 GNU/Linux.
>
> I have a fairly typical phy bridge + int bridge setup, with dhclient running
> on br-int:
>
> # ovs-vsctl show
> a9624cfa-2ae8-42ee-9275-3f5b022d09a5
>     Bridge "breth0"
>         Port "breth0+"
>             Interface "breth0+"
>                 type: patch
>                 options: {peer="breth0-"}
>         Port "eth0"
>             Interface "eth0"
>         Port "breth0"
>             Interface "breth0"
>                 type: internal
>     Bridge br-int
>         Port br-int
>             Interface br-int
>                 type: internal
>         Port "breth0-"
>             Interface "breth0-"
>                 type: patch
>                 options: {peer="breth0+"}
>     ovs_version: "2.3.90"
>
> From one terminal, I send a bunch of traffic to a neighbour:
> # ovs-appctl upcall/disable-megaflows && hping3 --flood 10.0.2.2
>
> From another terminal, I observe the upcall output:
> # watch -n 0.2 'ovs-appctl upcall/show'
>
> After 10-20 seconds, the output in the second terminal freezes. Around 2-5
> minutes later, dmesg starts complaining about blocked threads:
>
> [9957.804271] openvswitch: Open vSwitch switching datapath 2.3.90, built Aug
> 27 2014 14:15:14
> [10088.874770] device ovs-system entered promiscuous mode
> [10088.883911] openvswitch: netlink: Key attribute has unexpected length
> (type=21, length=8, expected=4).
> [10088.913387] device br-int entered promiscuous mode
> [10088.917780] device eth0 entered promiscuous mode
> [10088.946122] device breth0 entered promiscuous mode
> [10117.536276] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [10117.576174] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: RX
> [10117.576174] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [10127.808269] eth0: no IPv6 routers present
> [10131.104185] br-int: no IPv6 routers present
> [10133.976474] breth0: no IPv6 routers present
> [10321.389028] INFO: task ovs-vswitchd:7073 blocked for more than 120
> seconds.
> [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [10321.389028] ovs-vswitchd    D ffff88011fc13780     0  7073      1
> 0x00000000
> [10321.389028]  ffff8800c398e800 0000000000000082 0000000300000246
> ffff8801178610c0
> [10321.389028]  0000000000013780 ffff8800ca8bffd8 ffff8800ca8bffd8
> ffff8800c398e800
> [10321.389028]  ffff8800c398e800 0000000000000000 00000000ffffffff
> ffffffff8165c0f0
> [10321.389028] Call Trace:
> [10321.389028]  [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164
> [10321.389028]  [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d
> [10321.389028]  [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a
> [10321.389028]  [<ffffffff812b1178>] ? genl_rcv+0xe/0x28
> [10321.389028]  [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e
> [10321.389028]  [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac
> [10321.389028]  [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde
> [10321.389028]  [<ffffffff8127fef0>] ? sock_recvmsg+0xcd/0xec
> [10321.389028]  [<ffffffff8110cdbf>] ? file_update_time+0x30/0x105
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff812894a8>] ? copy_from_user+0x18/0x30
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9
> [10321.389028]  [<ffffffff81037f8b>] ? __wake_up+0x35/0x46
> [10321.389028]  [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf
> [10321.389028]  [<ffffffff812afe3d>] ? netlink_insert+0x109/0x130
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff8128171e>] ? sys_getsockname+0x8e/0xb7
> [10321.389028]  [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b
> [10321.389028]  [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58
> [10321.389028]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> [10321.389028] INFO: task handler10:7112 blocked for more than 120 seconds.
> [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [10321.389028] handler10       D ffff8800d2c59800     0  7112      1
> 0x00000000
> [10321.389028]  ffff8800d2c59800 0000000000000082 0000000000000000
> ffff880118292800
> [10321.389028]  0000000000013780 ffff8800d0f73fd8 ffff8800d0f73fd8
> ffff8800d2c59800
> [10321.389028]  ffffffff8103661f 0000000000000010 0000000000000246
> ffffffff8165c0f0
> [10321.389028] Call Trace:
> [10321.389028]  [<ffffffff8103661f>] ? need_resched+0x14/0x18
> [10321.389028]  [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164
> [10321.389028]  [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d
> [10321.389028]  [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a
> [10321.389028]  [<ffffffff812b1178>] ? genl_rcv+0xe/0x28
> [10321.389028]  [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e
> [10321.389028]  [<ffffffff812b1192>] ? genl_rcv+0x28/0x28
> [10321.389028]  [<ffffffff812b0e0f>] ? netlink_rcv_skb+0x53/0x7a
> [10321.389028]  [<ffffffff812b1189>] ? genl_rcv+0x1f/0x28
> [10321.389028]  [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e
> [10321.389028]  [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac
> [10321.389028]  [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff810be188>] ? update_page_reclaim_stat+0x17/0x44
> [10321.389028]  [<ffffffff812833dc>] ? sock_kmalloc+0x41/0x63
> [10321.389028]  [<ffffffff812833dc>] ? sock_kmalloc+0x41/0x63
> [10321.389028]  [<ffffffff810ecd8f>] ? __kmalloc+0x100/0x112
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff812894a8>] ? copy_from_user+0x18/0x30
> [10321.389028]  [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9
> [10321.389028]  [<ffffffff810ceb49>] ? pte_offset_kernel+0x16/0x35
> [10321.389028]  [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345
> [10321.389028]  [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf
> [10321.389028]  [<ffffffff8112a6db>] ? ep_scan_ready_list.isra.7+0x13b/0x15b
> [10321.389028]  [<ffffffff8100d750>] ? __switch_to+0x1e5/0x258
> [10321.389028]  [<ffffffff8112b2b6>] ? sys_epoll_wait+0x29f/0x308
> [10321.389028]  [<ffffffff81039a92>] ? finish_task_switch+0x4e/0xb9
> [10321.389028]  [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b
> [10321.389028]  [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58
> [10321.389028]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> [10321.389028] INFO: task handler12:7113 blocked for more than 120 seconds.
> [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [10321.389028] handler12       D ffff88011fc93780     0  7113      1
> 0x00000000
> [10321.389028]  ffff8800c3a1d080 0000000000000082 dead000000000000
> ffff88011aeb0180
> [10321.389028]  0000000000013780 ffff8800dba07fd8 ffff8800dba07fd8
> ffff8800c3a1d080
> [10321.389028]  ffff8800ca807b98 00000001ca807b48 0000000000000041
> ffffffff8165c0f0
> [10321.389028] Call Trace:
> [10321.389028]  [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164
> [10321.389028]  [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d
> [10321.389028]  [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a
> [10321.389028]  [<ffffffff812b1178>] ? genl_rcv+0xe/0x28
> [10321.389028]  [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e
> [10321.389028]  [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac
> [10321.389028]  [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde
> [10321.389028]  [<ffffffff81108b29>] ? __pollwait+0xce/0xce
> [10321.389028]  [<ffffffff8107116d>] ? arch_local_irq_save+0x11/0x17
> [10321.389028]  [<ffffffff8135049f>] ? _raw_spin_unlock_irqrestore+0xe/0xf
> [10321.389028]  [<ffffffff8112a7c8>] ? ep_poll_callback+0xbf/0xdb
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff812894a8>] ? copy_from_user+0x18/0x30
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff812802c7>] ? ___sys_sendmsg+0x209/0x2a9
> [10321.389028]  [<ffffffff810ceb49>] ? pte_offset_kernel+0x16/0x35
> [10321.389028]  [<ffffffff813533d8>] ? do_page_fault+0x30a/0x345
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff8134f144>] ? _cond_resched+0x7/0x1c
> [10321.389028]  [<ffffffff8128171e>] ? sys_getsockname+0x8e/0xb7
> [10321.389028]  [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b
> [10321.389028]  [<ffffffff81281c37>] ? __sys_sendmsg+0x39/0x58
> [10321.389028]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> [10321.389028] INFO: task revalidator13:7114 blocked for more than 120
> seconds.
> [10321.389028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [10321.389028] revalidator13   D ffff88011fd13780     0  7114      1
> 0x00000000
> [10321.389028]  ffff8800cabf2040 0000000000000082 0000000000000000
> ffff88011aec4780
> [10321.389028]  0000000000013780 ffff8800db83ffd8 ffff8800db83ffd8
> ffff8800cabf2040
> [10321.389028]  ffffffff8165c0f0 00000001ffffff10 ffffffff81040480
> ffffffff8165c0f0
> [10321.389028] Call Trace:
> [10321.389028]  [<ffffffff81040480>] ? mutex_spin_on_owner+0x24/0x39
> [10321.389028]  [<ffffffff8134fa44>] ? __mutex_lock_common.isra.5+0xff/0x164
> [10321.389028]  [<ffffffff8134f932>] ? mutex_lock+0x1a/0x2d
> [10321.389028]  [<ffffffff812af40c>] ? netlink_trim+0x13/0x7a
> [10321.389028]  [<ffffffff812b1178>] ? genl_rcv+0xe/0x28
> [10321.389028]  [<ffffffff812b0929>] ? netlink_unicast+0xe6/0x14e
> [10321.389028]  [<ffffffff812b0c05>] ? netlink_sendmsg+0x274/0x2ac
> [10321.389028]  [<ffffffff81280011>] ? sock_sendmsg+0xc1/0xde
> [10321.389028]  [<ffffffff8103f6d2>] ? try_to_wake_up+0x187/0x197
> [10321.389028]  [<ffffffff8106d63a>] ? wake_futex+0x5d/0x69
> [10321.389028]  [<ffffffff8106d5ce>] ? put_task_struct+0xd/0x1c
> [10321.389028]  [<ffffffff8106e397>] ? futex_wake+0xe9/0xfb
> [10321.389028]  [<ffffffff810fc0b3>] ? fget_light+0x67/0x7b
> [10321.389028]  [<ffffffff812818f8>] ? sys_sendto+0xf7/0x137
> [10321.389028]  [<ffffffff8106fcef>] ? sys_futex+0x120/0x151
> [10321.389028]  [<ffffffff81066311>] ? timekeeping_get_ns+0xd/0x2a
> [10321.389028]  [<ffffffff810667ff>] ? ktime_get_ts+0x5c/0x82
> [10321.389028]  [<ffffffff81036628>] ? should_resched+0x5/0x23
> [10321.389028]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
> [10327.256206] br-int: no IPv6 routers present
>
> These repeat every 120 seconds, and I can no longer kill ovs-vswitchd or
> unload the ovs kmod.
>
> Reverting this patch seems to fix the issue. I tried some basic things (like
> shifting the sf_acts dereference within the disable_bh() section), but I
> can't figure out what's wrong. Perhaps some odd interaction with this
> particular kernel?
>
This is weird. I could not reproduce this on newer kernel. But I found
related bug. Can you try attached patch? It may be related.
I will submit formal patch later.

--Pravin.
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev

Reply via email to