Yes. The following trace shows that DRBD and OVS block each other and end up
in a deadlock.
It seems someone sent a patch years ago to refine the genl global lock to
per-family granularity. Would moving DRBD or OVS onto a separately locked genl
family fix this problem?
Thanks,
Tianpeng
======== Set primary ========
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525214] INFO: task ovs-vswitchd:5283 blocked for more than 120 seconds.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525243] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525254] ovs-vswitchd D 00000000 0 5283 5282 0x00000004
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525261] e7fc7cb0 00000282 010003ff 00000000 c01d46c0 00000019 ed98f000 00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525269] 00000000 00000000 00000012 eda6b754 eda6b644 eda6b5b0 eda6b754 c16ca200
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525275] 00000000 5680f6a6 0000041d ed88a740 00000019 00067257 00000000 c01d4790
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525282] Call Trace:
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525295] [<c01d46c0>] ? __pollwait+0x0/0xd0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525301] [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525306] [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525315] [<c03d418c>] __mutex_lock_slowpath+0x10c/0x160
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525320] [<c03d3fe5>] mutex_lock+0x25/0x40
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525328] [<c036d875>] genl_rcv+0x15/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525333] [<c036ba81>] netlink_unicast+0x241/0x250
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525341] [<c0349acc>] ? memcpy_fromiovec+0x4c/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525346] [<c036c771>] netlink_sendmsg+0x1c1/0x280
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525351] [<c033ffd7>] sock_sendmsg+0xd7/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525358] [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525362] [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525368] [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525374] [<c0261871>] ? copy_from_user+0x41/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525379] [<c0349df6>] ? verify_iovec+0x36/0xa0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525384] [<c0340116>] sys_sendmsg+0x116/0x230
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525388] [<c0340c07>] ? sys_recvmsg+0xf7/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525396] [<c01c43d9>] ? do_sync_read+0xd9/0x110
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525401] [<c033f0d4>] ? sock_poll+0x14/0x20
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525408] [<c01f540a>] ? ep_send_events_proc+0x5a/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525413] [<c01f58ac>] ? ep_scan_ready_list+0xfc/0x150
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525418] [<c03413a7>] sys_socketcall+0x247/0x270
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525424] [<c0104571>] syscall_call+0x7/0xb
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525450] INFO: task drbdsetup:28552 blocked for more than 120 seconds.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525466] drbdsetup D 00000001 0 28552 1 0x00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525471] edc15c54 00000286 edc15bd8 00000001 00000003 ee1bcd08 ee1bcd04 00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525478] 00000000 567d0166 0000041d ee1feb44 ee1fea34 ee1fe9a0 ee1feb44 c16ca200
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525485] 00000000 567ce256 0000041d ede1dac0 00000000 00000008 ee825198 ee1bcc00
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525491] Call Trace:
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525514] [<f0c5091d>] ? _req_st_cond+0xed/0x130 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525531] [<f0c53a1b>] drbd_req_state+0x14b/0x310 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525536] [<c01444a9>] ? complete_signal+0xd9/0x1b0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525541] [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525558] [<f0c53c03>] _drbd_request_state+0x23/0xb0 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525563] [<c0145435>] ? force_sig_info+0xa5/0xc0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525581] [<f0c4a038>] drbd_set_role+0x58/0x780 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525598] [<f0c543a3>] ? drbd_nla_parse_nested+0x43/0x50 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525615] [<f0c4ab96>] drbd_adm_set_role+0xa6/0xc0 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525621] [<c036ebc3>] genl_rcv_msg+0x183/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525627] [<c036ea40>] ? genl_rcv_msg+0x0/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525632] [<c036bced>] netlink_rcv_skb+0x7d/0xa0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525637] [<c036d881>] genl_rcv+0x21/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525642] [<c036ba81>] netlink_unicast+0x241/0x250
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525647] [<c0349acc>] ? memcpy_fromiovec+0x4c/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525652] [<c036c771>] netlink_sendmsg+0x1c1/0x280
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525657] [<c033f63b>] sock_aio_write+0xeb/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525663] [<c01b2043>] ? page_add_file_rmap+0x23/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525669] [<c01c42c9>] do_sync_write+0xd9/0x110
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525674] [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525680] [<c01c4b68>] vfs_write+0x178/0x180
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525685] [<c01c5192>] sys_write+0x42/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525689] [<c0104571>] syscall_call+0x7/0xb
Mar 27 13:26:29 drbd-jason-1 kernel: [ 4801.524087] INFO: task ovs-vswitchd:5283 blocked for more than 120 seconds
Mar 27 13:26:29 drbd-jason-1 kernel: [ 4801.524102] "echo 0 > /proc/sys/kernel/hung_task
From: Jesse Gross
Date: 2013-03-30 11:57
To: tianpeng0826
CC: dev
Subject: Re: [ovs-dev] ovs-vswitchd hang in sendmsg
On Fri, Mar 29, 2013 at 8:38 PM, Tianpeng Zhang (Gmail)
<tianpeng0...@gmail.com> wrote:
> Hi All,
>
> I hit an issue when running DRBD on XenServer with ovs-1.7.1. DRBD works
> fine when creating and syncing data. But when I try to bring down a DRBD
> resource, ovs-vswitchd hangs for about 20 minutes, and then all network
> connections break.
>
> I added some debug tracing; ovs-vswitchd finally stops at sendmsg() for a
> netlink message. The call path is:
> bridge_run_fast()->ofproto_run_fast()->run_fast()->handle_upcalls()->handle_miss_upcalls()->dpif_operate()->dpif_linux_operate__()->nl_sock_transact_multiple()->nl_sock_transact_multiple__()->sendmsg()
>
> vswitchd stops here because sendmsg() does not return.
>     memset(&msg, 0, sizeof msg);
>     msg.msg_iov = iovs;
>     msg.msg_iovlen = n;
>     do {
>         error = sendmsg(sock->fd, &msg, 0) < 0 ? errno : 0;
>     } while (error == EINTR);
>
>
> Several people have reported a similar issue on the Xen/DRBD mailing lists,
> but the only solution offered was to stop OVS and use the Linux bridge. I
> suspect the cause is that DRBD does some cleanup of its netlink socket before
> stopping a resource, and that this conflicts with OVS's handling?
It looks like DRBD is also using genetlink for communication with
userspace. There's a global lock, so I suspect that DRBD is holding it
for a long time, which is blocking OVS. There could also be a deadlock if
there is another shared lock that is taken in a different order, but
this seems somewhat less likely since there isn't a lot in common
between DRBD and OVS.
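
To make the locking argument concrete, here is a minimal userspace sketch in C. It is only a model: it uses pthread mutexes rather than the kernel's genl mutex, and every name in it (FAM_DRBD, FAM_OVS, lock_for, ovs_blocked_while_drbd_busy) is hypothetical and not taken from the kernel, DRBD, or OVS sources. It compares one global lock shared by all families with one lock per family, assuming a DRBD handler that stalls while holding its lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins for two generic netlink families. */
enum family { FAM_DRBD = 0, FAM_OVS = 1, FAM_COUNT };

/* Model A: one lock shared by every family. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Model B: one lock per family. */
static pthread_mutex_t family_lock[FAM_COUNT] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

/* Pick the lock a request for family f must take under each model. */
static pthread_mutex_t *lock_for(enum family f, int per_family)
{
    return per_family ? &family_lock[f] : &global_lock;
}

struct probe {
    enum family fam;
    int per_family;
    int rc;            /* result of pthread_mutex_trylock() */
};

/* Runs in a second thread: can this family's request get its lock now? */
static void *probe_thread(void *arg)
{
    struct probe *p = arg;
    pthread_mutex_t *m = lock_for(p->fam, p->per_family);

    p->rc = pthread_mutex_trylock(m);
    if (p->rc == 0)
        pthread_mutex_unlock(m);
    return NULL;
}

/*
 * Simulate a DRBD handler that is stuck while holding its lock.
 * Returns 1 if an OVS request issued in the meantime would have to wait,
 * 0 if it could proceed.
 */
int ovs_blocked_while_drbd_busy(int per_family)
{
    struct probe p = { FAM_OVS, per_family, -1 };
    pthread_mutex_t *held = lock_for(FAM_DRBD, per_family);
    pthread_t t;
    int rc;

    rc = pthread_mutex_lock(held);      /* DRBD handler enters and stalls */
    assert(rc == 0);
    rc = pthread_create(&t, NULL, probe_thread, &p);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    rc = pthread_mutex_unlock(held);    /* DRBD handler finally returns */
    assert(rc == 0);
    (void)rc;

    return p.rc == EBUSY;
}
```

Calling ovs_blocked_while_drbd_busy(0) models the global lock and reports that the OVS request has to wait; calling it with 1 models per-family locks and reports that it can proceed. The sketch says nothing about why the DRBD handler stalls in the first place, so finer-grained locking would only keep OVS from being caught behind it.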