Yes. Judging from the following trace, DRBD and OVS block each other and end up 
in a deadlock.
It seems someone sent a patch years ago to refine the genl global lock to 
per-family granularity. So might changing DRBD's or OVS's genl family to a 
different one fix this problem?

Thanks,
Tianpeng

======== Set primary ========
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525214] INFO: task ovs-vswitchd:5283 blocked for more than 120 seconds.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525243] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525254] ovs-vswitchd  D 00000000     0  5283   5282 0x00000004
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525261]  e7fc7cb0 00000282 010003ff 00000000 c01d46c0 00000019 ed98f000 00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525269]  00000000 00000000 00000012 eda6b754 eda6b644 eda6b5b0 eda6b754 c16ca200
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525275]  00000000 5680f6a6 0000041d ed88a740 00000019 00067257 00000000 c01d4790
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525282] Call Trace:
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525295]  [<c01d46c0>] ? __pollwait+0x0/0xd0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525301]  [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525306]  [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525315]  [<c03d418c>] __mutex_lock_slowpath+0x10c/0x160
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525320]  [<c03d3fe5>] mutex_lock+0x25/0x40
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525328]  [<c036d875>] genl_rcv+0x15/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525333]  [<c036ba81>] netlink_unicast+0x241/0x250
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525341]  [<c0349acc>] ? memcpy_fromiovec+0x4c/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525346]  [<c036c771>] netlink_sendmsg+0x1c1/0x280
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525351]  [<c033ffd7>] sock_sendmsg+0xd7/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525358]  [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525362]  [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525368]  [<c01d4790>] ? pollwake+0x0/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525374]  [<c0261871>] ? copy_from_user+0x41/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525379]  [<c0349df6>] ? verify_iovec+0x36/0xa0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525384]  [<c0340116>] sys_sendmsg+0x116/0x230
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525388]  [<c0340c07>] ? sys_recvmsg+0xf7/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525396]  [<c01c43d9>] ? do_sync_read+0xd9/0x110
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525401]  [<c033f0d4>] ? sock_poll+0x14/0x20
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525408]  [<c01f540a>] ? ep_send_events_proc+0x5a/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525413]  [<c01f58ac>] ? ep_scan_ready_list+0xfc/0x150
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525418]  [<c03413a7>] sys_socketcall+0x247/0x270
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525424]  [<c0104571>] syscall_call+0x7/0xb
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525450] INFO: task drbdsetup:28552 blocked for more than 120 seconds.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525457] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525466] drbdsetup     D 00000001     0 28552      1 0x00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525471]  edc15c54 00000286 edc15bd8 00000001 00000003 ee1bcd08 ee1bcd04 00000000
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525478]  00000000 567d0166 0000041d ee1feb44 ee1fea34 ee1fe9a0 ee1feb44 c16ca200
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525485]  00000000 567ce256 0000041d ede1dac0 00000000 00000008 ee825198 ee1bcc00
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525491] Call Trace:
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525514]  [<f0c5091d>] ? _req_st_cond+0xed/0x130 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525531]  [<f0c53a1b>] drbd_req_state+0x14b/0x310 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525536]  [<c01444a9>] ? complete_signal+0xd9/0x1b0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525541]  [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525558]  [<f0c53c03>] _drbd_request_state+0x23/0xb0 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525563]  [<c0145435>] ? force_sig_info+0xa5/0xc0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525581]  [<f0c4a038>] drbd_set_role+0x58/0x780 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525598]  [<f0c543a3>] ? drbd_nla_parse_nested+0x43/0x50 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525615]  [<f0c4ab96>] drbd_adm_set_role+0xa6/0xc0 [drbd]
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525621]  [<c036ebc3>] genl_rcv_msg+0x183/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525627]  [<c036ea40>] ? genl_rcv_msg+0x0/0x1c0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525632]  [<c036bced>] netlink_rcv_skb+0x7d/0xa0
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525637]  [<c036d881>] genl_rcv+0x21/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525642]  [<c036ba81>] netlink_unicast+0x241/0x250
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525647]  [<c0349acc>] ? memcpy_fromiovec+0x4c/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525652]  [<c036c771>] netlink_sendmsg+0x1c1/0x280
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525657]  [<c033f63b>] sock_aio_write+0xeb/0x100
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525663]  [<c01b2043>] ? page_add_file_rmap+0x23/0x30
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525669]  [<c01c42c9>] do_sync_write+0xd9/0x110
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525674]  [<c014e6b0>] ? autoremove_wake_function+0x0/0x50
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525680]  [<c01c4b68>] vfs_write+0x178/0x180
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525685]  [<c01c5192>] sys_write+0x42/0x70
Mar 27 13:24:29 drbd-jason-1 kernel: [ 4681.525689]  [<c0104571>] syscall_call+0x7/0xb
Mar 27 13:26:29 drbd-jason-1 kernel: [ 4801.524087] INFO: task ovs-vswitchd:5283 blocked for more than 120 seconds
Mar 27 13:26:29 drbd-jason-1 kernel: [ 4801.524102] "echo 0 > /proc/sys/kernel/hung_task

From: Jesse Gross
Date: 2013-03-30 11:57
To: tianpeng0826
CC: dev
Subject: Re: [ovs-dev] ovs-vswitchd hang in sendmsg
On Fri, Mar 29, 2013 at 8:38 PM, Tianpeng Zhang (Gmail)
<tianpeng0...@gmail.com> wrote:
> Hi All,
>
> I ran into an issue when running DRBD on XenServer with ovs-1.7.1. DRBD works
> fine when creating and syncing data. But when I try to take a DRBD resource
> down, ovs-vswitchd hangs for about 20 minutes, and then all network
> connections break.
>
> I added some debug tracing, and ovs-vswitchd ultimately stops at sendmsg() on
> a netlink message. The call path is:
> bridge_run_fast()->ofproto_run_fast()->run_fast()->handle_upcalls()->handle_miss_upcalls()->dpif_operate()->dpif_linux_operate__()->nl_sock_transact_multiple()->nl_sock_transact_multiple__()->sendmsg()
>
> vswitchd stops here because sendmsg() does not return.
>     memset(&msg, 0, sizeof msg);
>     msg.msg_iov = iovs;
>     msg.msg_iovlen = n;
>     do {
>         error = sendmsg(sock->fd, &msg, 0) < 0 ? errno : 0;
>     } while (error == EINTR);
>
> Several people have reported similar issues on the Xen and DRBD mailing
> lists, but the only solution offered was to stop OVS and use the Linux bridge.
> I suspect the cause may be that, before DRBD stops a resource, it does some
> cleanup on its netlink socket, and this conflicts with OVS's handling?

It looks like DRBD is also using genetlink for communication with
userspace.  There's a global lock so I suspect that DRBD is holding it
for a long time, which is blocking OVS.  There could also be a deadlock if
there is another shared lock that is taken in a different order but
this seems somewhat less likely since there isn't a lot in common
between DRBD and OVS.