> -----Original Message-----
> From: discuss [mailto:discuss-boun...@openvswitch.org] On Behalf Of Patrik
> Andersson R
> Sent: Friday, February 5, 2016 12:27 PM
> To: Daniele Di Proietto; Iezzi, Federico; Ben Pfaff
> Cc: discuss@openvswitch.org
> Subject: Re: [ovs-discuss] dpdk watchdog stuck?
>
> Hi,
>
> I applied the patch suggested:
>
> http://openvswitch.org/pipermail/dev/2016-January/065073.html
>
> While I can appreciate the issue it addresses, it did not help with our
> problem.
>
> What happens in our case is the dpdk_watchdog thread is blocked
> from going into quiescent state by the dpdk_mutex. This mutex is held
> in the destroy_device call from a vhost_thread, effectively giving us a
> kind of deadlock.
>
> The ovsrcu_synchronize() will then wait indefinitely for the blocked
> thread to quiesce.
>
> One solution that we came up with that resolved the
> dead-lock is as shown below.
>
> I'm wondering if we missed any important aspect of the rcu solution,
> since we identified that the dpdk_mutex should not be locked over the
> rcu_synchronize() call.
>
>
> @@ -1759,21 +1759,23 @@ destroy_device(volatile struct virtio_net *dev)
>              dev->flags &= ~VIRTIO_DEV_RUNNING;
>              ovsrcu_set(&vhost_dev->virtio_dev, NULL);
> +        }
> +    }
> +
> +    ovs_mutex_unlock(&dpdk_mutex);
>
>              /*
>               * Wait for other threads to quiesce before
>               * setting the virtio_dev to NULL.
>               */
>              ovsrcu_synchronize();
>              /*
>               * As call to ovsrcu_synchronize() will end the quiescent state,
>               * put thread back into quiescent state before returning.
>               */
>              ovsrcu_quiesce_start();
> -        }
> -    }
> -
> -    ovs_mutex_unlock(&dpdk_mutex);
>
>      VLOG_INFO("vHost Device '%s' (%ld) has been removed",
>                dev->ifname, dev->device_fh);
>  }
>
>
> Any thoughts on this will be appreciated.

Hi Patrik, that seems reasonable to me. synchronize()/quiesce_start()
would not need to be called if we don't set a virtio_dev to NULL, and as
an aside we don't need to keep looking through the list once we've found
the device.

I've posted a modified version of your code snippet, let me know what you
think?

http://openvswitch.org/pipermail/dev/2016-February/065740.html

Kevin.

>
> Regards,
>
> Patrik
>
>
> On 01/27/2016 08:26 AM, Patrik Andersson R wrote:
> > Hi,
> >
> > thank you for the link to the patch. Will try that out when I get a
> > chance.
> >
> > I don't yet have a back-trace for the instance when the tracing
> > indicates the "main" thread, it does not happen that often.
> >
> > For the "watchdog3" issue though, we seem to be waiting to acquire a
> > mutex:
> >
> > #0 (LWP 8377) "dpdk_watchdog3" in __lll_lock_wait ()
> > #1 in _L_lock_909
> > #2 in __GI___pthread_mutex_lock (mutex=0x955680)
> > ...
> >
> > The thread that currently holds the mutex is
> >
> > Thread (LWP 8378) "vhost_thread2" in poll ()
> >
> >
> > Mutex data: p *(pthread_mutex_t*)0x955680
> >
> >     __lock = 2,
> >     __count = 0,
> >     __owner = 8378,
> >     __nusers = 1,
> >     __kind = 2,
> >     __spins = 0,
> >     __elision = 0,
> >     ...
> >
> >
> > Any ideas on this will be appreciated.
> >
> > Regards,
> >
> > Patrik
> >
> > On 01/27/2016 06:18 AM, Daniele Di Proietto wrote:
> >> Hi,
> >>
> >> Ben, turns out I was wrong, this appears to be a genuine bug in
> >> dpif-netdev.
> >>
> >> I sent a fix that I believe might be related to the bug observed
> >> here:
> >>
> >> http://openvswitch.org/pipermail/dev/2016-January/065073.html
> >>
> >> Otherwise it would be interesting to get a backtrace of the main thread
> >> from gdb to investigate further.
> >>
> >> Thanks,
> >>
> >> Daniele
> >>
> >> On 26/01/2016 00:23, "Iezzi, Federico" <federico.ie...@hpe.com> wrote:
> >>
> >>> Hi there,
> >>>
> >>> I have the same issue with OVS 2.4 (latest commit in the branch 2.4) and
> >>> DPDK 2.0.0 in Debian 8 environment.
> >>> After a while it just stuck.
> >>>
> >>> Regards,
> >>> Federico
> >>>
> >>> -----Original Message-----
> >>> From: discuss [mailto:discuss-boun...@openvswitch.org] On Behalf Of Ben Pfaff
> >>> Sent: Tuesday, January 26, 2016 7:13 AM
> >>> To: Daniele di Proietto <diproiet...@vmware.com>
> >>> Cc: discuss@openvswitch.org
> >>> Subject: Re: [ovs-discuss] dpdk watchdog stuck?
> >>>
> >>> Daniele, I think that you said in our meeting today that there was some
> >>> sort of bug that falsely blames a thread. Can you explain further?
> >>>
> >>> On Mon, Jan 25, 2016 at 09:29:52PM +0100, Patrik Andersson R wrote:
> >>>> Right, that is likely for sure. Will look there first.
> >>>>
> >>>> What do you think of the case where the thread is "main". I've got
> >>>> examples of this one as well. Have not been able to figure out so far
> >>>> what would cause this.
> >>>>
> >>>> ...
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T01:47:19.026Z|00016|ovs_rcu(urcu2)|WARN|blocked 32768000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T10:53:27.026Z|00017|ovs_rcu(urcu2)|WARN|blocked 65536000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T05:05:43.026Z|00018|ovs_rcu(urcu2)|WARN|blocked 131072000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:40.826Z|00001|ovs_rcu(urcu1)|WARN|blocked 1092 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:41.805Z|00002|ovs_rcu(urcu1)|WARN|blocked 2072 ms waiting for main to quiesce
> >>>> ...
> >>>>
> >>>> Could it be in connection with a deletion of a netdev port?
> >>>>
> >>>> Regards,
> >>>>
> >>>> Patrik
> >>>>
> >>>>
> >>>> On 01/25/2016 07:50 PM, Ben Pfaff wrote:
> >>>>> On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
> >>>>>> during robustness testing, where VM:s are booted and deleted using
> >>>>>> nova boot/delete in rather rapid succession, VMs get stuck in
> >>>>>> spawning state after a few test cycles. Presumably this is due to
> >>>>>> the OVS not responding to port additions and deletions anymore, or
> >>>>>> rather that responses to these requests become painfully slow. Other
> >>>>>> requests towards the vswitchd fail to complete in any reasonable
> >>>>>> time frame as well, ovs-appctl vlog/set is one example.
> >>>>>>
> >>>>>> The only conclusion I can draw at the moment is that some thread
> >>>>>> (I've observed main and dpdk_watchdog3) is blocking the
> >>>>>> ovsrcu_synchronize() operation for "infinite" time and there is no
> >>>>>> fall-back to get out of this. To recover, the minimum operation
> >>>>>> seems to be a service restart of the openvswitch-switch service
> >>>>>> but that seems to cause other issues longer term.
> >>>>>>
> >>>>>> In the vswitch log when this happens the following can be observed:
> >>>>>>
> >>>>>> 2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms waiting for dpdk_watchdog3 to quiesce
> >>>>> This looks like a bug somewhere in the DPDK code.
> >>>>> The watchdog code is really simple:
> >>>>>
> >>>>>     static void *
> >>>>>     dpdk_watchdog(void *dummy OVS_UNUSED)
> >>>>>     {
> >>>>>         struct netdev_dpdk *dev;
> >>>>>
> >>>>>         pthread_detach(pthread_self());
> >>>>>
> >>>>>         for (;;) {
> >>>>>             ovs_mutex_lock(&dpdk_mutex);
> >>>>>             LIST_FOR_EACH (dev, list_node, &dpdk_list) {
> >>>>>                 ovs_mutex_lock(&dev->mutex);
> >>>>>                 check_link_status(dev);
> >>>>>                 ovs_mutex_unlock(&dev->mutex);
> >>>>>             }
> >>>>>             ovs_mutex_unlock(&dpdk_mutex);
> >>>>>             xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
> >>>>>         }
> >>>>>
> >>>>>         return NULL;
> >>>>>     }
> >>>>>
> >>>>> Although it looks at first glance like it doesn't quiesce, xsleep()
> >>>>> does that internally, so I guess check_link_status() must be hanging.
> >>> _______________________________________________
> >>> discuss mailing list
> >>> discuss@openvswitch.org
> >>> http://openvswitch.org/mailman/listinfo/discuss
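
The lock ordering discussed in this thread can be sketched outside of OVS
with plain pthreads. What follows is a minimal, hypothetical model, not the
OVS or DPDK code: list_mutex stands in for dpdk_mutex, toy_watchdog() for
dpdk_watchdog(), and toy_synchronize() is an assumed, much simplified
substitute for ovsrcu_synchronize() that only waits for the watchdog to pass
one quiescent point. toy_destroy_device() stops searching once the device is
found, drops the mutex, and only then waits; it skips the wait when nothing
was cleared. Every name below is invented for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define N_DEVS 4

/* Hypothetical stand-ins for the device list and dpdk_mutex. */
struct toy_dev { int id; atomic_bool attached; };
static struct toy_dev devs[N_DEVS] = {
    {10, true}, {11, true}, {12, true}, {13, true}
};
static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Toy grace-period tracking: bumped each time the watchdog quiesces. */
static pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static unsigned long q_count;
static atomic_bool stop_watchdog;

/* Models the watchdog loop: scan under the mutex, then quiesce and sleep. */
void *toy_watchdog(void *arg)
{
    const struct timespec nap = {0, 1000000};   /* 1 ms */
    (void) arg;
    while (!atomic_load(&stop_watchdog)) {
        pthread_mutex_lock(&list_mutex);
        for (size_t i = 0; i < N_DEVS; i++) {
            (void) atomic_load(&devs[i].attached);  /* placeholder check */
        }
        pthread_mutex_unlock(&list_mutex);

        /* Quiescent point: no list state is referenced here. */
        pthread_mutex_lock(&q_mutex);
        q_count++;
        pthread_cond_broadcast(&q_cond);
        pthread_mutex_unlock(&q_mutex);
        nanosleep(&nap, NULL);
    }
    return NULL;
}

/* Waits until the watchdog passes a quiescent point after this call began.
 * Must only be called while the watchdog thread is running. */
void toy_synchronize(void)
{
    pthread_mutex_lock(&q_mutex);
    unsigned long seen = q_count;
    while (q_count == seen) {
        pthread_cond_wait(&q_cond, &q_mutex);
    }
    pthread_mutex_unlock(&q_mutex);
}

/* Returns true if a device with this id was found and detached. */
bool toy_destroy_device(int id)
{
    bool found = false;

    pthread_mutex_lock(&list_mutex);
    for (size_t i = 0; i < N_DEVS; i++) {
        if (devs[i].id == id) {
            atomic_store(&devs[i].attached, false);
            found = true;
            break;                  /* no need to keep scanning */
        }
    }
    pthread_mutex_unlock(&list_mutex);  /* released BEFORE waiting */

    if (found) {
        toy_synchronize();          /* only needed if something was cleared */
    }
    return found;
}

/* Runs the model once; returns true if it behaved as expected. */
bool toy_demo(void)
{
    pthread_t tid;

    atomic_store(&stop_watchdog, false);
    if (pthread_create(&tid, NULL, toy_watchdog, NULL) != 0) {
        return false;
    }
    bool removed = toy_destroy_device(11);  /* present: detach, then wait */
    bool missing = toy_destroy_device(99);  /* absent: no wait at all */
    atomic_store(&stop_watchdog, true);
    pthread_join(tid, NULL);
    return removed && !missing
           && !atomic_load(&devs[1].attached)
           && atomic_load(&devs[0].attached);
}
```

Build with -pthread and call toy_demo() from main(). Moving the
toy_synchronize() call above the unlock in toy_destroy_device() can
reproduce the hang described in this thread: the watchdog blocks on
list_mutex and never reaches its quiescent point, so the wait never ends.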