Hi Kevin,
thank you for the quick response.
I was wondering about that loop; I thought that only a single instance of the
device would exist. The patch looks fine, but I have yet to test it.
If I'm to be picky, then (a sketch of what I mean follows below):
1) the "exist = true;" assignment should go outside the mutex lock, since it
does not need the lock
2) style: I would use "if (exist) {}" rather than "if (exist == true) {}"
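Something like this is what I have in mind, only a sketch of the middle of
destroy_device() as I picture it with your 'exist' flag; I'm going from memory
on the surrounding lines, so details such as netdev_dpdk_get_virtio() may not
match your patch exactly:

    LIST_FOR_EACH (vhost_dev, list_node, &dpdk_list) {
        if (netdev_dpdk_get_virtio(vhost_dev) == dev) {
            ovs_mutex_lock(&vhost_dev->mutex);
            dev->flags &= ~VIRTIO_DEV_RUNNING;
            ovsrcu_set(&vhost_dev->virtio_dev, NULL);
            ovs_mutex_unlock(&vhost_dev->mutex);

            /* 1) a local bool needs no lock, so set it after unlocking */
            exist = true;
            break;              /* no need to keep walking the list */
        }
    }
    ovs_mutex_unlock(&dpdk_mutex);

    /* 2) plain boolean test rather than "== true" */
    if (exist) {
        /* wait for the other threads to quiesce, then re-enter the
         * quiescent state before returning */
        ovsrcu_synchronize();
        ovsrcu_quiesce_start();
    }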
Regards,
Patrik
On 02/05/2016 06:09 PM, Traynor, Kevin wrote:
-----Original Message-----
From: discuss [mailto:discuss-boun...@openvswitch.org] On Behalf Of Patrik
Andersson R
Sent: Friday, February 5, 2016 12:27 PM
To: Daniele Di Proietto; Iezzi, Federico; Ben Pfaff
Cc: discuss@openvswitch.org
Subject: Re: [ovs-discuss] dpdk watchdog stuck?
Hi,
I applied the patch suggested:
http://openvswitch.org/pipermail/dev/2016-January/065073.html
While I can appreciate the issue it addresses, it did not help with our
problem.
What happens in our case is that the dpdk_watchdog thread is blocked
from entering its quiescent state by the dpdk_mutex. This mutex is held
across the destroy_device() call from a vhost_thread, effectively giving us a
kind of deadlock: ovsrcu_synchronize() then waits indefinitely for the
blocked thread to quiesce.
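To make the cycle concrete, here is a small standalone toy program of my own
(nothing from the OVS tree; pthreads stand in for the OVS threads and a loop
counter stands in for the RCU grace period) that locks up in the same way:

/*
 * Thread B ("watchdog") must take the big lock before it can complete an
 * iteration; thread A ("vhost") takes the big lock and then waits for B to
 * complete another iteration, so neither can ever make progress.
 *
 * Build with: gcc -pthread rcu_deadlock_demo.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* plays dpdk_mutex */
static unsigned long iterations;        /* only touched with big_lock held */

static void *
watchdog_like(void *arg)                /* plays dpdk_watchdog */
{
    (void) arg;
    for (;;) {
        pthread_mutex_lock(&big_lock);  /* blocks forever once A holds the lock */
        iterations++;                   /* stands in for the per-device work */
        pthread_mutex_unlock(&big_lock);
        sleep(1);                       /* stands in for xsleep(), the quiescent point */
    }
    return NULL;
}

int
main(void)                              /* plays the vhost thread in destroy_device() */
{
    pthread_t tid;
    unsigned long seen;

    pthread_create(&tid, NULL, watchdog_like, NULL);
    sleep(2);                           /* let the "watchdog" loop a few times */

    pthread_mutex_lock(&big_lock);
    seen = iterations;
    printf("holding the lock, waiting for the watchdog to loop again...\n");
    while (iterations == seen) {        /* stands in for ovsrcu_synchronize() */
        sleep(1);                       /* never ends: the watchdog is stuck on big_lock */
    }
    pthread_mutex_unlock(&big_lock);

    return 0;
}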
One solution that we came up with, which resolves the deadlock, is shown below.
I'm wondering if we missed any important aspect of the RCU design, since our
conclusion was that the dpdk_mutex should not be held across the
ovsrcu_synchronize() call.
@@ -1759,21 +1759,23 @@ destroy_device(volatile struct virtio_net *dev)
             dev->flags &= ~VIRTIO_DEV_RUNNING;
             ovsrcu_set(&vhost_dev->virtio_dev, NULL);
+        }
+    }
+
+    ovs_mutex_unlock(&dpdk_mutex);
     /*
      * Wait for other threads to quiesce before
      * setting the virtio_dev to NULL.
      */
     ovsrcu_synchronize();
     /*
      * As call to ovsrcu_synchronize() will end the quiescent state,
      * put thread back into quiescent state before returning.
      */
     ovsrcu_quiesce_start();
-        }
-    }
-
-    ovs_mutex_unlock(&dpdk_mutex);
     VLOG_INFO("vHost Device '%s' (%ld) has been removed",
               dev->ifname, dev->device_fh);
 }
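For readability, with that hunk applied the tail of the function ends up
roughly like this (the same lines that appear in the diff, just re-indented,
with the comments reworded to explain the intent):

            dev->flags &= ~VIRTIO_DEV_RUNNING;
            ovsrcu_set(&vhost_dev->virtio_dev, NULL);
        }
    }
    /* drop dpdk_mutex *before* waiting, so that dpdk_watchdog (or any other
     * thread that takes dpdk_mutex) can still reach its next quiescent point */
    ovs_mutex_unlock(&dpdk_mutex);

    /* wait for the other threads to quiesce before the virtio_dev is
     * considered gone ... */
    ovsrcu_synchronize();
    /* ... and, since ovsrcu_synchronize() ends the quiescent state, put this
     * thread back into the quiescent state before returning */
    ovsrcu_quiesce_start();

    VLOG_INFO("vHost Device '%s' (%ld) has been removed",
              dev->ifname, dev->device_fh);
}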
Any thoughts on this would be appreciated.
Hi Patrik, that seems reasonable to me. synchronize()/quiesce_start() would not
need to be called if we don't set a virtio_dev to NULL, and as an aside we don't
need to keep looking through the list once we've found the device.
I've posted a modified version of your code snippet; let me know what you think:
http://openvswitch.org/pipermail/dev/2016-February/065740.html
Kevin.
Regards,
Patrik
On 01/27/2016 08:26 AM, Patrik Andersson R wrote:
Hi,
thank you for the link to the patch. I will try that out when I get a chance.
I don't yet have a backtrace for the case where the tracing indicates the
"main" thread; it does not happen that often.
For the "watchdog3" issue though, we seem to be waiting to acquire a
mutex:
#0  (LWP 8377) "dpdk_watchdog3" in __lll_lock_wait ()
#1  in _L_lock_909
#2  in __GI___pthread_mutex_lock (mutex=0x955680)
...

The thread that currently holds the mutex is:

Thread (LWP 8378) "vhost_thread2" in poll ()

Mutex data: p *(pthread_mutex_t*)0x955680
  __lock = 2,
  __count = 0,
  __owner = 8378,
  __nusers = 1,
  __kind = 2,
  __spins = 0,
  __elision = 0,
  ...
Any ideas on this would be appreciated.
Regards,
Patrik
On 01/27/2016 06:18 AM, Daniele Di Proietto wrote:
Hi,
Ben, it turns out I was wrong; this appears to be a genuine bug in
dpif-netdev.
I sent a fix that I believe might be related to the bug observed here:
http://openvswitch.org/pipermail/dev/2016-January/065073.html
Otherwise it would be interesting to get a backtrace of the main thread
from gdb to investigate further.
Thanks,
Daniele
On 26/01/2016 00:23, "Iezzi, Federico" <federico.ie...@hpe.com> wrote:
Hi there,
I have the same issue with OVS 2.4 (latest commit on branch 2.4) and
DPDK 2.0.0 in a Debian 8 environment.
After a while it just gets stuck.
Regards,
Federico
-----Original Message-----
From: discuss [mailto:discuss-boun...@openvswitch.org] On Behalf Of Ben
Pfaff
Sent: Tuesday, January 26, 2016 7:13 AM
To: Daniele di Proietto <diproiet...@vmware.com>
Cc: discuss@openvswitch.org
Subject: Re: [ovs-discuss] dpdk watchdog stuck?
Daniele, I think that you said in our meeting today that there was some
sort of bug that falsely blames a thread. Can you explain further?
On Mon, Jan 25, 2016 at 09:29:52PM +0100, Patrik Andersson R wrote:
Right, that is most likely it; I will look there first.
What do you think of the case where the thread is "main"? I've got
examples of this as well, but so far I have not been able to figure out
what would cause it.
...
ovs-vswitchd.log.1.1.1.1:2016-01-23T01:47:19.026Z|00016|ovs_rcu(urcu2)|WARN|blocked 32768000 ms waiting for main to quiesce
ovs-vswitchd.log.1.1.1.1:2016-01-23T10:53:27.026Z|00017|ovs_rcu(urcu2)|WARN|blocked 65536000 ms waiting for main to quiesce
ovs-vswitchd.log.1.1.1.1:2016-01-24T05:05:43.026Z|00018|ovs_rcu(urcu2)|WARN|blocked 131072000 ms waiting for main to quiesce
ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:40.826Z|00001|ovs_rcu(urcu1)|WARN|blocked 1092 ms waiting for main to quiesce
ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:41.805Z|00002|ovs_rcu(urcu1)|WARN|blocked 2072 ms waiting for main to quiesce
...
Could it be connected with the deletion of a netdev port?
Regards,
Patrik
On 01/25/2016 07:50 PM, Ben Pfaff wrote:
On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
during robustness testing, where VMs are booted and deleted using
nova boot/delete in rather rapid succession, VMs get stuck in the
spawning state after a few test cycles. Presumably this is because
OVS no longer responds to port additions and deletions, or rather
because responses to these requests become painfully slow. Other
requests towards the vswitchd also fail to complete in any reasonable
time frame; ovs-appctl vlog/set is one example.
The only conclusion I can draw at the moment is that some thread
(I've observed both main and dpdk_watchdog3) blocks the
ovsrcu_synchronize() operation for an "infinite" time and there is no
fallback to get out of this.
To recover, the minimum action seems to be a restart of the
openvswitch-switch service, but that seems to cause other issues in the
longer term.
In the vswitchd log, the following can be observed when this happens:
2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms waiting for dpdk_watchdog3 to quiesce
This looks like a bug somewhere in the DPDK code. The watchdog code
is really simple:
static void *
dpdk_watchdog(void *dummy OVS_UNUSED)
{
    struct netdev_dpdk *dev;

    pthread_detach(pthread_self());

    for (;;) {
        ovs_mutex_lock(&dpdk_mutex);
        LIST_FOR_EACH (dev, list_node, &dpdk_list) {
            ovs_mutex_lock(&dev->mutex);
            check_link_status(dev);
            ovs_mutex_unlock(&dev->mutex);
        }
        ovs_mutex_unlock(&dpdk_mutex);
        xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
    }

    return NULL;
}
Although it looks at first glance like it doesn't quiesce, xsleep()
does that internally, so I guess check_link_status() must be hanging.
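For what it's worth, the reason xsleep() counts as quiescing is that it
brackets the sleep with the RCU quiesce calls; from memory it is essentially
the following (paraphrased, check lib/timeval.c for the real thing):

void
xsleep(unsigned int seconds)
{
    ovsrcu_quiesce_start();   /* the thread is quiescent for the whole sleep */
    sleep(seconds);
    ovsrcu_quiesce_end();
}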