Just to give a solution for the archives, we ended up upgrading the
kernel from the 2.6.32 EL6 kernel to the 3.10 kernel-lt elrepo kernel,
and have not seen a recurrence of the ovs lockup.
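For anyone else hitting this, the upgrade itself was roughly the following (a sketch only; check elrepo.org for the current elrepo-release RPM for EL6 before running anything):

```shell
# Sketch: install the 3.10 long-term kernel (kernel-lt) from elrepo
# on CentOS 6. Install the elrepo-release RPM from elrepo.org first.
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum --enablerepo=elrepo-kernel install kernel-lt
# Point the default= entry in /boot/grub/grub.conf at the new kernel,
# then reboot.
```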
Jeff
On 02/28/2014 03:00 PM, Jeff Bachtel wrote:
Does anyone have any insight into this? For further datapoints, I
built the 2.0 release and much more current openvswitch snapshots
(most recently to commit bdeadfdd) which exhibited the same problems.
The CentOS 6 kernel is 2.6.32. Because of a presumed incompatibility,
I made sure the Linux bridge module wasn't being
loaded. On a host where ovsdb-server had not yet become unresponsive,
ovs-vswitchd was unkillable, in state R<L. Could my problem be related
to ovs-vswitchd becoming unresponsive under load, taking ovsdb-server
down with it?
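(For anyone trying to reproduce this: the state letters come from ps. A quick way to see what state both daemons are in, and which kernel function a stuck one is blocked in, is something like:)

```shell
# STAT letters: R = running, D = uninterruptible sleep,
# "<" = high priority, "L" = pages locked in memory.
# WCHAN is the kernel function a sleeping task is blocked in; for a
# wedged task this is often the most useful single clue.
ps -eo pid,stat,wchan:30,comm | egrep 'ovs-vswitchd|ovsdb-server'
```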
I've received further confirmation that this is involved in some way
with load, as a node inadvertently disconnected from the rest of the
Ceph cluster had a record uptime with openvswitch. If anyone can give
me pointers on getting a backtrace, I'm happy to run things until
failure and collect better data; strace in particular hasn't gotten me
anywhere. As it is, I've cron'd a restart of openvswitch every minute,
which is obviously far from ideal.
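For what it's worth, when strace can't attach you can still get kernel-side backtraces out of a wedged process, assuming root and a kernel built with stack-trace support:

```shell
# Read one task's kernel stack directly (present on 2.6.32 when
# CONFIG_STACKTRACE is enabled; pidof may return several PIDs):
cat /proc/$(pidof ovsdb-server)/stack

# Or dump stacks for every task into the kernel log via magic SysRq,
# then fish the relevant one out of dmesg:
echo t > /proc/sysrq-trigger
dmesg | grep -A 20 ovsdb-server
```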
Thanks for any help,
Jeff
On 02/20/2014 12:54 AM, Jeff Bachtel wrote:
I'm running Open vSwitch 1.11 from the RDO Havana repository. In
addition, I'm running OpenStack Havana, Neutron, and Ceph Emperor,
all on some CentOS 6.5 machines.
After installing Bacula on the previous openstack version (grizzly),
I noticed the networking had become somewhat load sensitive.
ovsdb-server was freezing - not responding to queries on its unix
socket and becoming unkillable in process state R<. Believing that
it was probably due to being behind in ovs version, I pushed ahead
with an upgrade, only to find my stability problems become much
worse. Every 20-30 minutes I can count on an ovsdb-server process
freezing.
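A quick liveness probe that distinguishes "busy" from "wedged" (the socket path below is the stock EL6 default; adjust if yours differs):

```shell
# If either of these hangs past the timeout, the daemon is wedged
# rather than just slow.
timeout 5 ovsdb-client list-dbs unix:/var/run/openvswitch/db.sock \
    || echo "ovsdb-server not answering JSON-RPC"
timeout 5 ovs-appctl -t ovsdb-server version \
    || echo "ovsdb-server control channel dead"
```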
At
https://drive.google.com/folderview?id=0B-wx2_T_hW-_OXZJWGJNc0l0MzQ&usp=sharing
please find a folder with shared copies of diagnostic files from a
machine with a hung ovsdb-server. There is a process list (.ps;
apologies, I didn't catch the PostScript-looking extension until after
upload), an strace, dmesg, and /var/log/messages.
The strace didn't reveal anything suspicious to me. To mitigate, I
tried lowering log verbosity, completely recreating conf.db,
compacting frequently (every minute), and putting the db on a
ramdisk; none of these solved it.
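For reference, the compaction and ramdisk mitigations looked roughly like this (paths are the stock EL6 ones; adjust for your layout):

```shell
# On-demand compaction via the daemon's control channel (what the
# every-minute cron job called):
ovs-appctl -t ovsdb-server ovsdb-server/compact

# Ramdisk variant: mount tmpfs over the database directory before
# starting ovsdb-server. Contents are lost on reboot, so keep a copy
# of conf.db somewhere persistent.
mount -t tmpfs -o size=64m tmpfs /etc/openvswitch
```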
The ovsdb-server processes most likely to succumb to locking run on
Ceph hosts running OSDs, meaning they see a lot of network
traffic as well as disk I/O.
I don't understand what a simple database RPC server could be doing
that would cause it to become unkillable, especially after my attempt
to minimize disk I/O by putting the db file on a ramdisk.
I hope someone has some ideas of what I might do to test or mitigate
the situation. Not running ceph osd on the hosts is, unfortunately,
not a solution I can use.
Thanks,
Jeff
_______________________________________________
discuss mailing list
discuss@openvswitch.org
http://openvswitch.org/mailman/listinfo/discuss