Hi,
Recently we saw a pool-wide storage fence in a XenServer 6.2 SP1
environment. The hosts were running OVS 1.4.6+, built from that branch
up to commit e58f54a4aa1.
Within a 20-minute period all 4 hosts in the pool lost network
connectivity and HA restarted them automatically. First ovsdb-server
logged heartbeat issues on its UNIX domain socket, shortly afterwards
the iSCSI connections started to fail, and finally ovs-vswitchd
reported that the UNIX domain socket was broken, always about 30
seconds after ovsdb-server.
It happened the same way on each host, at nearly the same time, and
after the restart they were stable for an hour, but then the problems
started again, so the network backend was reverted to bridge.
The OOM killer did not kick in, so there was probably enough memory,
and there is no sign that the disk filled up either.
Prior to this issue I had often seen this kind of message in the log,
but I'm not sure whether it has anything to do with it:
May 15 16:22:47 host ovs-vswitchd: 21539|dpif|WARN|system@xapi3: recv
failed (No buffer space available)
May 15 16:22:47 host ovs-vswitchd: 21540|dpif|WARN|system@xapi3: recv
failed (No buffer space available)
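If it matters, my understanding is that ENOBUFS on that recv usually
means the kernel-to-userspace netlink upcall socket overflowed because
upcalls arrived faster than ovs-vswitchd drained them. As a quick check,
something like the rough snippet below (just a sketch of my own, not OVS
tooling; the /proc/sys paths are the standard ones) could dump the
kernel socket buffer limits on a host:

#!/usr/bin/env python
# Rough sketch (my own, not OVS tooling): print the kernel's default and
# maximum socket receive buffer limits, which normally bound how much a
# netlink upcall socket can queue before recv() starts failing with
# ENOBUFS. Adjust the paths if dom0 exposes them elsewhere.

def read_sysctl(path):
    f = open(path)
    try:
        return int(f.read().strip())
    finally:
        f.close()

for name in ("rmem_default", "rmem_max"):
    value = read_sysctl("/proc/sys/net/core/" + name)
    print("%s = %d bytes" % (name, value))

Whether buffer sizing is actually related to the fence is unclear to me,
so treat this purely as a diagnostic idea.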
I'm including some log excerpts from one server:
May 15 16:22:58 host ovsdb-server:
00023|reconnect|ERR|unix:/tmp/stream-unix.5393.0: no response to
inactivity probe after 5.01 seconds, disconnecting
May 15 16:23:11 host kernel: [3328870.137299] connection3:0: ping
timeout of 15 secs expired, recv timeout 10, last rx 332854265, last
ping 332855265, now 332856765
May 15 16:23:11 host kernel: [3328870.137323] connection3:0: detected
conn error (1011)
May 15 16:23:12 host iscsid: Kernel reported iSCSI connection 3:0 error
(1011) state (3)
May 15 16:23:12 host kernel: [3328871.019145] connection2:0: ping
timeout of 15 secs expired, recv timeout 10, last rx 332854353, last
ping 332855353, now 332856853
May 15 16:23:12 host kernel: [3328871.019167] connection2:0: detected
conn error (1011)
May 15 16:23:13 host iscsid: Kernel reported iSCSI connection 2:0 error
(1011) state (3)
...
May 15 16:23:27 host ovs-vswitchd:
21541|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: send error: Broken
pipe
May 15 16:23:27 host ovs-vswitchd: 21542|timeval|WARN|40509 ms poll
interval (0 ms user, 32590 ms system) is over 4324 times the weighted
mean interval 2 ms (52404048 samples)
May 15 16:23:27 host ovs-vswitchd: 21543|timeval|WARN|context switches:
0 voluntary, 879 involuntary
May 15 16:23:27 host ovs-vswitchd: 21544|coverage|INFO|Event coverage
(epoch 52404048/entire run), hash=37fe45f6:
May 15 16:23:27 host ovs-vswitchd:
21545|coverage|INFO|ofproto_dpif_xlate 4617 / 493556267
May 15 16:23:27 host ovs-vswitchd: 21546|coverage|INFO|facet_revalidate
1 / 3332257
May 15 16:23:27 host ovs-vswitchd: 21547|coverage|INFO|dpif_flow_del
1786 / 91737202
May 15 16:23:27 host ovs-vswitchd: 21548|coverage|INFO|flow_extract
150 / 99810556
May 15 16:23:27 host ovs-vswitchd: 21549|coverage|INFO|hmap_expand
53 / 828100019
May 15 16:23:27 host ovs-vswitchd:
21550|coverage|INFO|mac_learning_learned 1 / 1950471
May 15 16:23:27 host ovs-vswitchd:
21551|coverage|INFO|mac_learning_expired 33 / 1949928
May 15 16:23:27 host ovs-vswitchd: 21552|coverage|INFO|netdev_get_stats
15 / 7983394
May 15 16:23:27 host ovs-vswitchd: 21553|coverage|INFO|poll_fd_wait
10 / 577246726
May 15 16:23:27 host ovs-vswitchd: 21554|coverage|INFO|poll_zero_timeout
1 / 537392
May 15 16:23:27 host ovs-vswitchd: 21555|coverage|INFO|unixctl_received
1 / 663108
May 15 16:23:27 host ovs-vswitchd: 21556|coverage|INFO|unixctl_replied
1 / 663108
May 15 16:23:27 host ovs-vswitchd: 21557|coverage|INFO|util_xalloc
23796 / 22430418786
May 15 16:23:27 host ovs-vswitchd: 21558|coverage|INFO|netdev_ethtool
30 / 15967005
May 15 16:23:27 host ovs-vswitchd: 21559|coverage|INFO|netlink_received
3888 / 401972766
May 15 16:23:27 host ovs-vswitchd:
21560|coverage|INFO|netlink_recv_jumbo 92 / 31719860
May 15 16:23:27 host ovs-vswitchd: 21561|coverage|INFO|netlink_sent
1983 / 303430371
May 15 16:23:27 host ovs-vswitchd:
21562|coverage|INFO|bridge_reconfigure 0 / 57
May 15 16:23:27 host ovs-vswitchd: 21563|coverage|INFO|ofproto_flush
0 / 1
May 15 16:23:27 host ovs-vswitchd:
21564|coverage|INFO|ofproto_recv_openflow 0 / 82
May 15 16:23:27 host ovs-vswitchd:
21565|coverage|INFO|ofproto_update_port 0 / 175
May 15 16:23:27 host ovs-vswitchd: 21566|coverage|INFO|facet_unexpected
0 / 9856
May 15 16:23:27 host ovs-vswitchd: 21567|coverage|INFO|dpif_port_add
0 / 34
May 15 16:23:27 host ovs-vswitchd: 21568|coverage|INFO|dpif_flow_flush
0 / 2
May 15 16:23:27 host ovs-vswitchd: 21569|coverage|INFO|dpif_flow_put
0 / 302480
May 15 16:23:27 host ovs-vswitchd: 21570|coverage|INFO|dpif_execute
0 / 2
May 15 16:23:27 host ovs-vswitchd: 21571|coverage|INFO|dpif_purge
0 / 1
May 15 16:23:27 host ovs-vswitchd: 21572|coverage|INFO|hmap_pathological
0 / 2439
May 15 16:23:27 host ovs-vswitchd: 21573|coverage|INFO|rconn_queued
0 / 41
May 15 16:23:27 host ovs-vswitchd: 21574|coverage|INFO|rconn_sent
0 / 41
May 15 16:23:27 host ovs-vswitchd: 21575|coverage|INFO|pstream_open
0 / 2
May 15 16:23:27 host ovs-vswitchd: 21576|coverage|INFO|stream_open
0 / 1
May 15 16:23:27 host ovs-vswitchd: 21577|coverage|INFO|vconn_received
0 / 123
May 15 16:23:27 host ovs-vswitchd: 21578|coverage|INFO|vconn_sent
0 / 82
May 15 16:23:27 host ovs-vswitchd:
21579|coverage|INFO|netdev_set_policing 0 / 601
May 15 16:23:27 host ovs-vswitchd:
21580|coverage|INFO|netdev_get_ifindex 0 / 71
May 15 16:23:27 host ovs-vswitchd: 21581|coverage|INFO|netdev_get_hwaddr
0 / 155
May 15 16:23:27 host ovs-vswitchd: 21582|coverage|INFO|netdev_set_hwaddr
0 / 3
May 15 16:23:27 host ovs-vswitchd: 21583|coverage|INFO|nln_changed
0 / 312
May 15 16:23:27 host ovs-vswitchd: 21584|coverage|INFO|netlink_overflow
0 / 127
May 15 16:23:27 host ovs-vswitchd: 21585|coverage|INFO|38 events never hit
May 15 16:23:27 host ovs-vswitchd: 21586|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd:
21587|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection
dropped (Broken pipe)
May 15 16:23:27 host ovs-vswitchd: 21588|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21589|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21590|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21591|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21592|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21593|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21594|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21595|poll_loop|WARN|wakeup due to
[POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at
lib/dpif-linux.c:1197 (76% CPU usage)
...
[these messages kept coming for a while before the restart]
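If I read the timeval warning above correctly, ovs-vswitchd's main loop
was blocked for about 40 seconds (almost entirely in system time) right
across the window where the inactivity probe and the iSCSI pings timed
out. To check whether similar stalls line up with the failures on the
other hosts, something like this rough script (just a sketch; the log
path and message format are assumed from the excerpt above) could pull
the warnings out:

#!/usr/bin/env python
# Rough sketch: grep the "timeval|WARN|... ms poll interval" warnings out
# of syslog so the long ovs-vswitchd main-loop stalls can be lined up
# against the iSCSI ping timeouts on each host. The default log path and
# the message format are assumptions based on the excerpt above.

import re
import sys

PATTERN = re.compile(r'timeval\|WARN\|(\d+) ms poll interval')
THRESHOLD_MS = 5000  # only report stalls longer than the 5 s probe timeout

path = '/var/log/messages'
if len(sys.argv) > 1:
    path = sys.argv[1]

for line in open(path):
    m = PATTERN.search(line)
    if m and int(m.group(1)) >= THRESHOLD_MS:
        # the syslog timestamp is the first three whitespace-separated fields
        print(' '.join(line.split()[:3]) + '  stall: ' + m.group(1) + ' ms')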
Does anyone have any idea where to look further?
Regards,
Zoli