Hi,

Recently we saw a pool-wide storage fence in a XenServer 6.2 SP1 environment. OVS 1.4.6+ was in use, with commits from that branch up to e58f54a4aa1. Within a 20-minute period all 4 hosts in the pool lost network connectivity and HA restarted them automatically. First ovsdb-server logged heartbeat issues on the UNIX domain socket, shortly afterwards the iSCSI connections started to fail, and finally ovs-vswitchd reported that the UNIX domain socket was broken, always about 30 seconds after ovsdb-server. It happened the same way on each host, at nearly the same time. After the restart the hosts were stable for about an hour, but then the problems started again, so the network backend was reverted to bridge. The OOM killer was not invoked, so there was probably enough memory, and there is no sign that the disk filled up either. Prior to this issue I had often seen the following kind of messages in the log, but I'm not sure whether they have anything to do with it:

May 15 16:22:47 host ovs-vswitchd: 21539|dpif|WARN|system@xapi3: recv failed (No buffer space available)
May 15 16:22:47 host ovs-vswitchd: 21540|dpif|WARN|system@xapi3: recv failed (No buffer space available)
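
If it's relevant: the sketch below is just how I would check the global socket receive-buffer limits; that these limits are what bound the upcall sockets behind the ENOBUFS above is only my assumption.

#!/usr/bin/env python
# Print the kernel-wide socket receive-buffer limits.  My unverified
# assumption is that these bound the netlink sockets ovs-vswitchd uses
# for datapath upcalls, i.e. where the ENOBUFS above would come from.
for name in ('net/core/rmem_default', 'net/core/rmem_max'):
    with open('/proc/sys/' + name) as f:
        print('%s = %s' % (name.replace('/', '.'), f.read().strip()))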

I'm including some log excerpts from one server:

May 15 16:22:58 host ovsdb-server: 00023|reconnect|ERR|unix:/tmp/stream-unix.5393.0: no response to inactivity probe after 5.01 seconds, disconnecting
May 15 16:23:11 host kernel: [3328870.137299] connection3:0: ping timeout of 15 secs expired, recv timeout 10, last rx 332854265, last ping 332855265, now 332856765
May 15 16:23:11 host kernel: [3328870.137323] connection3:0: detected conn error (1011)
May 15 16:23:12 host iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
May 15 16:23:12 host kernel: [3328871.019145] connection2:0: ping timeout of 15 secs expired, recv timeout 10, last rx 332854353, last ping 332855353, now 332856853
May 15 16:23:12 host kernel: [3328871.019167] connection2:0: detected conn error (1011)
May 15 16:23:13 host iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
...
May 15 16:23:27 host ovs-vswitchd: 21541|jsonrpc|WARN|unix:/var/run/openvswitch/db.sock: send error: Broken pipe
May 15 16:23:27 host ovs-vswitchd: 21542|timeval|WARN|40509 ms poll interval (0 ms user, 32590 ms system) is over 4324 times the weighted mean interval 2 ms (52404048 samples)
May 15 16:23:27 host ovs-vswitchd: 21543|timeval|WARN|context switches: 0 voluntary, 879 involuntary
May 15 16:23:27 host ovs-vswitchd: 21544|coverage|INFO|Event coverage (epoch 52404048/entire run), hash=37fe45f6:
May 15 16:23:27 host ovs-vswitchd: 21545|coverage|INFO|ofproto_dpif_xlate 4617 / 493556267
May 15 16:23:27 host ovs-vswitchd: 21546|coverage|INFO|facet_revalidate 1 / 3332257
May 15 16:23:27 host ovs-vswitchd: 21547|coverage|INFO|dpif_flow_del 1786 / 91737202
May 15 16:23:27 host ovs-vswitchd: 21548|coverage|INFO|flow_extract 150 / 99810556
May 15 16:23:27 host ovs-vswitchd: 21549|coverage|INFO|hmap_expand 53 / 828100019
May 15 16:23:27 host ovs-vswitchd: 21550|coverage|INFO|mac_learning_learned 1 / 1950471
May 15 16:23:27 host ovs-vswitchd: 21551|coverage|INFO|mac_learning_expired 33 / 1949928
May 15 16:23:27 host ovs-vswitchd: 21552|coverage|INFO|netdev_get_stats 15 / 7983394
May 15 16:23:27 host ovs-vswitchd: 21553|coverage|INFO|poll_fd_wait 10 / 577246726
May 15 16:23:27 host ovs-vswitchd: 21554|coverage|INFO|poll_zero_timeout 1 / 537392
May 15 16:23:27 host ovs-vswitchd: 21555|coverage|INFO|unixctl_received 1 / 663108
May 15 16:23:27 host ovs-vswitchd: 21556|coverage|INFO|unixctl_replied 1 / 663108
May 15 16:23:27 host ovs-vswitchd: 21557|coverage|INFO|util_xalloc 23796 / 22430418786
May 15 16:23:27 host ovs-vswitchd: 21558|coverage|INFO|netdev_ethtool 30 / 15967005
May 15 16:23:27 host ovs-vswitchd: 21559|coverage|INFO|netlink_received 3888 / 401972766
May 15 16:23:27 host ovs-vswitchd: 21560|coverage|INFO|netlink_recv_jumbo 92 / 31719860
May 15 16:23:27 host ovs-vswitchd: 21561|coverage|INFO|netlink_sent 1983 / 303430371
May 15 16:23:27 host ovs-vswitchd: 21562|coverage|INFO|bridge_reconfigure 0 / 57
May 15 16:23:27 host ovs-vswitchd: 21563|coverage|INFO|ofproto_flush 0 / 1
May 15 16:23:27 host ovs-vswitchd: 21564|coverage|INFO|ofproto_recv_openflow 0 / 82
May 15 16:23:27 host ovs-vswitchd: 21565|coverage|INFO|ofproto_update_port 0 / 175
May 15 16:23:27 host ovs-vswitchd: 21566|coverage|INFO|facet_unexpected 0 / 9856
May 15 16:23:27 host ovs-vswitchd: 21567|coverage|INFO|dpif_port_add 0 / 34
May 15 16:23:27 host ovs-vswitchd: 21568|coverage|INFO|dpif_flow_flush 0 / 2
May 15 16:23:27 host ovs-vswitchd: 21569|coverage|INFO|dpif_flow_put 0 / 302480
May 15 16:23:27 host ovs-vswitchd: 21570|coverage|INFO|dpif_execute 0 / 2
May 15 16:23:27 host ovs-vswitchd: 21571|coverage|INFO|dpif_purge 0 / 1
May 15 16:23:27 host ovs-vswitchd: 21572|coverage|INFO|hmap_pathological 0 / 2439
May 15 16:23:27 host ovs-vswitchd: 21573|coverage|INFO|rconn_queued 0 / 41
May 15 16:23:27 host ovs-vswitchd: 21574|coverage|INFO|rconn_sent 0 / 41
May 15 16:23:27 host ovs-vswitchd: 21575|coverage|INFO|pstream_open 0 / 2
May 15 16:23:27 host ovs-vswitchd: 21576|coverage|INFO|stream_open 0 / 1
May 15 16:23:27 host ovs-vswitchd: 21577|coverage|INFO|vconn_received 0 / 123
May 15 16:23:27 host ovs-vswitchd: 21578|coverage|INFO|vconn_sent 0 / 82
May 15 16:23:27 host ovs-vswitchd: 21579|coverage|INFO|netdev_set_policing 0 / 601
May 15 16:23:27 host ovs-vswitchd: 21580|coverage|INFO|netdev_get_ifindex 0 / 71
May 15 16:23:27 host ovs-vswitchd: 21581|coverage|INFO|netdev_get_hwaddr 0 / 155
May 15 16:23:27 host ovs-vswitchd: 21582|coverage|INFO|netdev_set_hwaddr 0 / 3
May 15 16:23:27 host ovs-vswitchd: 21583|coverage|INFO|nln_changed 0 / 312
May 15 16:23:27 host ovs-vswitchd: 21584|coverage|INFO|netlink_overflow 0 / 127
May 15 16:23:27 host ovs-vswitchd: 21585|coverage|INFO|38 events never hit
May 15 16:23:27 host ovs-vswitchd: 21586|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21587|reconnect|WARN|unix:/var/run/openvswitch/db.sock: connection dropped (Broken pipe)
May 15 16:23:27 host ovs-vswitchd: 21588|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21589|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21590|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21591|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21592|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21593|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21594|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
May 15 16:23:27 host ovs-vswitchd: 21595|poll_loop|WARN|wakeup due to [POLLIN] on fd 18 (unknown anon_inode:[eventpoll]) at lib/dpif-linux.c:1197 (76% CPU usage)
...
[these kept coming for a while before the restart]
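
To line the stalls up across the four hosts, I would use something like the sketch below to pull the timeval warnings out of syslog; the /var/log/messages path and the message layout are only assumptions based on the excerpts above.

#!/usr/bin/env python
# Rough helper: extract the "NNN ms poll interval" warnings from syslog
# so the stall windows on different hosts can be compared side by side.
import re
import sys

PAT = re.compile(r'^(\w+\s+\d+ [\d:]+) (\S+) ovs-vswitchd: '
                 r'\d+\|timeval\|WARN\|(\d+) ms poll interval')

path = sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages'
with open(path) as log:
    for line in log:
        m = PAT.match(line)
        if m:
            stamp, host, ms = m.groups()
            print('%s %s: main loop stalled for %s ms' % (stamp, host, ms))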

Does anyone have any idea where to look further?

Regards,

Zoli