Hi Guys, I have just had a VMHost randomly disconnect in production and subsequently take down some VMs. I have attached the logs (happened to be running agent trace on this node), but it would seem that the agent (or management?) waited 25 seconds before erroring, and then the cloudstack agent froze until 1800. I assume the agent syslog stack traces were caused by force closes of VMs, no other nodes were affected during this time period.
While the host was in disconnect mode, I could connect to a VM which was running on that host, although Cloudstack was already reporting that is was down. Would it be a good idea to ping VM's (their allocated IPs before attempting to start them on other nodes - especially in a HA setup)? If someone could look at the logs and let me know if there is something obvious it would be most appreciated, I have included the management bond for reference that the link didn't go down. Thanks in advance, Marty
Aug 17 17:40:11 aurora dnsmasq-dhcp[6150]: DHCP packet received on brbond0-152 which has no address Aug 17 17:40:11 aurora dnsmasq-dhcp[6150]: DHCP packet received on brbond0-152 which has no address Aug 17 17:41:53 aurora puppet-agent[9852]: Finished catalog run in 7.21 seconds Aug 17 17:59:01 aurora kernel: [522967.085451] INFO: task kvm:8402 blocked for more than 120 seconds. Aug 17 17:59:01 aurora kernel: [522967.085788] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 17 17:59:01 aurora kernel: [522967.086214] kvm D ffffffff8180cbe0 0 8402 1 0x00000000 Aug 17 17:59:01 aurora kernel: [522967.086219] ffff881cf3c43cd8 0000000000000082 ffff881cf3c43cb8 ffffffff81080cd8 Aug 17 17:59:01 aurora kernel: [522967.086226] ffff881cf3c43fd8 ffff881cf3c43fd8 ffff881cf3c43fd8 00000000000139c0 Aug 17 17:59:01 aurora kernel: [522967.086230] ffff881f92dc9700 ffff881c0cdfdc00 0000000000000282 ffff883f912d0a80 Aug 17 17:59:01 aurora kernel: [522967.086235] Call Trace: Aug 17 17:59:01 aurora kernel: [522967.086244] [<ffffffff81080cd8>] ? __wake_up_common+0x58/0x90 Aug 17 17:59:01 aurora kernel: [522967.086250] [<ffffffff8169bee9>] schedule+0x29/0x70 Aug 17 17:59:01 aurora kernel: [522967.086255] [<ffffffff81058f15>] exit_mm+0x85/0x130 Aug 17 17:59:01 aurora kernel: [522967.086259] [<ffffffff81059131>] do_exit+0x171/0x480 Aug 17 17:59:01 aurora kernel: [522967.086264] [<ffffffff81066cfa>] ? __dequeue_signal+0x6a/0xb0 Aug 17 17:59:01 aurora kernel: [522967.086267] [<ffffffff810595e4>] do_group_exit+0x44/0xa0 Aug 17 17:59:01 aurora kernel: [522967.086271] [<ffffffff810697fb>] get_signal_to_deliver+0x22b/0x440 Aug 17 17:59:01 aurora kernel: [522967.086277] [<ffffffff810147e9>] do_signal+0x29/0x130 Aug 17 17:59:01 aurora kernel: [522967.086282] [<ffffffff810affcc>] ? do_futex+0x7c/0x1b0 Aug 17 17:59:01 aurora kernel: [522967.086288] [<ffffffff8119a10a>] ? do_vfs_ioctl+0x8a/0x340 Aug 17 17:59:01 aurora kernel: [522967.086291] [<ffffffff810b0247>] ? sys_futex+0x147/0x1a0 Aug 17 17:59:01 aurora kernel: [522967.086295] [<ffffffff810149a0>] do_notify_resume+0x90/0xd0 Aug 17 17:59:01 aurora kernel: [522967.086300] [<ffffffff816a5922>] int_signal+0x12/0x17 Aug 17 17:59:01 aurora kernel: [522967.086303] INFO: task kvm:8403 blocked for more than 120 seconds. Aug 17 17:59:01 aurora kernel: [522967.086625] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 17 17:59:01 aurora kernel: [522967.087050] kvm D ffffffff8180cbe0 0 8403 1 0x00000000 Aug 17 17:59:01 aurora kernel: [522967.087053] ffff881f7913bcd8 0000000000000082 0000000000000000 ffffffffffffffe0 Aug 17 17:59:01 aurora kernel: [522967.087058] ffff881f7913bfd8 ffff881f7913bfd8 ffff881f7913bfd8 00000000000139c0 Aug 17 17:59:01 aurora kernel: [522967.087062] ffff881f92db1700 ffff881c0cdf8000 ffff881c0cdf8000 ffff883f912d0a80 Aug 17 17:59:01 aurora kernel: [522967.087067] Call Trace: Aug 17 17:59:01 aurora kernel: [522967.087071] [<ffffffff8169bee9>] schedule+0x29/0x70 Aug 17 17:59:01 aurora kernel: [522967.087074] [<ffffffff81058f15>] exit_mm+0x85/0x130 Aug 17 17:59:01 aurora kernel: [522967.087077] [<ffffffff81059131>] do_exit+0x171/0x480 Aug 17 17:59:01 aurora kernel: [522967.087081] [<ffffffff81066cfa>] ? __dequeue_signal+0x6a/0xb0 Aug 17 17:59:01 aurora kernel: [522967.087084] [<ffffffff810595e4>] do_group_exit+0x44/0xa0 Aug 17 17:59:01 aurora kernel: [522967.087088] [<ffffffff810697fb>] get_signal_to_deliver+0x22b/0x440 Aug 17 17:59:01 aurora kernel: [522967.087092] [<ffffffff810147e9>] do_signal+0x29/0x130 Aug 17 17:59:01 aurora kernel: [522967.087095] [<ffffffff810affcc>] ? do_futex+0x7c/0x1b0 Aug 17 17:59:01 aurora kernel: [522967.087099] [<ffffffff8119a10a>] ? do_vfs_ioctl+0x8a/0x340 Aug 17 17:59:01 aurora kernel: [522967.087102] [<ffffffff810b0247>] ? sys_futex+0x147/0x1a0 Aug 17 17:59:01 aurora kernel: [522967.087106] [<ffffffff810149a0>] do_notify_resume+0x90/0xd0 Aug 17 17:59:01 aurora kernel: [522967.087109] [<ffffffff816a5922>] int_signal+0x12/0x17 Aug 17 17:59:01 aurora kernel: [522967.087112] INFO: task kvm:17450 blocked for more than 120 seconds. Aug 17 17:59:01 aurora kernel: [522967.087436] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 17 17:59:01 aurora kernel: [522967.098800] kvm D ffffffff8180cbe0 0 17450 1 0x00000000 Aug 17 17:59:01 aurora kernel: [522967.098823] ffff88140dfc3cd8 0000000000000082 0000000000000000 ffffffffffffffe0 Aug 17 17:59:01 aurora kernel: [522967.098856] ffff88140dfc3fd8 ffff88140dfc3fd8 ffff88140dfc3fd8 00000000000139c0 Aug 17 17:59:01 aurora kernel: [522967.098872] ffff881f9318c500 ffff881f8fe9dc00 ffff881f8fe9dc00 ffff883f912d0a80 Aug 17 17:59:01 aurora kernel: [522967.098877] Call Trace: Aug 17 17:59:01 aurora kernel: [522967.098882] [<ffffffff8169bee9>] schedule+0x29/0x70 Aug 17 17:59:01 aurora kernel: [522967.098886] [<ffffffff81058f15>] exit_mm+0x85/0x130 Aug 17 17:59:01 aurora kernel: [522967.098889] [<ffffffff81059131>] do_exit+0x171/0x480 Aug 17 17:59:01 aurora kernel: [522967.098893] [<ffffffff81066cfa>] ? __dequeue_signal+0x6a/0xb0 Aug 17 17:59:01 aurora kernel: [522967.098897] [<ffffffff810595e4>] do_group_exit+0x44/0xa0 Aug 17 17:59:01 aurora kernel: [522967.098901] [<ffffffff810697fb>] get_signal_to_deliver+0x22b/0x440 Aug 17 17:59:01 aurora kernel: [522967.098905] [<ffffffff810147e9>] do_signal+0x29/0x130 Aug 17 17:59:01 aurora kernel: [522967.098909] [<ffffffff810affcc>] ? do_futex+0x7c/0x1b0 Aug 17 17:59:01 aurora kernel: [522967.098912] [<ffffffff810b0247>] ? sys_futex+0x147/0x1a0 Aug 17 17:59:01 aurora kernel: [522967.098917] [<ffffffff81188550>] ? vfs_write+0x110/0x180 Aug 17 17:59:01 aurora kernel: [522967.098921] [<ffffffff810149a0>] do_notify_resume+0x90/0xd0 Aug 17 17:59:01 aurora kernel: [522967.098925] [<ffffffff816a5922>] int_signal+0x12/0x17 Aug 17 18:00:32 aurora kernel: [523057.947100] brbond0-152: port 3(vnet2) entered disabled state Aug 17 18:00:32 aurora kernel: [523057.951419] brbond0-152: port 3(vnet2) entered disabled state Aug 17 18:00:32 aurora kernel: [523057.952182] device vnet2 left promiscuous mode Aug 17 18:00:32 aurora kernel: [523057.952186] brbond0-152: port 3(vnet2) entered disabled state Aug 17 18:00:34 aurora ntpd[5738]: Deleting interface #75 vnet2, fe80::fc89:aaff:fe00:5fa#123, interface stats: received=0, sent=0, dropped=0, active_time=1284 secs Aug 17 18:00:34 aurora ntpd[5738]: peers refreshed Aug 17 18:00:39 aurora kernel: [523064.894862] cloud0: port 3(vnet10) entered disabled state Aug 17 18:00:39 aurora kernel: [523064.899361] cloud0: port 3(vnet10) entered disabled state Aug 17 18:00:39 aurora kernel: [523064.900143] device vnet10 left promiscuous mode Aug 17 18:00:39 aurora kernel: [523064.900146] cloud0: port 3(vnet10) entered disabled state Aug 17 18:00:39 aurora kernel: [523064.957760] brbond0-153: port 2(vnet11) entered disabled state Aug 17 18:00:39 aurora kernel: [523064.958224] device vnet11 left promiscuous mode Aug 17 18:00:39 aurora kernel: [523064.958230] brbond0-153: port 2(vnet11) entered disabled state Aug 17 18:00:39 aurora kernel: [523065.013749] brbond0-819: port 2(vnet12) entered disabled state Aug 17 18:00:39 aurora kernel: [523065.014508] device vnet12 left promiscuous mode Aug 17 18:00:39 aurora kernel: [523065.014517] brbond0-819: port 2(vnet12) entered disabled state Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #44 vnet12, fe80::fc00:3bff:fe0e:2#123, interface stats: received=0, sent=0, dropped=0, active_time=118026 secs Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #43 vnet11, fe80::fc77:acff:fe00:257#123, interface stats: received=0, sent=0, dropped=0, active_time=118026 secs Aug 17 18:00:40 aurora ntpd[5738]: Deleting interface #42 vnet10, fe80::fc00:a9ff:fefe:9b#123, interface stats: received=0, sent=0, dropped=0, active_time=118026 secs Aug 17 18:00:40 aurora ntpd[5738]: peers refreshed Aug 17 18:02:41 aurora kernel: [523187.658419] brbond0-152: port 4(vnet3) entered disabled state Aug 17 18:02:41 aurora kernel: [523187.662509] brbond0-152: port 4(vnet3) entered disabled state Aug 17 18:02:41 aurora kernel: [523187.663595] device vnet3 left promiscuous mode Aug 17 18:02:41 aurora kernel: [523187.663601] brbond0-152: port 4(vnet3) entered disabled state Aug 17 18:02:43 aurora ntpd[5738]: Deleting interface #23 vnet3, fe80::fc41:7cff:fe00:61a#123, interface stats: received=0, sent=0, dropped=0, active_time=522318 secs Aug 17 18:02:43 aurora ntpd[5738]: peers refreshed Aug 17 18:04:49 aurora kernel: [523315.013106] brbond0-152: port 7(vnet6) entered disabled state Aug 17 18:04:49 aurora kernel: [523315.017106] brbond0-152: port 7(vnet6) entered disabled state Aug 17 18:04:49 aurora kernel: [523315.017876] device vnet6 left promiscuous mode Aug 17 18:04:49 aurora kernel: [523315.017883] brbond0-152: port 7(vnet6) entered disabled state Aug 17 18:04:50 aurora ntpd[5738]: Deleting interface #27 vnet6, fe80::fc5e:d0ff:fe00:5ee#123, interface stats: received=0, sent=0, dropped=0, active_time=522380 secs Aug 17 18:04:50 aurora ntpd[5738]: peers refreshed Aug 17 18:04:58 aurora kernel: [523324.135111] brbond0-851: port 2(vnet8) entered disabled state Aug 17 18:04:58 aurora kernel: [523324.138860] brbond0-851: port 2(vnet8) entered disabled state Aug 17 18:04:58 aurora kernel: [523324.139636] device vnet8 left promiscuous mode Aug 17 18:04:58 aurora kernel: [523324.139642] brbond0-851: port 2(vnet8) entered disabled state Aug 17 18:04:58 aurora kernel: [523324.173700] cloud0: port 2(vnet9) entered disabled state Aug 17 18:04:58 aurora kernel: [523324.174395] device vnet9 left promiscuous mode Aug 17 18:04:58 aurora kernel: [523324.174399] cloud0: port 2(vnet9) entered disabled state Aug 17 18:04:59 aurora ntpd[5738]: Deleting interface #41 vnet9, fe80::fc00:a9ff:fefe:1f6#123, interface stats: received=0, sent=0, dropped=0, active_time=118355 secs Aug 17 18:04:59 aurora ntpd[5738]: Deleting interface #40 vnet8, fe80::fcd7:7aff:fe00:6ea#123, interface stats: received=0, sent=0, dropped=0, active_time=118355 secs Aug 17 18:04:59 aurora ntpd[5738]: peers refreshed Aug 17 18:07:01 aurora kernel: [523447.242027] brbond0-152: port 5(vnet4) entered disabled state Aug 17 18:07:01 aurora kernel: [523447.246056] brbond0-152: port 5(vnet4) entered disabled state Aug 17 18:07:01 aurora kernel: [523447.246740] device vnet4 left promiscuous mode Aug 17 18:07:01 aurora kernel: [523447.246746] brbond0-152: port 5(vnet4) entered disabled state Aug 17 18:07:02 aurora ntpd[5738]: Deleting interface #25 vnet4, fe80::fcbb:a2ff:fe00:60d#123, interface stats: received=0, sent=0, dropped=0, active_time=522573 secs Aug 17 18:07:02 aurora ntpd[5738]: peers refreshed
root@aurora:~# cat /proc/net/bonding/management0 Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011) Bonding Mode: fault-tolerance (active-backup) Primary Slave: None Currently Active Slave: eth1 MII Status: up MII Polling Interval (ms): 100 Up Delay (ms): 0 Down Delay (ms): 0 Slave Interface: eth1 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: ac:16:2d:77:21:51 Slave queue ID: 0 Slave Interface: eth0 MII Status: up Speed: 1000 Mbps Duplex: full Link Failure Count: 0 Permanent HW addr: ac:16:2d:77:21:50 Slave queue ID: 0
