On 09/10/2015 20:27, Gilou wrote:
> On 09/10/2015 20:14, Gilou wrote:
>> On 09/10/2015 18:36, Gilou wrote:
>>> On 09/10/2015 18:21, Dietmar Maurer wrote:
>>>>> So I tried again... HA doesn't work.
>>>>> Both resources are now frozen (?), and they didn't restart... even
>>>>> after 5 minutes:
>>>>> service vm:102 (pve1, freeze)
>>>>> service vm:303 (pve1, freeze)
>>>>
>>>> The question is why they are frozen. The only action which
>>>> puts them into 'freeze' is when you shut down a node.
>>>>
>>>
>>> I pulled the ethernet cables out of the node I wanted to fail when I
>>> tested. It didn't shut down. I plugged them back in 20 minutes later.
>>> They were down (so I guess the fencing worked). But still?
>>>
>>
>> OK, so I did a fresh reinstall of 3 nodes from the PVE 4 ISO; they use
>> a single NIC to talk to an NFS server and to each other. The cluster
>> is up, and one VM is protected:
>>
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct 9 19:55:06 2015)
>> lrm pve1 (active, Fri Oct 9 19:55:12 2015)
>> lrm pve2 (active, Fri Oct 9 19:55:07 2015)
>> lrm pve3 (active, Fri Oct 9 19:55:10 2015)
>> service vm:100 (pve2, started)
>>
>> # pvecm status
>> Quorum information
>> ------------------
>> Date:             Fri Oct 9 19:55:22 2015
>> Quorum provider:  corosync_votequorum
>> Nodes:            3
>> Node ID:          0x00000001
>> Ring ID:          12
>> Quorate:          Yes
>>
>> Votequorum information
>> ----------------------
>> Expected votes:   3
>> Highest expected: 3
>> Total votes:      3
>> Quorum:           2
>> Flags:            Quorate
>>
>> Membership information
>> ----------------------
>>     Nodeid      Votes Name
>> 0x00000002          1 192.168.44.129
>> 0x00000003          1 192.168.44.132
>> 0x00000001          1 192.168.44.143 (local)
>>
>> On one of the nodes (incidentally, the one running the HA VM) I
>> already get these:
>>
>> Oct 09 19:55:07 pve2 pve-ha-lrm[1211]: watchdog update failed - Broken pipe
>>
>> Not good.
>> I tried to migrate to pve1 to see what happens:
>>
>> Executing HA migrate for VM 100 to node pve1
>> unable to open file '/etc/pve/ha/crm_commands.tmp.3377' - No such file
>> or directory
>> TASK ERROR: command 'ha-manager migrate vm:100 pve1' failed: exit code 2
>>
>> OK... so we can't migrate running HA VMs? What did I get wrong here?
>> So I remove the VM from HA, migrate it to pve1, and see what happens.
>> It works. OK. I stop the VM and enable HA. It won't start:
>>
>> service vm:100 (pve1, freeze)
>>
>> OK. And now, on pve1:
>>
>> Oct 09 19:59:16 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
>>
>> OK... Let's try pve3: cold migrate without HA, then enable HA again...
>> Interesting, now we have:
>>
>> # ha-manager status
>> quorum OK
>> master pve1 (active, Fri Oct 9 20:09:46 2015)
>> lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
>> lrm pve2 (active, Fri Oct 9 20:09:47 2015)
>> lrm pve3 (active, Fri Oct 9 20:09:50 2015)
>> service vm:100 (pve3, started)
>>
>> Why is pve1 not reporting properly?
>>
>> And now on all 3 nodes:
>>
>> Oct 09 20:10:40 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:10:50 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>> Oct 09 20:11:00 pve3 pve-ha-lrm[1208]: watchdog update failed - Broken pipe
>>
>> WTF? omping reports that multicast is getting through, but I'm not
>> sure what the issue would be there... It worked on 3.4 on the same
>> physical setup. So?
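
The "watchdog update failed - Broken pipe" lines are what I want to dig
into first. As far as I understand it, pve-ha-lrm and pve-ha-crm push
their keep-alives through the watchdog-mux service, so a broken pipe
should mean the muxer died or its socket went away. A minimal sketch of
what I plan to check on each node (the socket path is my assumption, I
haven't verified it):

# systemctl status watchdog-mux
# journalctl -b -u watchdog-mux
# ls -l /run/watchdog-mux.sock
# systemctl restart pve-ha-lrm pve-ha-crm

If the muxer itself turns out to be fine, then the question becomes why
the LRM/CRM keep losing that connection.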
>
> Well, then I still tried to see some failover, so I unplugged pve3,
> which had the VM, and something happened:
>
> Oct 9 20:18:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'online' => 'unknown'
> Oct 9 20:19:16 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'started' to 'fence'
> Oct 9 20:19:16 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'unknown' => 'fence'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: successfully acquired lock
> 'ha_agent_pve3_lock'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: fencing: acknowleged - got agent
> lock for node 'pve3'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: node 'pve3': state changed from
> 'fence' => 'unknown'
> Oct 9 20:20:26 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'fence' to 'stopped'
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: watchdog update failed - Broken pipe
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'stopped' to 'started' (node = pve1)
> Oct 9 20:20:36 pve1 pve-ha-crm[1202]: service 'vm:100': state changed
> from 'started' to 'freeze'
>
> OK, frozen. Great.
>
> root@pve1:~# ha-manager status
> quorum OK
> master pve1 (active, Fri Oct 9 20:23:26 2015)
> lrm pve1 (old timestamp - dead?, Fri Oct 9 19:58:57 2015)
> lrm pve2 (active, Fri Oct 9 20:23:27 2015)
> lrm pve3 (old timestamp - dead?, Fri Oct 9 20:18:10 2015)
> service vm:100 (pve1, freeze)
>
> What to do?
> (Starting it manually doesn't work either... the only way is to pull it
> out of HA, and then it's the same circus all over again.)
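
For the record, here is a sketch of what that "pull it out of HA" dance
would look like on the CLI (assuming ha-manager's disable/enable can
stand in for removing and re-adding the resource, and that disable
actually clears the 'freeze' state - I haven't verified either):

# ha-manager disable vm:100
# qm start 100
# ha-manager enable vm:100
# ha-manager status

If that doesn't do it, "ha-manager remove vm:100" before the manual
start and "ha-manager add vm:100" afterwards is the heavier variant.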
As far as multicast goes:

% ansible -a "omping -m 239.192.6.92 -c 10000 -i 0.001 -F -q pve1 pve2 pve3" -f 3 -i 'pve1,pve2,pve3' all -u root
pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve1 : given amount of query messages was sent
pve2 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.084/0.145/0.652/0.029
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.085/0.149/0.666/0.030
pve2 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.086/0.147/0.300/0.029
pve2 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.087/0.151/0.301/0.029

pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve1 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve1 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.071/0.149/0.637/0.032
pve1 : multicast, xmt/rcv/%loss = 10000/10000/0%, min/avg/max/std-dev = 0.090/0.154/0.638/0.034
pve3 : unicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev = 0.087/0.149/0.947/0.033
pve3 : multicast, xmt/rcv/%loss = 8664/8664/0%, min/avg/max/std-dev = 0.092/0.154/0.948/0.033

pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : joined (S,G) = (*, 239.192.6.92), pinging
pve3 : waiting for response msg
pve3 : server told us to stop
pve2 : waiting for response msg
pve2 : server told us to stop
pve2 : unicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev = 0.080/0.149/0.312/0.030
pve2 : multicast, xmt/rcv/%loss = 8540/8540/0%, min/avg/max/std-dev = 0.091/0.153/0.325/0.031
pve3 : unicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev = 0.089/0.148/0.980/0.032
pve3 : multicast, xmt/rcv/%loss = 8141/8141/0%, min/avg/max/std-dev = 0.091/0.154/0.994/0.032

And for 10 mins...
% ansible -a "omping -c 600 -i 1 -q pve1 pve2 pve3" -f 3 -i 'pve1,pve2,pve3' all -u root
pve2 | success | rc=0 >>
pve1 : waiting for response msg
pve3 : waiting for response msg
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve3 : given amount of query messages was sent
pve1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.108/0.215/0.343/0.046
pve1 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.119/0.222/0.346/0.048
pve3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.221/0.355/0.049
pve3 : multicast, xmt/rcv/%loss = 600/599/0% (seq>=2 0%), min/avg/max/std-dev = 0.118/0.226/0.370/0.050

pve1 | success | rc=0 >>
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : waiting for response msg
pve3 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve3 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : given amount of query messages was sent
pve3 : given amount of query messages was sent
pve2 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.107/0.221/0.343/0.050
pve2 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.110/0.227/0.344/0.052
pve3 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.098/0.224/0.328/0.050
pve3 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.229/0.335/0.050

pve3 | success | rc=0 >>
pve1 : waiting for response msg
pve2 : waiting for response msg
pve1 : joined (S,G) = (*, 232.43.211.234), pinging
pve2 : waiting for response msg
pve2 : joined (S,G) = (*, 232.43.211.234), pinging
pve1 : given amount of query messages was sent
pve2 : waiting for response msg
pve2 : server told us to stop
pve1 : unicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.113/0.213/0.335/0.048
pve1 : multicast, xmt/rcv/%loss = 600/600/0%, min/avg/max/std-dev = 0.114/0.220/0.347/0.052
pve2 : unicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev = 0.111/0.210/0.320/0.048
pve2 : multicast, xmt/rcv/%loss = 599/599/0%, min/avg/max/std-dev = 0.115/0.216/0.332/0.049

I'm sad! And I'm leaving for the weekend. My lab should stay around for
a while, but this is not looking good :(

Cheers,
Gilles
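PS: when I'm back, the corosync side is next on my list. Just a sketch
of the checks I have in mind (no output yet, so take it for what it is):

# corosync-cfgtool -s
# corosync-quorumtool -s
# journalctl -b -u corosync -u pve-cluster

Mostly to see whether corosync or pmxcfs log anything around the times
the watchdog updates start failing.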