Hi all,

It appears as though this is the problem. The /etc/hosts file specifies per-interface hostnames, e.g.:

192.168.200.170 r4-eth1

This explains the unexpected hostname that appears to be causing the problem. I have used a nodelist to specify the nodes of the cluster, their ids and their names. This seems to have resolved the problem, although I haven't yet been able to do enough definitive testing.

The "nodelist" feature is entirely undocumented; a look at the source code confirmed that there is in fact a "name" field that is looked for in the config. When will the documentation be updated?

I understand that the logs were displaying warning signs of something being wrong with the configuration, but they weren't really enough to trace the problem to its source. Maybe this could be looked into?

Regards,
James

On 10/30/2012 01:03 PM, Michael Schwartzkopff wrote:
>> Hi Michael,
>>
>> I have managed to successfully configure corosync with udpu, but it
>> unfortunately hasn't made a difference in the behaviour of the cluster.
>>
>> I have found that I don't even need to restart the host in order to get
>> this behaviour - all I need to do is stop and restart corosync and
>> pacemaker on *one* of the hosts. To be precise, I've been able to narrow
>> it down to only one of the two hosts (r3). If I reboot the host, or
>> restart the services on r4, everything works fine. If I try the same with
>> r3, I have problems.
>>
>> I feel as though the answer may lie in the logfiles, but the
>> intercommunication between the individual components of the HA software
>> makes it a bit difficult to accurately read the logfiles as an outsider
>> to this software. I have attached the logs of both r3 and r4 after
>> reproducing this effect this afternoon; they are much shorter to read
>> than those sent previously:
>>
>> corosync-r3.log: http://pastebin.com/ZAhh5nax
>> corosync-r4.log: http://pastebin.com/SETtqnZM
>>
>> Are there any other steps I could take in debugging this behaviour?
>>
>> Regards,
>> James
>
> Hi,
>
> I think you have a problem in the naming of your cluster nodes.
> In the first log it learns the name from DNS:
>
> Oct 29 13:41:14 [21723] r3 crmd: notice: corosync_node_name:
> Inferred node name 'r4-eth1' for nodeid 2 from DNS
>
> If that does not fit the name of the node, it might cause the problems.
>
> Greetings,

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
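[For the archives: a minimal corosync.conf nodelist section of the kind described above might look like the sketch below. The nodeid 2 / r4 pairing comes from the log line quoted in this thread; r3's address and nodeid 1 are assumptions, since only 192.168.200.170 appears in the thread.]

```
# Explicit node list so corosync/pacemaker use these names
# instead of inferring per-interface hostnames from DNS.
nodelist {
    node {
        ring0_addr: 192.168.200.169   # assumed address for r3
        nodeid: 1                     # assumed id for r3
        name: r3
    }
    node {
        ring0_addr: 192.168.200.170   # from the /etc/hosts entry above
        nodeid: 2                     # nodeid 2 per the quoted log line
        name: r4
    }
}
```

With an explicit "name" set per node, crmd no longer needs to fall back to reverse DNS, which is what produced the 'r4-eth1' name in the first place.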
