Hi all,

It appears as though this is the problem. The /etc/hosts file specifies per-interface hostnames, e.g.:

192.168.200.170 r4-eth1

This explains the unexpected hostname that appears to be causing the problem. I have used a nodelist to specify the nodes of the cluster, their ids and their names. This seems to have resolved the problem, although I haven't yet been able to do enough definitive testing.

The "nodelist" feature is entirely undocumented; a look at the source code confirmed that there is in fact a "name" field that is looked for in the config. When will the documentation be updated?

I understand that the logs were displaying warning signs of something being wrong with the configuration, but they weren't really enough to trace the problem to its source. Maybe this could be looked into?

Regards,
James

On 10/30/2012 01:03 PM, Michael Schwartzkopff wrote:
>> Hi Michael,
>>
>> I have managed to successfully configure corosync with udpu, but it
>> unfortunately hasn't made a difference in the behaviour of the cluster.
>>
>> I have found that I don't even need to restart the host in order to get
>> this behaviour - all I need to do is stop and restart corosync and
>> pacemaker on *one* of the hosts. To be precise, I've been able to narrow
>> it down to only one of the two hosts (r3). If I reboot the host, or
>> restart the services on r4, everything works fine. If I try the same with
>> r3, I have problems.
>>
>> I feel as though the answer may lie in the logfiles, but the
>> intercommunication between the individual components of the HA software
>> makes it a bit difficult to accurately read the logfiles as an outsider
>> to this software. I have attached the logs of both r3 and r4 after
>> reproducing this effect this afternoon; they are much shorter to read
>> than those sent previously:
>>
>> corosync-r3.log: http://pastebin.com/ZAhh5nax
>> corosync-r4.log: http://pastebin.com/SETtqnZM
>>
>> Are there any other steps I could take in debugging this behaviour?
>>
>> Regards,
>> James
>
> Hi,
>
> I think you have a problem in the naming of your cluster nodes.
> In the first log it learns the name from DNS:
>
> Oct 29 13:41:14 [21723] r3 crmd: notice: corosync_node_name:
> Inferred node name 'r4-eth1' for nodeid 2 from DNS
>
> If that does not fit the name of the node, it might cause the problems.
>
> Greetings,

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
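[For the archives: a minimal corosync.conf nodelist section of the kind described above might look like the sketch below. The nodeid 2 / r4 pairing comes from the log line quoted in this thread; r3's address and nodeid 1 are assumptions, since only 192.168.200.170 appears in the thread.]

```
# Explicit node list so corosync/pacemaker use these names
# instead of inferring per-interface hostnames from DNS.
nodelist {
    node {
        ring0_addr: 192.168.200.169   # assumed address for r3
        nodeid: 1                     # assumed id for r3
        name: r3
    }
    node {
        ring0_addr: 192.168.200.170   # from the /etc/hosts entry above
        nodeid: 2                     # nodeid 2 per the quoted log line
        name: r4
    }
}
```

With an explicit "name" set per node, crmd no longer needs to fall back to reverse DNS, which is what produced the 'r4-eth1' name in the first place.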
