On 06/27/2012 12:14 AM, Marcus Bointon wrote:
> 
> On 26 Jun 2012, at 22:18, Andreas Kurz wrote:
> 
>> Use STONITH to prevent resources from running on both nodes. Have you
>> configured redundant cluster communication paths?
> 
> The nodes in question are Linode VMs, so not much opportunity for that.
> 
>> With Heartbeat you can use the "cl_status" command with its various
>> options to check Heartbeat's view of the cluster, and Heartbeat's log
>> messages from the split-brain event should also give you some hints.
> 
> cl_status just confirms that each node thinks the other is dead.
> 
> OK, I see two things happening in the logs. At one point proxy2 reported a
> slow heartbeat (20 sec; deadtime was set to 15 sec) but seemed to reconnect.
> 
> Later on, both nodes reported each other as dead within the same second:
> 
> Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is dead
> Jun 25 10:14:16 proxy1 heartbeat: [2678]: info: Link proxy2.example.com:eth0 dead.
> Jun 25 10:14:16 proxy1 crmd: [3205]: notice: crmd_ha_status_callback: Status update: Node proxy2.example.com now has status [dead]

Looks like a network problem, yes.
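
If you want to confirm from the logs that both nodes declared each other
dead in the same second, a minimal Python sketch along these lines could
help. It assumes the syslog lines from both nodes have been collected and
that each entry is on a single line (unwrapped). The names DEAD_RE, short()
and mutual_dead_events() are made up for this illustration, and the regular
expression is only modelled on the "WARN: node ... is dead" line quoted
above, so adjust it to your actual log format.

```python
import re
from collections import defaultdict

# Modelled on the line quoted in this thread, e.g.:
#   Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is dead
DEAD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<reporter>\S+)\s+heartbeat:"
    r".*WARN: node (?P<peer>\S+): is dead"
)


def short(name):
    """Reduce 'proxy2.example.com' to 'proxy2' so both name forms compare equal."""
    return name.split(".", 1)[0]


def mutual_dead_events(lines):
    """Return (timestamp, node_a, node_b) for every second in which
    node_a reported node_b dead and node_b reported node_a dead."""
    seen = defaultdict(set)  # timestamp -> {(reporter, peer), ...}
    for line in lines:
        m = DEAD_RE.search(line)
        if m:
            seen[m["ts"]].add((short(m["reporter"]), short(m["peer"])))

    events = []
    for ts, pairs in seen.items():
        for a, b in pairs:
            # a < b reports each pair once, not once per direction
            if a < b and (b, a) in pairs:
                events.append((ts, a, b))
    return sorted(events)
```

Feed it the combined log lines from proxy1 and proxy2. A non-empty result
means each side lost sight of the other at the same moment, which fits a
network interruption better than a single node failing.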

> 
> As I understand it, STONITH is intended to prevent a node rejoining in case 
> it causes more trouble. In this case the individual nodes were fine, it 
> appeared to be the network that was at fault. Why wouldn't these nodes 
> automatically reconnect, given that there is no STONITH to prevent them? How 
> should I tell them to reconnect manually?
> 

STONITH is there to make sure a node is really dead before another node
acquires its resources. Without STONITH, and with quorum ignored, each
node simply assumes the other is gone and carries on.

If the network is working as expected again, Heartbeat should reconnect
automatically ... if not, restart Heartbeat if you are confident the
network problem is solved.
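
Since failing DNS lookups were one of the symptoms mentioned in this
thread, one rough way to gain some confidence before restarting is to check
that name resolution works again on each node. The following is only a
sketch: unresolved_hosts() is a hypothetical helper, the host names you
pass in are up to you, and a successful lookup does not prove that the
heartbeat link itself is healthy.

```python
import socket


def unresolved_hosts(hosts, resolve=socket.getaddrinfo):
    """Return the subset of hosts whose names cannot currently be resolved.

    'resolve' defaults to socket.getaddrinfo and can be swapped out for
    testing; it is called as resolve(host, None).
    """
    failed = []
    for host in hosts:
        try:
            resolve(host, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed
```

An empty list from, say, unresolved_hosts(["proxy1.example.com",
"proxy2.example.com"]) on both nodes is a reasonable hint that the wider
network issue has cleared; if names still fail to resolve, restarting
Heartbeat is unlikely to help yet.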

Regards,
Andreas

> I can also see that it failed to send alerts from the email resources at the
> same time because DNS lookups were failing; it all points to a wider network
> issue.
> 
> I wonder if Linode has micro-outages on their network since we've also been 
> seeing some problems with mmm reporting 'network unreachable' on some other 
> instances at the same time.
> 
> Marcus
> 



-- 
Need help with Pacemaker?
http://www.hastexo.com/now


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
