On 06/11/2013 11:04 AM, Stefan Schloesser wrote:
Hi,

I have a setup with 2 nodes, drbd, mysql and apache. Rather too often for my
liking (about once a month) one node is killed (fenced) by the other. Each time I am
unable to find out what actually caused this behaviour.
I can see in the logs that suddenly one node is fenced (STONITH), but no error
appears as to why this happens.
Each time I can simply start the node and corosync and everything works fine
again, i.e. no fault is apparent.

I already thought about auto-starting corosync, but that does not seem like a good
idea. I tried adjusting the communication params (totem) to no avail.

So my question is this: what's the best way to find the cause?

Stefan Schlösser

This sounds like a problem with the network. Do you see something like "token didn't arrive in time" (I'm guessing on the wording) on the surviving node?
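
A minimal sketch of how one might search the surviving node's log for such messages. The patterns below are guesses at the wording, not confirmed corosync output, and the function name is made up for illustration; adjust both to whatever your corosync version actually logs.

```python
import re

# Hypothetical patterns: the exact corosync wording is an assumption
# and may differ between versions.
TOKEN_PATTERNS = [
    r"token.*(lost|timeout|timed out|didn't arrive)",
    r"processor failed",
    r"forming new configuration",
]


def find_token_messages(lines, patterns=TOKEN_PATTERNS):
    """Return (line_number, line) pairs that match any pattern.

    `lines` is any iterable of strings, e.g. an open log file.
    Matching is case-insensitive; line numbers start at 1.
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(rx.search(line) for rx in compiled):
            hits.append((number, line.rstrip("\n")))
    return hits
```

You would pass it an open log file from the surviving node (the path depends on your distribution and logging setup) and look at the hits just before the time of the fence.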

You might want to check that you have persistent multicast groups set in your switch(es). You might also want to set up bonding on the corosync interface (Active/Passive is best) and/or the redundant ring protocol.

It might be that something on the failed node tried to log but was fenced before the buffer wrote out to the logs?

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
