Seems like bug http://bugs.clusterlabs.org/show_bug.cgi?id=5040 and an earlier thread: http://thread.gmane.org/gmane.linux.highavailability.pacemaker/13185/focus=13321

According to that bug, 1.4.3 may have solved it, but the bug is still open and has a comment from Andrew Beekhof saying he'd reproduced it again on 4/18. From the thread, pacemaker 1.1.7 with a commit from Andrew might help, but he still sees some of the behavior.

OS: CentOS 5.7 x86_64
pacemaker 1.1.6
glue: 1.0.9
corosync 1.4.2
 - all RPMs were built from source and stored locally for deployment.

Nodes: omc1 and omc2, both virtual machines running CentOS 5.7.

Resources: mainly a floating IP, mysql, and httpd, along with a few custom services; it seemed like a simple setup. No shared storage.
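
To give a sense of the shape of the configuration (names and parameters simplified from memory here, not the exact production CIB), it's roughly:

    crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
    crm configure primitive p_mysql ocf:heartbeat:mysql op monitor interval=30s
    crm configure primitive p_httpd ocf:heartbeat:apache \
        params configfile=/etc/httpd/conf/httpd.conf op monitor interval=30s
    # grouped so the IP, database, and web server fail over together
    crm configure group g_services p_vip p_mysql p_httpd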

This seems like a pretty critical bug. I've not been able to reproduce it in the lab (of course not), but my production cluster is running on a single cylinder. I do have logs from the event that seemed to trigger it, if they'd help (preference: pastebin, or here on the list?). I've tried to collect them with crm_report, but the archives it creates always come out empty. I'm currently building and testing 1.4.3, but since I can't reproduce the problem, I'm not feeling confident about the prospects.
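
For reference, this is roughly the crm_report invocation I've been trying (the time window here is just an example bracketing the event, and the last argument is the base name of the tarball it should create):

    crm_report -f "2012-04-18 00:00:00" -t "2012-04-19 00:00:00" /tmp/omc-failover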

Secondly, is there a recommended process to bring the messed-up node back into the cluster? I've probably horked it beyond recognition with shutdowns, crm commands, removing the crm configs, and editing/replacing the CIB with cibadmin based on other threads and advice. I currently have the pacemaker and corosync services shut off on that node; it's too terrifying to contemplate it killing my active node by interacting with it.
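
Unless there's a better-recommended procedure, what I was planning to try on the broken node is the usual wipe-the-local-CIB-and-let-it-resync approach (assuming the local CIB lives under /var/lib/heartbeat/crm on this 1.1.6 build):

    # on the broken node only; leave the active node alone
    service pacemaker stop
    service corosync stop
    # remove the stale local CIB so the node pulls a fresh copy from the DC on rejoin
    rm -f /var/lib/heartbeat/crm/cib*
    service corosync start
    service pacemaker start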

Let me know what info would help...

Thanks,

Brent

