bump. Could someone please review the logs in the links below and tell me what the heck is going on with this cluster? I've never encountered anything like this before. Basically, corosync thinks the cluster is healthy but Pacemaker won't elect a DC!
--
Hi Andrew,

> would love to see the logs from ha09b

Below are links to a clean set of logs from nodes ha09a and ha09b. The procedure I followed to collect the logs was:

1. Ensure pacemakerd and corosync are stopped on both nodes.
2. Remove corosync.log on both nodes.
3. Start corosync on ha09a.
4. Wait approx. 15 seconds.
5. Start corosync on ha09b.
6. Wait approx. 15 seconds.
7. Use 'corosync-cfgtool -s' to verify rings active and without faults on both nodes.
8. Use 'corosync-objctl|grep member' on both nodes to verify cluster membership. (The nodeids look good to me.)

[root@ha09a log]# corosync-objctl |grep member
runtime.totem.pg.mrp.srp.members.973777088.ip=r(0) ip(192.168.10.58) r(1) ip(198.51.100.58)
runtime.totem.pg.mrp.srp.members.973777088.join_count=1
runtime.totem.pg.mrp.srp.members.973777088.status=joined
runtime.totem.pg.mrp.srp.members.990554304.ip=r(0) ip(192.168.10.59) r(1) ip(198.51.100.59)
runtime.totem.pg.mrp.srp.members.990554304.join_count=1
runtime.totem.pg.mrp.srp.members.990554304.status=joined

[root@ha09b log]# corosync-objctl |grep member
runtime.totem.pg.mrp.srp.members.990554304.ip=r(0) ip(192.168.10.59) r(1) ip(198.51.100.59)
runtime.totem.pg.mrp.srp.members.990554304.join_count=1
runtime.totem.pg.mrp.srp.members.990554304.status=joined
runtime.totem.pg.mrp.srp.members.973777088.ip=r(0) ip(192.168.10.58) r(1) ip(198.51.100.58)
runtime.totem.pg.mrp.srp.members.973777088.join_count=1
runtime.totem.pg.mrp.srp.members.973777088.status=joined

9. Start pacemaker on ha09a.
10. Wait approx. 15 seconds.
11. Start pacemaker on ha09b.
12. Wait approx. 15 seconds.

At this point, the corosync.log file on ha09a was already about 90MB in size and ha09b's was about 70MB. crm_mon showed that no DC had been elected.
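The membership check in step 8 was done by eye. For anyone repeating the procedure, a small script could do the same comparison. The following is only a sketch: the helper names `parse_members` and `views_agree` are made up for illustration, and it assumes the 'corosync-objctl | grep member' output has exactly the key layout shown in the listings above.

```python
import re

# Matches lines such as:
#   runtime.totem.pg.mrp.srp.members.973777088.status=joined
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+)=(.*)$"
)

def parse_members(output):
    """Return {nodeid: {key: value}} parsed from the grep output (hypothetical helper)."""
    members = {}
    for line in output.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match:
            nodeid, key, value = match.groups()
            members.setdefault(int(nodeid), {})[key] = value
    return members

def views_agree(output_a, output_b):
    """True if both nodes list the same node IDs and every member is 'joined'."""
    a, b = parse_members(output_a), parse_members(output_b)
    everyone = list(a.values()) + list(b.values())
    all_joined = all(m.get("status") == "joined" for m in everyone)
    return bool(a) and set(a) == set(b) and all_joined

# Output copied from the two listings above (ha09a first, then ha09b).
HA09A = """\
runtime.totem.pg.mrp.srp.members.973777088.ip=r(0) ip(192.168.10.58) r(1) ip(198.51.100.58)
runtime.totem.pg.mrp.srp.members.973777088.join_count=1
runtime.totem.pg.mrp.srp.members.973777088.status=joined
runtime.totem.pg.mrp.srp.members.990554304.ip=r(0) ip(192.168.10.59) r(1) ip(198.51.100.59)
runtime.totem.pg.mrp.srp.members.990554304.join_count=1
runtime.totem.pg.mrp.srp.members.990554304.status=joined
"""

HA09B = """\
runtime.totem.pg.mrp.srp.members.990554304.ip=r(0) ip(192.168.10.59) r(1) ip(198.51.100.59)
runtime.totem.pg.mrp.srp.members.990554304.join_count=1
runtime.totem.pg.mrp.srp.members.990554304.status=joined
runtime.totem.pg.mrp.srp.members.973777088.ip=r(0) ip(192.168.10.58) r(1) ip(198.51.100.58)
runtime.totem.pg.mrp.srp.members.973777088.join_count=1
runtime.totem.pg.mrp.srp.members.973777088.status=joined
"""

print(views_agree(HA09A, HA09B))
```

With the output pasted from both nodes, this prints True, which matches the manual reading: both nodes see the same two node IDs, each in the joined state. It only confirms the corosync membership view and says nothing about why Pacemaker fails to elect a DC.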
I tried to stop pacemaker, but it just does the following on both nodes and hangs forever:

[root@ha09a log]# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate: [ OK ]
Waiting for cluster services to unload:...............

So I had to reboot. Here are the logs from both nodes.

www.psmnv.com/downloads/corosync_ha09a.log
www.psmnv.com/downloads/corosync_ha09b.log

--
Eric Robinson
November 13, 2012
