Parshvi <parshvi.17@...> writes:

> Hi,
> We are upgrading to Pacemaker 1.1.7 and Corosync 1.4.3.
> The previous versions were:
> Pacemaker: 1.0.12
> Corosync : 1.2.7
> The issues faced with the older versions are:
> 1) Numerous policy engine and crmd crashes, which prevent failed cluster
> resources from recovering.
> 2) Pacemaker logs show the FSM in a pending state; the service comes in sync
> only after a restart.
>
> Environment:
> 1) OS: OEL 5.8
> RPMs (packages) for Pacemaker 1.1.7, Corosync 1.4.3 and other dependent pkgs
> are not available for OEL 5.8. Hence, we have built all pkgs from source
> (github).
>
> We have a two-node cluster. We have installed the built binaries on both
> cluster nodes. crm_mon shows both nodes as online. All processes of corosync
> and pacemaker appear started and running.
>
> Issues faced:
> We have another setup, consisting of two nodes in the cluster (same as above).
> Pkg binaries have been installed on both the nodes.
> One of the nodes appears UNCLEAN (offline) and the other node appears
> (offline). The crmd process continuously respawns until its max respawn count
> is reached. DC appears as NONE in crm_mon.
>
> I have checked selinux and the firewall on the nodes (both are disabled).
>
> I have an hb_report of the nodes. I can share it if needed.
>
> I also created another cluster of 2 nodes: one node was from the WORKING
> cluster and the other node was from the NON_WORKING cluster.
> A dump of the o/p of crm_mon of such a cluster is:
>
> Last updated: Sat Nov 17 19:53:37 2012
> Last change: Sat Nov 17 19:53:27 2012 via crmd on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Node node-122: UNCLEAN (offline)
> Online: [ node-112 ]
>
> After some time the UNCLEAN (offline) node appears offline:
>
> Last updated: Sat Nov 17 20:26:48 2012
> Last change: Sat Nov 17 20:15:38 2012 via cibadmin on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ node-112 ]
> OFFLINE: [ node-122 ]
>
> I would request the owners to please respond with some input. The old version
> is a concern in our production environment.
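For comparing the crm_mon view quoted above with what corosync itself reports, the membership dump that follows could be summarized with a short script. This is only a rough sketch: it assumes the `runtime.totem.pg.mrp.srp.members.<nodeid>.<field>=<value>` key layout seen in my own `corosync-objctl | grep member` output, and the names `parse_members` and `not_joined` are made up for illustration, not part of any corosync or pacemaker tool.

```python
import re

# Matches lines such as
#   runtime.totem.pg.mrp.srp.members.1887545536.status=joined
# (key layout assumed from the corosync-objctl output in this mail)
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+)=(.*)$"
)

# Sample input copied from the dump in this mail.
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined
"""


def parse_members(text):
    """Return {node_id: {field: value}} for every member line in text."""
    members = {}
    for line in text.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match:
            node_id, field, value = match.groups()
            members.setdefault(int(node_id), {})[field] = value.strip()
    return members


def not_joined(members):
    """Sorted node IDs whose totem status is anything other than 'joined'."""
    return sorted(
        node_id
        for node_id, fields in members.items()
        if fields.get("status") != "joined"
    )


members = parse_members(SAMPLE)
for node_id, fields in sorted(members.items()):
    print(node_id, fields.get("ip"), fields.get("status"))
print("not joined:", not_joined(members))
```

With real data, the text passed to `parse_members` would come from the command output rather than `SAMPLE`. An empty "not joined" list while crm_mon still shows a peer as UNCLEAN (offline) would suggest the problem sits above the totem membership layer rather than in it.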
A dump of the following commands on the node appearing UNCLEAN (offline) is:

corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined

corosync-cfgtool -s
Printing ring status.
Local node ID 2055317696
RING ID 0
        id      = 192.168.100.122
        status  = ring 0 active with no faults

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org