Hello folks,

We're trying to get a corosync/pacemaker instance going on a 4-node cluster that boots via PXE. There have been a number of state/file system issues, but those appear to be *mostly* taken care of so far. We're now running into an issue where cib just won't stay up, with errors like the following (sorry for the lengthy dump; note the attrd and cib connection errors). Any ideas would be greatly appreciated:
Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating

----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org