Hello folks,

We're trying to get a corosync/pacemaker instance running on a 4-node
cluster that boots via PXE.  There have been a number of state and file
system issues, but those appear to be *mostly* taken care of so far.  We're
now running into an issue where cib won't stay up, with errors like the
following (sorry for the lengthy dump; note the attrd and cib connection
errors).  Any ideas would be greatly appreciated:

Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG 
parser context
Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd 
Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active 
directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 
'corosync'
Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting 
to cluster infrastructure: corosync
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
/usr/lib64/heartbeat/pengine 
Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
active directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances 
of pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
process attrd exited (pid=25841, rc=100)
Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
process attrd no longer wishes to be respawned
Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node 
n0014.lustre now has process list: 00000000000000000000000000110312 (was 
00000000000000000000000000111312)
Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding 
fd=4 to mainloop
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: 
Connection to 'corosync': established
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry 
for node n0014.lustre/247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
n0014.lustre now has id: 247988234
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 
is now known as n0014.lustre
Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd 
Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 
0x995530 connected: 1 children
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng 
mainloop
Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active 
directory to /var/lib/heartbeat/cores/hacluster
Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
a02c0f19a00c1eb2527ad38f146ebc0834814558
Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ 
state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
#011// A_LOG   
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
#011// A_STARTUP
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal 
Handlers
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM 
objects
Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 
proc=00000000000000000000000000110312 (new)
Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added 
signal handler for signal 17
Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
#011// A_CIB_START
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
Attempting to talk on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
Could not init comms on: /var/run/crm/cib_rw
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
to command channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
Attempting to talk on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: 
Could not init comms on: /var/run/crm/cib_callback
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
to callback channel failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection 
to CIB failed: connection failed
Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of 
the CIB Service
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate 
content
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not 
validate with <null>
Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization 
completed successfully
Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 
'corosync'
Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to 
cluster infrastructure: corosync
Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not 
connect to the Cluster Process Group API: 2
Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the 
cluster... terminating
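
For anyone who would rather skim, the ERROR and CRIT entries in a dump like
the one above can be pulled out with a small script. This is only a rough
sketch: the names join_wrapped and extract_errors are made up for
illustration (they are not part of pacemaker or corosync), and the regular
expressions assume the syslog layout shown above, including the line
wrapping added by the mail client.

```python
import re

# Start of a syslog entry as it appears in the dump,
# e.g. "Mar 22 11:25:18 n0014 cib: [25839]: ERROR: ..."
ENTRY_START = re.compile(r"^[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2} ")

# Assumed layout: timestamp, host, daemon, [pid], level, message.
ENTRY = re.compile(
    r"^\S+ +\d+ [\d:]+ (?P<host>\S+) (?P<daemon>[\w-]+): "
    r"\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<msg>.*)$"
)


def join_wrapped(text):
    """Re-join entries that were wrapped onto several lines by the mailer."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new log entry begins here
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def extract_errors(text, levels=("ERROR", "CRIT")):
    """Return (daemon, pid, level, message) for entries at the given levels."""
    hits = []
    for entry in join_wrapped(text):
        m = ENTRY.match(entry)
        if m and m.group("level") in levels:
            hits.append(
                (m.group("daemon"), int(m.group("pid")),
                 m.group("level"), m.group("msg"))
            )
    return hits


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python errors.py < dump.txt
    for daemon, pid, level, msg in extract_errors(sys.stdin.read()):
        print(f"{level:5} {daemon}[{pid}]: {msg}")
```

Run against the dump above, this should leave only the attrd, pacemakerd and
cib failures, which makes it easier to see that both attrd and cib fail at
the same init_cpg_connection step.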


----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720


