Re: [Pacemaker] issues when installing on pxe booted environment

Andrew Beekhof Wed, 27 Mar 2013 16:49:02 -0700

What about /dev/shm?
By default, libqb tries to create its shared memory in that location.
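
One way to check this on a PXE-booted node is sketched below. It is only a rough diagnostic, assuming a Linux system where /proc/mounts is readable; the function names check_shm_dir and mount_fstype are made up for illustration and are not part of libqb or Pacemaker. It tests whether the directory exists, whether the current user can create a file in it, and which filesystem type is mounted there.

```python
import os
import tempfile


def check_shm_dir(path="/dev/shm"):
    """Return a list of problems found with a shared-memory directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    try:
        # Creating (and automatically removing) a scratch file is the most
        # direct test of whether the current user can write here.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"cannot create files in {path}: {exc}")
    return problems


def mount_fstype(path="/dev/shm", mounts_file="/proc/mounts"):
    """Return the filesystem type mounted exactly at path, or None."""
    fstype = None
    try:
        with open(mounts_file) as fh:
            for line in fh:
                fields = line.split()
                # Fields are: device, mount point, filesystem type, options...
                if len(fields) >= 3 and fields[1] == path:
                    fstype = fields[2]  # a later entry overrides an earlier one
    except OSError:
        return None
    return fstype


if __name__ == "__main__":
    for problem in check_shm_dir():
        print("PROBLEM:", problem)
    print("filesystem type:", mount_fstype() or "not a separate mount")
```

The result depends on who runs it, so it is probably most useful when run as the user the daemons run as (hacluster in the listing quoted below) as well as root.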


On Thu, Mar 28, 2013 at 8:50 AM, John White <jwh...@lbl.gov> wrote:
> Yup:
> -bash-4.1$ cd /var/run/crm/
> -bash-4.1$ ls
> lost+found  pcmk  pengine  st_callback  st_command
> -bash-4.1$ touch blah
> -bash-4.1$ ls -l
> total 16
> -rw-r--r-- 1 hacluster haclient     0 Mar 27 14:50 blah
> drwx------ 2 root      root     16384 Mar 14 15:00 lost+found
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 pcmk
> srwxrwxrwx 1 hacluster root         0 Mar 22 11:25 pengine
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_callback
> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_command
> -bash-4.1$ ls -l /var/run/| grep crm
> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
> -bash-4.1$ whoami
> hacluster
> -bash-4.1$
> ----------------
> John White
> HPC Systems Engineer
> (510) 486-7307
> One Cyclotron Rd, MS: 50C-3209C
> Lawrence Berkeley National Lab
> Berkeley, CA 94720
>
> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andr...@hastexo.com> wrote:
>
>> On 2013-03-22 19:31, John White wrote:
>>> Hello Folks,
>>>      We're trying to get a corosync/pacemaker instance going on a 4-node
>>> cluster that boots via PXE.  There have been a number of state/file system
>>> issues, but those appear to be *mostly* taken care of so far.  We're now
>>> running into an issue where cib just isn't staying up, with errors akin
>>> to the following (sorry for the lengthy dump; note the attrd and cib
>>> connection errors).  Any ideas would be greatly appreciated:
>>>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating 
>>> RNG parser context
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
>>> /usr/lib64/heartbeat/attrd
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type 
>>> is: 'corosync'
>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: 
>>> Connecting to cluster infrastructure: corosync
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not 
>>> connect to the Cluster Process Group API: 2
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute 
>>> updates
>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
>>> /usr/lib64/heartbeat/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed 
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
>>> instances of pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
>>> init_client_ipc_comms_nodispatch: Attempting to talk on: 
>>> /var/run/crm/pengine
>>
>> Is the "/var/run/crm" directory available, owned by hacluster.haclient,
>> and writable by at least the hacluster user?
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
>>> process attrd exited (pid=25841, rc=100)
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
>>> process attrd no longer wishes to be respawned
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: 
>>> Node n0014.lustre now has process list: 00000000000000000000000000110312 
>>> (was 00000000000000000000000000111312)
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>> /var/run/crm/pengine
>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: 
>>> Adding fd=4 to mainloop
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: 
>>> Connection to 'corosync': established
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating 
>>> entry for node n0014.lustre/247988234
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
>>> n0014.lustre now has id: 247988234
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
>>> 247988234 is now known as n0014.lustre
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
>>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: 
>>> /usr/lib64/heartbeat/crmd
>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: 
>>> Channel 0x995530 connected: 1 children
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng 
>>> mainloop
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed 
>>> active directory to /var/lib/heartbeat/cores/hacluster
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
>>> a02c0f19a00c1eb2527ad38f146ebc0834814558
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing 
>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>> #011// A_LOG
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>> #011// A_STARTUP
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal 
>>> Handlers
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and 
>>> LRM objects
>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
>>> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 
>>> proc=00000000000000000000000000110312 (new)
>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added 
>>> signal handler for signal 17
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>> #011// A_CIB_START
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>> /var/run/crm/cib_rw
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>> Connection to command channel failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>> init_client_ipc_comms_nodispatch: Attempting to talk on: 
>>> /var/run/crm/cib_callback
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>> /var/run/crm/cib_callback
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>> Connection to callback channel failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>> Connection to CIB failed: connection failed
>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out 
>>> of the CIB Service
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate 
>>> content
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not 
>>> validate with <null>
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization 
>>> completed successfully
>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type 
>>> is: 'corosync'
>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting 
>>> to cluster infrastructure: corosync
>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not 
>>> connect to the Cluster Process Group API: 2
>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the 
>>> cluster... terminating
>>>
>>>
>>> ----------------
>>> John White
>>> HPC Systems Engineer
>>> (510) 486-7307
>>> One Cyclotron Rd, MS: 50C-3209C
>>> Lawrence Berkeley National Lab
>>> Berkeley, CA 94720
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>

