Ah, /dev/shm was owned by root:root and writable only by its owner.  Opening 
up its permissions seems to have fixed things.  Thanks, folks.
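
For anyone who hits the same thing, here is a minimal sketch of a check for 
the /dev/shm mode.  The helper name check_shm_mode is made up for 
illustration, it relies on GNU stat, and it assumes the usual Linux default 
of mode 1777 (world-writable with the sticky bit) is what you want; the 
thread does not record the exact mode that was applied.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): report whether a directory
# carries mode 1777, the usual default for /dev/shm on Linux.
check_shm_mode() {
    dir="${1:-/dev/shm}"
    # GNU stat: %a prints the octal mode, %U:%G the owner and group.
    mode=$(stat -c '%a' "$dir") || return 2
    owner=$(stat -c '%U:%G' "$dir")
    if [ "$mode" = "1777" ]; then
        echo "ok: $dir is $owner mode $mode"
        return 0
    fi
    echo "warn: $dir is $owner mode $mode (expected 1777)"
    return 1
}
```

Run check_shm_mode with no argument to test /dev/shm itself.  If it warns, 
running "chmod 1777 /dev/shm" as root restores the usual default.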
----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

On Apr 11, 2013, at 1:37 PM, John White <jwh...@lbl.gov> wrote:

> Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).
> 
> On Mar 27, 2013, at 4:46 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
>> What about /dev/shm ?
>> Libqb tries to create some shared memory in that location by default.
>> 
>> On Thu, Mar 28, 2013 at 8:50 AM, John White <jwh...@lbl.gov> wrote:
>>> Yup:
>>> -bash-4.1$ cd /var/run/crm/
>>> -bash-4.1$ ls
>>> lost+found  pcmk  pengine  st_callback  st_command
>>> -bash-4.1$ touch blah
>>> -bash-4.1$ ls -l
>>> total 16
>>> -rw-r--r-- 1 hacluster haclient     0 Mar 27 14:50 blah
>>> drwx------ 2 root      root     16384 Mar 14 15:00 lost+found
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 pcmk
>>> srwxrwxrwx 1 hacluster root         0 Mar 22 11:25 pengine
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_callback
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_command
>>> -bash-4.1$ ls -l /var/run/| grep crm
>>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>>> -bash-4.1$ whoami
>>> hacluster
>>> -bash-4.1$
>>> 
>>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andr...@hastexo.com> wrote:
>>> 
>>>> On 2013-03-22 19:31, John White wrote:
>>>>> Hello Folks,
>>>>>    We're trying to get a corosync/pacemaker instance going on a 4 node 
>>>>> cluster that boots via pxe.  There have been a number of state/file 
>>>>> system issues, but those appear to be *mostly* taken care of thus far.  
>>>>> We're running into an issue now where cib just isn't staying up with 
>>>>> errors akin to the following (sorry for the lengthy dump, note the attrd 
>>>>> and cib connection errors).  Any ideas would be greatly appreciated:
>>>>> 
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating 
>>>>> RNG parser context
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: 
>>>>> /usr/lib64/heartbeat/attrd
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed 
>>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster 
>>>>> type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: 
>>>>> Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could 
>>>>> not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection 
>>>>> active
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute 
>>>>> updates
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: 
>>>>> /usr/lib64/heartbeat/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: 
>>>>> Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old 
>>>>> instances of pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: 
>>>>> /var/run/crm/pengine
>>>> 
>>>> That "/var/run/crm" directory is available and owned by
>>>> hacluster.haclient ... and writable by at least the hacluster user?
>>>> 
>>>> Regards,
>>>> Andreas
>>>> 
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>> 
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child 
>>>>> process attrd exited (pid=25841, rc=100)
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child 
>>>>> process attrd no longer wishes to be respawned
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: 
>>>>> Node n0014.lustre now has process list: 00000000000000000000000000110312 
>>>>> (was 00000000000000000000000000111312)
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>>>> /var/run/crm/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: 
>>>>> Adding fd=4 to mainloop
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: 
>>>>> init_ais_connection_once: Connection to 'corosync': established
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating 
>>>>> entry for node n0014.lustre/247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
>>>>> n0014.lustre now has id: 247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 
>>>>> 247988234 is now known as n0014.lustre
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: 
>>>>> /usr/lib64/heartbeat/crmd
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: 
>>>>> Channel 0x995530 connected: 1 children
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting 
>>>>> stonith-ng mainloop
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed 
>>>>> active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: 
>>>>> a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing 
>>>>> I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>>>> #011// A_LOG
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>>>> #011// A_STARTUP
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering 
>>>>> Signal Handlers
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and 
>>>>> LRM objects
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node 
>>>>> n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 
>>>>> seen=0 proc=00000000000000000000000000110312 (new)
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: 
>>>>> Added signal handler for signal 17
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: 
>>>>> #011// A_CIB_START
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: 
>>>>> /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>>>> /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>>>> Connection to command channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Attempting to talk on: 
>>>>> /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: 
>>>>> init_client_ipc_comms_nodispatch: Could not init comms on: 
>>>>> /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>>>> Connection to callback channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: 
>>>>> Connection to CIB failed: connection failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing 
>>>>> out of the CIB Service
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate 
>>>>> content
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not 
>>>>> validate with <null>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization 
>>>>> completed successfully
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type 
>>>>> is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: 
>>>>> Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not 
>>>>> connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the 
>>>>> cluster... terminating
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>> 