Ah, /dev/shm was owned by root:root and writable only by its owner. Opening it up seems to have kicked something the right way. Thanks, folks.

----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

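For anyone landing on this thread with the same symptom, here is a minimal sketch of a permission check for /dev/shm. It is not part of Pacemaker, corosync, or libqb; the helper names `shm_mode_problems` and `report` are made up for illustration. It assumes the usual convention that /dev/shm is a world-writable directory with the sticky bit set (mode 1777), so that daemons running as a non-root user such as hacluster can create shared memory there.

```python
import os
import stat


def shm_mode_problems(mode):
    """Return reasons why a /dev/shm-style mode would block non-root users.

    `mode` is an st_mode value as returned by os.stat(). Hypothetical helper.
    """
    problems = []
    if not stat.S_ISDIR(mode):
        problems.append("not a directory")
    # Non-root daemons need write and search (execute) permission for "other".
    if not (mode & stat.S_IWOTH and mode & stat.S_IXOTH):
        problems.append("not writable/searchable by other users")
    # The sticky bit keeps users from deleting each other's files.
    if not mode & stat.S_ISVTX:
        problems.append("sticky bit not set")
    return problems


def report(path="/dev/shm"):
    """Print the mode and owner of `path`, plus any problems found."""
    st = os.stat(path)
    print("%s: mode=%s uid=%d gid=%d"
          % (path, stat.filemode(st.st_mode), st.st_uid, st.st_gid))
    for problem in shm_mode_problems(st.st_mode):
        print("  warning: %s" % problem)


if __name__ == "__main__" and os.path.exists("/dev/shm"):
    report()
```

Run it as the user the cluster daemons run as (hacluster in this thread). On a node where /dev/shm is writable only by root, both the "other users" warning and the sticky-bit warning should appear.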
On Apr 11, 2013, at 1:37 PM, John White <jwh...@lbl.gov> wrote:

> Yep, we've definitely got /dev/shm (this was done to fix an earlier problem).
>
> On Mar 27, 2013, at 4:46 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>
>> What about /dev/shm ?
>> Libqb tries to create some shared memory in that location by default.
>>
>> On Thu, Mar 28, 2013 at 8:50 AM, John White <jwh...@lbl.gov> wrote:
>>> Yup:
>>> -bash-4.1$ cd /var/run/crm/
>>> -bash-4.1$ ls
>>> lost+found  pcmk  pengine  st_callback  st_command
>>> -bash-4.1$ touch blah
>>> -bash-4.1$ ls -l
>>> total 16
>>> -rw-r--r-- 1 hacluster haclient     0 Mar 27 14:50 blah
>>> drwx------ 2 root      root     16384 Mar 14 15:00 lost+found
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 pcmk
>>> srwxrwxrwx 1 hacluster root         0 Mar 22 11:25 pengine
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_callback
>>> srwxrwxrwx 1 root      root         0 Mar 22 11:25 st_command
>>> -bash-4.1$ ls -l /var/run/| grep crm
>>> drwxr-xr-x 3 hacluster haclient 4096 Mar 27 14:50 crm
>>> -bash-4.1$ whoami
>>> hacluster
>>> -bash-4.1$
>>>
>>> On Mar 25, 2013, at 4:21 PM, Andreas Kurz <andr...@hastexo.com> wrote:
>>>
>>>> On 2013-03-22 19:31, John White wrote:
>>>>> Hello Folks,
>>>>> We're trying to get a corosync/pacemaker instance going on a 4 node
>>>>> cluster that boots via pxe. There have been a number of state/file
>>>>> system issues, but those appear to be *mostly* taken care of thus far.
>>>>> We're running into an issue now where cib just isn't staying up with
>>>>> errors akin to the following (sorry for the lengthy dump, note the attrd
>>>>> and cib connection errors). Any ideas would be greatly appreciated:
>>>>>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: validate_with_relaxng: Creating RNG parser context
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: Invoked: /usr/lib64/heartbeat/attrd
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Starting up
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: get_cluster_type: Cluster type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: HA Signon failed
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Cluster connection active
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: info: main: Accepting attribute updates
>>>>> Mar 22 11:25:18 n0014 attrd: [25841]: ERROR: main: Aborting startup
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: Invoked: /usr/lib64/heartbeat/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Checking for old instances of pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pengine
>>>>
>>>> That "/var/run/crm" directory is available and owned by
>>>> hacluster.haclient ... and writable by at least the hacluster user?
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: ERROR: pcmk_child_exit: Child process attrd exited (pid=25841, rc=100)
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: notice: pcmk_child_exit: Child process attrd no longer wishes to be respawned
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: info: update_node_processes: Node n0014.lustre now has process list: 00000000000000000000000000110312 (was 00000000000000000000000000111312)
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/pengine
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: debug: main: Init server comms
>>>>> Mar 22 11:25:18 n0014 pengine: [25842]: info: main: Starting pengine
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_cpg_connection: Adding fd=4 to mainloop
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: init_ais_connection_once: Connection to 'corosync': established
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: crm_new_peer: Creating entry for node n0014.lustre/247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node n0014.lustre now has id: 247988234
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_new_peer: Node 247988234 is now known as n0014.lustre
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/pcmk
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: Invoked: /usr/lib64/heartbeat/crmd
>>>>> Mar 22 11:25:18 n0014 pacemakerd: [25834]: debug: pcmk_client_connect: Channel 0x995530 connected: 1 children
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: main: Starting stonith-ng mainloop
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/hacluster
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: main: CRM Hg Version: a02c0f19a00c1eb2527ad38f146ebc0834814558
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: crmd_init: Starting crmd
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: s_crmd_fsa: Processing I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_LOG
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_STARTUP
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Registering Signal Handlers
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_startup: Creating CIB and LRM objects
>>>>> Mar 22 11:25:18 n0014 stonith-ng: [25838]: info: crm_update_peer: Node n0014.lustre: id=247988234 state=unknown addr=(null) votes=0 born=0 seen=0 proc=00000000000000000000000000110312 (new)
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: info: G_main_add_SignalHandler: Added signal handler for signal 17
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: do_fsa_action: actions:trace: #011// A_CIB_START
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_rw
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to command channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Attempting to talk on: /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: init_client_ipc_comms_nodispatch: Could not init comms on: /var/run/crm/cib_callback
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to callback channel failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signon_raw: Connection to CIB failed: connection failed
>>>>> Mar 22 11:25:18 n0014 crmd: [25843]: debug: cib_native_signoff: Signing out of the CIB Service
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: Element cib failed to validate content
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: readCibXmlFile: CIB does not validate with <null>
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: startCib: CIB Initialization completed successfully
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: info: get_cluster_type: Cluster type is: 'corosync'
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: ERROR: init_cpg_connection: Could not connect to the Cluster Process Group API: 2
>>>>> Mar 22 11:25:18 n0014 cib: [25839]: CRIT: cib_init: Cannot sign in to the cluster... terminating
>>>>>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org