Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-22 Thread Brian J. Murrell (brian)
On Mon, 2014-10-13 at 12:51 +1100, Andrew Beekhof wrote: > > Even the same address can be a problem. That brief window where things were > getting renewed can screw up corosync. But as I proved, there was no renewal at all during the period of this entire pacemaker run, so the use of DHCP here i

Re: [Pacemaker] unknown third node added to a 2 node cluster?

2014-10-10 Thread Brian J. Murrell (brian)
On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote: > On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian) > wrote: > > > Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes "node5" > > and "node6" I saw an "unknown" third nod

[Pacemaker] unknown third node added to a 2 node cluster?

2014-10-07 Thread Brian J. Murrell (brian)
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes "node5" and "node6" I saw an "unknown" third node being added to the cluster, but only on node5: Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 12: memb=2, new=0, lost=

[Pacemaker] another "node rebooting too quickly" bug?

2014-04-24 Thread Brian J. Murrell
Hi, As was previously discussed there is a bug in the handling of a STONITH if a node reboots too quickly. I had a different kind of failure that I suspect is the same kind of problem, just different symptom. The situation is a two node cluster with two resources plus a fencing resource. Each n

Re: [Pacemaker] Node stuck in pending state

2014-04-10 Thread Brian J. Murrell
On Thu, 2014-04-10 at 10:04 +1000, Andrew Beekhof wrote: > > Brian: the detective work above is highly appreciated NP. I feel like I am getting better at reading these logs and can provide some more detailed dissection of them. And am happy to do so to help get to the bottom of things. :-) >

Re: [Pacemaker] Node stuck in pending state

2014-04-09 Thread Brian J. Murrell
On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote: > Looks like your fencing (stonith) failed. Where? If I'm reading the logs correctly, it looks like stonith worked. Here's the stonith: Apr 8 09:53:21 lotus-4vm6 stonith-ng[2492]: notice: log_operation: Operation 'reboot' [3306] (call 2 from

[Pacemaker] Reason for automatic migration after one node rebooted?

2014-02-06 Thread Andrew J. Caines
more detail is needed, then I'll be happy to provide it. [1] http://pastebin.com/raw.php?i=3ThD1uM7 [2] http://pastebin.com/raw.php?i=5F9142SF [3] http://pastebin.com/raw.php?i=LA4E0vUS [4] http://pastebin.com/raw.php?i=6BpB5L4u -- -Andrew J. Caines- Unix Systems Engineer a.j.cai...@halpla

Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2014-02-06 Thread Brian J. Murrell (brian)
On Thu, 2014-02-06 at 10:42 -0500, Brian J. Murrell (brian) wrote: > On Wed, 2014-01-08 at 13:30 +1100, Andrew Beekhof wrote: > > What version of pacemaker? > > Most recently I have been seeing this in 1.1.10 as shipped by RHEL6.5. Doh! Somebody did a test run that had not been

Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2014-02-06 Thread Brian J. Murrell (brian)
On Wed, 2014-01-08 at 13:30 +1100, Andrew Beekhof wrote: > What version of pacemaker? Most recently I have been seeing this in 1.1.10 as shipped by RHEL6.5. > On 10 Dec 2013, at 4:40 am, Brian J. Murrell > wrote: > I didn't seem to get a response to any of the below questio

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-21 Thread Brian J. Murrell (brian)
On Thu, 2014-01-16 at 14:49 +1100, Andrew Beekhof wrote: > > What crm_mon are you looking at? > I see stuff like: > > virt-fencing (stonith:fence_xvm):Started rhos4-node3 > Resource Group: mysql-group > mysql-vip(ocf::heartbeat:IPaddr2): Started rhos4-node3 > mysql

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Brian J. Murrell (brian)
On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote: > > I know, I was giving you another example of when the cib is not completely > up-to-date with reality. Yeah, I understood that. I was just countering with why that example is actually more acceptable. > It may very well be partially s

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-15 Thread Brian J. Murrell (brian)
On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote: > > Consider any long running action, such as starting a database. > We do not update the CIB until after actions have completed, so there can and > will be times when the status section is out of date to one degree or another. But that is

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-14 Thread Brian J. Murrell (brian)
On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote: > > > On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote: > >> > >> The local cib hasn't caught up yet by the looks of it. I should have asked in my previous message: is this entirely an artifact of having just restarted or are there

Re: [Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote: > > The local cib hasn't caught up yet by the looks of it. Should crm_resource actually be [mis-]reporting as if it were knowledgeable when it's not though? IOW is this expected behaviour or should it be considered a bug? Should I open a

[Pacemaker] crm_resource -L not trustable right after restart

2014-01-13 Thread Brian J. Murrell (brian)
Hi, I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output of "crm_resource -L" is not trust-able, shortly after a node is booted. Here is the output from crm_resource -L on one of the nodes in a two node cluster (the one that was not rebooted): st-fencing (stonith:fence_foo

Re: [Pacemaker] does adding a second ring actually work with cman?

2013-12-17 Thread Brian J. Murrell
On Tue, 2013-12-17 at 16:33 +0100, Florian Crouzat wrote: > > Is it possible that lotus-5vm8 (from DNS) and lotus-5vm8-ring1 (from > /etc/hosts) resolves to the same IP (10.128.0.206) which could maybe > confuse cman and make it decide that there is only one ring ? No, they do resolve to two d

[Pacemaker] does adding a second ring actually work with cman?

2013-12-16 Thread Brian J. Murrell
So, I was reading: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s2-rrp-ccs-CA.html about adding a second ring to one's CMAN configuration. I tried to add a second ring to my configuration without success. Given the example: # ccs -h

[Pacemaker] cman, ccs: Validation Failure, unable to modify configuration file

2013-12-16 Thread Brian J. Murrell
So, trying to create a cluster on a given node with ccs: # ccs -p xxx -h $(hostname) --createcluster foo2 Validation Failure, unable to modify configuration file (use -i to ignore this error). But there shouldn't be any configuration here yet. I've not done anything with this node: # ccs -p xx

Re: [Pacemaker] is ccs as racy as it feels?

2013-12-10 Thread Brian J. Murrell
On Tue, 2013-12-10 at 10:27 +, Christine Caulfield wrote: > > Sadly you're not wrong. That's what I was afraid of. > But it's actually no worse than updating > corosync.conf manually, I think it is... > in fact it's pretty much the same thing, Not really. Updating corosync.conf on any

[Pacemaker] is ccs as racy as it feels?

2013-12-09 Thread Brian J. Murrell
So, I'm trying to wrap my head around this need to migrate to pacemaker +CMAN. I've been looking at http://clusterlabs.org/quickstart-redhat.html and https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/ It seems "ccs" is the tool to configure

Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2013-12-09 Thread Brian J. Murrell
On Mon, 2013-12-09 at 09:28 +0100, Jan Friesse wrote: > > Error 6 means "try again". This happens either if corosync is > overloaded or is creating a new membership. Please take a look at > /var/log/cluster/corosync.log to see if there is something strange there (+ make > sure you have newest corosyn
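The "try again" meaning of rc=6 described above can be pictured as a bounded retry loop. The following is only a minimal sketch of that general pattern, not Pacemaker's or corosync's actual code; `TryAgain`, `send_with_retry` and `make_flaky_sender` are hypothetical names invented for the illustration.

```python
import time


class TryAgain(Exception):
    """Hypothetical stand-in for a transient 'try again' (rc=6) result."""


def send_with_retry(send, message, attempts=5, delay=0.0):
    """Call send(message), retrying on TryAgain up to `attempts` times.

    Returns the number of attempts used; re-raises TryAgain if all fail.
    """
    for attempt in range(1, attempts + 1):
        try:
            send(message)
            return attempt
        except TryAgain:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff


def make_flaky_sender(failures):
    """Build a sender that raises TryAgain `failures` times, then succeeds."""
    state = {"left": failures, "sent": []}

    def send(message):
        if state["left"] > 0:
            state["left"] -= 1
            raise TryAgain()
        state["sent"].append(message)

    return send, state
```

A caller that gives up after a fixed number of attempts sees the error surface exactly as in the subject line, which is why a sender that never stops failing (for example while a new membership is forming) still ends in an error.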

[Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2013-12-06 Thread Brian J. Murrell (brian)
I seem to have another instance where pacemaker fails to exit at the end of a shutdown. Here's the log from the start of the "service pacemaker stop": Dec 3 13:00:39 wtm-60vm8 crmd[14076]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCES

[Pacemaker] prevent starting resources on failed node

2013-12-06 Thread Brian J. Murrell (brian)
[ Hopefully this doesn't cause a duplicate post but my first attempt returned an error. ] Using pacemaker 1.1.10 (but I think this issue is more general than that release), I want to enforce a policy that once a node fails, no resources can be started/run on it until the user permits it. I have b

Re: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-03 Thread Brian J. Murrell
On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: > > We did away with all of the policy engine logic involved with trying to move > fencing devices off of the target node before executing the fencing action. > Behind the scenes all fencing devices are now essentially clones. If the > t

[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

2013-12-02 Thread Brian J. Murrell
So, I'm migrating my working pacemaker configuration from 1.1.7 to 1.1.10 and am finding what appears to be a new behavior in 1.1.10. If a given node is running a fencing resource and that node goes AWOL, it needs to be fenced (of course). But any other node trying to take over the fencing resour

Re: [Pacemaker] Best way to notify stonith action

2013-07-08 Thread Brian J. Murrell
On 13-07-08 03:48 AM, Andreas Mock wrote: Hi all, I'm just wondering what the best way is to let an admin know that the cluster (rest of a cluster) has stonithed some other nodes? You could modify or even just wrap the stonith agent. They are usually just python or shell script anyway (well,

Re: [Pacemaker] error: do_exit: Could not recover from internal error

2013-05-23 Thread Brian J. Murrell
On 13-05-22 07:05 PM, Andrew Beekhof wrote: > > Also, 1.1.8-7 was not tested with the plugin _at_all_ (and neither will > future RHEL builds). Was 1.1.7-* in EL 6.3 tested with the plugin? Is staying with most recent EL 6.3 pacemaker-1.1.7 release really the more stable option for people not a

[Pacemaker] error: do_exit: Could not recover from internal error

2013-05-22 Thread Brian J. Murrell
Using pacemaker 1.1.8-7 on EL6, I got the following series of events trying to shut down pacemaker and then corosync. The corosync shutdown (service corosync stop) ended up spinning/hanging indefinitely (~7hrs now). The events, including a: May 21 23:47:18 node1 crmd[17598]:error: do_exit: C

[Pacemaker] stonith-ng: error: remote_op_done: Operation reboot of node2 by node1 for stonith_admin: Timer expired

2013-05-16 Thread Brian J. Murrell
Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm finding strange behavior with "stonith_admin -B node2". It seems to shut the node down but not start it back up and ends up reporting a timer expired: # stonith_admin -B node2 Command failed: Timer expired The pacemaker log for the op

Re: [Pacemaker] resource starts but then fails right away

2013-05-10 Thread Brian J. Murrell
On 13-05-09 09:53 PM, Andrew Beekhof wrote: > > May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing > resource testfs-resource1 for 18002_crm_resource (internal) on node1 > May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation > monitor[8] on ocf::Target::

[Pacemaker] resource starts but then fails right away

2013-05-09 Thread Brian J. Murrell
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence of a situation where I add a resource and start it and it seems to start but then right away fail. i.e. # clean up resource before trying to start, just to make sure we start with a clean slate # crm resource cleanup testfs-r

[Pacemaker] warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1)

2013-04-30 Thread Brian J. Murrell
Using 1.1.8 on EL6.4, I am seeing this sort of thing: pengine[1590]: warning: unpack_rsc_op: Processing failed op monitor for my_resource on node1: unknown error (1) The full log from the point of adding the resource until the errors: Apr 30 11:46:30 node1 cibadmin[3380]: notice: crm_log_arg

Re: [Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Brian J. Murrell
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote: > > Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB, > and will complete the fencing request even if the fencing/stonith > resource is not instantiated on the node yet. But clearly that's not happening here. > (There's a bug

[Pacemaker] will a stonith resource be moved from an AWOL node?

2013-04-30 Thread Brian J. Murrell
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away from AWOL hosts like I thought I did with 1.1.7. So I guess the first thing to do is clear up what is supposed to happen. If I have a single stonith resource for a cluster and it's running on node A and then node A goes AWOL,

Re: [Pacemaker] why so long to stonith?

2013-04-24 Thread Brian J. Murrell
On 13-04-24 01:16 AM, Andrew Beekhof wrote: > > Almost certainly you are hitting: > > https://bugzilla.redhat.com/show_bug.cgi?id=951340 Yup. The patch posted there fixed it. > I am doing my best to convince people that make decisions that this is worthy > of an update before 6.5. I've a

[Pacemaker] why so long to stonith?

2013-04-23 Thread Brian J. Murrell
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed (-KILL) corosync on a peer node. Pacemaker seemed to take a long time to transition to stonithing it though after noticing it was AWOL: Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configurati

[Pacemaker] crm_attribute not returning node attribute

2013-04-19 Thread Brian J. Murrell
Given: host1# crm node attribute host1 show foo scope=nodes name=foo value=bar Why doesn't this return anything: host1# crm_attribute --node host1 --name foo --query host1# echo $? 0 cibadmin -Q confirms the presence of the attribute: This is on pac

Re: [Pacemaker] racing crm commands... last write wins?

2013-04-12 Thread Brian J. Murrell
On 13-04-10 07:02 PM, Andrew Beekhof wrote: > > On 11/04/2013, at 6:33 AM, Brian J. Murrell > wrote: >> >> Does crm_resource suffer from this problem > > no Excellent. I was unable to find any comprehensive documentation on just how to implement a pacemake

Re: [Pacemaker] racing crm commands... last write wins?

2013-04-12 Thread Brian J. Murrell
On 13-04-11 06:00 PM, Andrew Beekhof wrote: > > Actually, I think the semantics of -C are first-write-wins and any subsequent > attempts fail with -EEXIST Indeed, you are correct. I think my point though was that it didn't matter in my case which writer wins since they should all be trying to
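The first-write-wins behaviour described for -C can be modelled with a toy create-only store. This is a sketch under stated assumptions: `Store` is a hypothetical class for illustration and says nothing about how the CIB is implemented; only the "first create succeeds, later creates fail with -EEXIST" semantics come from the message above.

```python
import errno


class Store:
    """Toy object store; NOT the CIB, just a model of create-only semantics."""

    def __init__(self):
        self.objects = {}

    def create(self, key, value):
        """Create key if absent; return 0 on success, -EEXIST otherwise."""
        if key in self.objects:
            return -errno.EEXIST
        self.objects[key] = value
        return 0
```

When every writer is trying to create the same object with the same content, it does not matter which one wins: the losers get -EEXIST and the stored object is the same either way.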

Re: [Pacemaker] racing crm commands... last write wins?

2013-04-11 Thread Brian J. Murrell
On 13-04-11 07:37 AM, Brian J. Murrell wrote: > > In exploring all options, how about pcs? Does pcs' "resource create > ..." for example have the same read+modify+replace problem as crm > configure or does pcs resource create also only send proper fragments to > u

Re: [Pacemaker] racing crm commands... last write wins?

2013-04-11 Thread Brian J. Murrell
On 13-04-10 04:33 PM, Brian J. Murrell wrote: > > Does crm_resource suffer from this problem or does it properly only send > exactly the update to the CIB for the operation it's trying to achieve? In exploring all options, how about pcs? Does pcs' "resource create ..."

Re: [Pacemaker] racing crm commands... last write wins?

2013-04-10 Thread Brian J. Murrell
On 13-02-21 07:48 PM, Andrew Beekhof wrote: > On Fri, Feb 22, 2013 at 5:18 AM, Brian J. Murrell > wrote: >> I wonder what happens in the case of two racing "crm" commands that want >> to update the CIB (with non-overlapping/conflicting data). Is there any >&

Re: [Pacemaker] Same host displayed twice in crm status

2013-04-02 Thread Nicolas J.
so I don't know where the reference to the old name can be saved except in the cluster. Regarding the version, here are the details: - Corosync 1.2.7-1.1.el5 - Pacemaker 1.1.5-1.1.el5 2013/4/1 David Vossel > - Original Message - > > From: "Nicolas J." > > To: p

[Pacemaker] Same host displayed twice in crm status

2013-03-29 Thread Nicolas J.
s.com INFO: node VMTESTORADG2.it.dbi-services.com not found by crm_node INFO: node VMTESTORADG2.it.dbi-services.com deleted Thanks in advance Best Regards, Nicolas J.

Re: [Pacemaker] stonith and avoiding split brain in two nodes cluster

2013-03-28 Thread Brian J. Murrell
On 13-03-25 03:50 PM, Jacek Konieczny wrote: > > The first node to notice that the other is unreachable will fence (kill) > the other, making sure it is the only one operating on the shared data. Right. But with typical two-node clusters ignoring no-quorum, because quorum is being ignored, as so

Re: [Pacemaker] racing crm commands... last write wins?

2013-02-25 Thread Brian J. Murrell
On 13-02-25 10:30 AM, Dejan Muhamedagic wrote: > > Before doing replace, crmsh queries the CIB and checks if the > epoch was modified in the meantime. But doesn't take out a lock of any sort to prevent an update in the meanwhile, right? > Those operations are not > atomic, though. Indeed. > Pe
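The gap discussed above (an epoch check followed by a separate replace, with no lock in between) is a classic check-then-act race. The sketch below models it with a toy store; `ToyCib`, `racy_add` and `replace_if_epoch` are hypothetical names and do not correspond to any crmsh or Pacemaker API. The atomic compare-and-replace is shown only as a contrast, not as something either tool is known to provide.

```python
class ToyCib:
    """Toy configuration store with an epoch counter (not the real CIB)."""

    def __init__(self):
        self.epoch = 0
        self.config = frozenset()

    def read(self):
        return self.epoch, self.config

    def replace(self, config):
        """Unconditional replace: whoever writes last wins."""
        self.config = frozenset(config)
        self.epoch += 1

    def replace_if_epoch(self, expected_epoch, config):
        """Atomic compare-and-replace: refuse if the epoch moved on."""
        if self.epoch != expected_epoch:
            return False
        self.replace(config)
        return True


def racy_add(cib, snapshots, items):
    """Interleave clients: all check the epoch first, then all replace."""
    # Step 1: every client verifies the epoch is unchanged. All pass,
    # because no one has written yet.
    checks = [cib.epoch == epoch for epoch, _ in snapshots]
    # Step 2: every client that passed the check replaces the whole config
    # with its own stale snapshot plus its one new item.
    for ok, (_, config), item in zip(checks, snapshots, items):
        if ok:
            cib.replace(config | {item})
    return cib.config
```

In the racy interleaving, two clients that each add one item end up with only the second client's item stored; with the compare-and-replace variant the second client is refused and has to re-read before retrying.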

Re: [Pacemaker] a situation where pacemaker refuses to stop

2013-02-25 Thread Brian J. Murrell
On 13-02-24 07:56 PM, Andrew Beekhof wrote: > > Basically yes. > Stonith is the first stage of recovery and supposed to be at least > vaguely reliable. > Have you figured out why fencing is so broken? It wasn't really "broken" but was in the process of being configured when this situation arose.

[Pacemaker] a situation where pacemaker refuses to stop

2013-02-23 Thread Brian J. Murrell
I seem to have found a situation where pacemaker (pacemaker-1.1.7-6.el6.x86_64) refuses to stop (i.e. service pacemaker stop) on EL6. The status of the 2 node cluster was that the node being asked to stop (node2) was continually trying to stonith another node (node1) in the cluster which was not r

[Pacemaker] racing crm commands... last write wins?

2013-02-21 Thread Brian J. Murrell
I wonder what happens in the case of two racing "crm" commands that want to update the CIB (with non-overlapping/conflicting data). Is there any locking to ensure that one crm cannot overwrite the other's change? (i.e. second one to get there has to read in the new CIB before being able to apply h

[Pacemaker] return properties and rsc_defaults back to default values

2013-02-14 Thread Brian J. Murrell
Is there a way to return an individual property (or all properties) and/or a rsc_default (or all) back to default values, using crm, or otherwise? Cheers, b.

[Pacemaker] location constraint "anywhere" on asymmetric cluster

2013-01-30 Thread Brian J. Murrell
I'm experimenting with asymmetric clusters and resource location constraints. My cluster has some resources which have to be restricted to certain nodes and other resources which can run on any node. Given that, an "opt-in" cluster seems the most manageable. That is, it seems easier to create co

Re: [Pacemaker] best/proper way to shut down a node for service

2013-01-23 Thread Brian J. Murrell
On 13-01-23 03:32 AM, Dan Frincu wrote: > Hi, Hi, > I usually put the node in standby, which means it can no longer run > any resources on it. Both Pacemaker and Corosync continue to run, node > provides quorum. But a node in standby will still be STONITHed if it goes AWOL. I put a node in stan

[Pacemaker] best/proper way to shut down a node for service

2013-01-22 Thread Brian J. Murrell
OK. So you have a corosync cluster of nodes with pacemaker managing resources on them, including (of course) STONITH. What's the best/proper way to shut down a node, say, for maintenance such that pacemaker doesn't go trying to "fix" that situation and STONITHing it to try to bring it back up, et

Re: [Pacemaker] NFS resource isn't completely working

2012-10-25 Thread Lonni J Friedman
On Wed, Oct 24, 2012 at 5:59 PM, Andrew Beekhof wrote: > On Wed, Oct 17, 2012 at 8:30 AM, Lonni J Friedman wrote: >> Greetings, >> I'm trying to get an NFS server export to be correctly monitored & >> managed by pacemaker, along with pre-existing IP, drbd and f

Re: [Pacemaker] 'crm configure edit' failed with "Timer expired"

2012-10-18 Thread Lonni J Friedman
ted a new corosync.conf, and now the nodes are talking again. Sorry for the noise. On Thu, Oct 18, 2012 at 10:25 AM, Lonni J Friedman wrote: > Both nodes can ssh to each other, selinux is disabled, and there are > currently no iptables rules in force. So I'm not sure why the

Re: [Pacemaker] 'crm configure edit' failed with "Timer expired"

2012-10-18 Thread Lonni J Friedman
y one would be elected quite quickly, you may have a > network/firewall issue. > > On Thu, Oct 18, 2012 at 10:37 AM, Lonni J Friedman wrote: >> I'm running Fedora17, with pacemaker-1.18. I just tried to make a >> configuration change with crmsh, and it failed as follows:

[Pacemaker] 'crm configure edit' failed with "Timer expired"

2012-10-17 Thread Lonni J Friedman
I'm running Fedora17, with pacemaker-1.18. I just tried to make a configuration change with crmsh, and it failed as follows: ## # crm configure edit Call cib_replace failed (-62): Timer expired ERROR: could not replace cib INFO: offending xml:

[Pacemaker] NFS resource isn't completely working

2012-10-16 Thread Lonni J Friedman
Greetings, I'm trying to get an NFS server export to be correctly monitored & managed by pacemaker, along with pre-existing IP, drbd and filesystem mounts (which are working correctly). While NFS is up on the primary node (along with the other services), the monitoring portion keeps showing up as

Re: [Pacemaker] setting up NFS resources on systemd based Linux distributions

2012-10-16 Thread Lonni J Friedman
On Mon, Oct 15, 2012 at 8:51 PM, Andrew Beekhof wrote: > On Tue, Oct 16, 2012 at 2:50 PM, Andrew Beekhof wrote: >> On Tue, Oct 16, 2012 at 9:24 AM, Lonni J Friedman wrote: >>> On Thu, Sep 27, 2012 at 6:24 AM, David Vossel wrote: >>>> - Original Message -

Re: [Pacemaker] setting up NFS resources on systemd based Linux distributions

2012-10-15 Thread Lonni J Friedman
On Thu, Sep 27, 2012 at 6:24 AM, David Vossel wrote: > - Original Message - >> From: "Lonni J Friedman" >> To: pacemaker@oss.clusterlabs.org >> Sent: Wednesday, September 26, 2012 9:44:21 PM >> Subject: [Pacemaker] setting up NFS resources on s

Re: [Pacemaker] failed over filesystem mount points not coming up on secondary node

2012-10-01 Thread Lonni J Friedman
On Mon, Oct 1, 2012 at 2:14 PM, Jake Smith wrote: > - Original Message - >> From: "Lonni J Friedman" >> To: "The Pacemaker cluster resource manager" >> Sent: Monday, October 1, 2012 4:31:05 PM >> Subject: Re: [Pacemaker] failed over

Re: [Pacemaker] failed over filesystem mount points not coming up on secondary node

2012-10-01 Thread Lonni J Friedman
"1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ no-quorum-policy="ignore" ## On Thu, Sep 27, 2012 at 3:10 PM, Lonni

[Pacemaker] 1.1.8 dox?

2012-10-01 Thread Lonni J Friedman
Anyone know where the documentation is for 1.1.8 ? I'm looking here, and everything seems to be months old: http://www.clusterlabs.org/doc/ I keep seeing references to "the shell is gone from 1.1.8", but I can't find any documentation of the impact to a sysadmin, or what the new hotness is to rep

Re: [Pacemaker] setting up NFS resources on systemd based Linux distributions

2012-10-01 Thread Lonni J Friedman
On Sun, Sep 30, 2012 at 7:19 AM, Andrew Beekhof wrote: > On Fri, Sep 28, 2012 at 6:13 AM, Lonni J Friedman wrote: >> On Thu, Sep 27, 2012 at 6:24 AM, David Vossel wrote: >>> - Original Message - >>>> From: "Lonni J Friedman" >>>> To

[Pacemaker] failed over filesystem mount points not coming up on secondary node

2012-09-27 Thread Lonni J Friedman
Greetings, I've just started playing with pacemaker/corosync on a two node setup. At this point I'm just experimenting, and trying to get a good feel of how things work. Eventually I'd like to start using this in a production environment. I'm running Fedora16-x86_64 with pacemaker-1.1.7 & corosy

Re: [Pacemaker] setting up NFS resources on systemd based Linux distributions

2012-09-27 Thread Lonni J Friedman
On Thu, Sep 27, 2012 at 6:24 AM, David Vossel wrote: > - Original Message - >> From: "Lonni J Friedman" >> To: pacemaker@oss.clusterlabs.org >> Sent: Wednesday, September 26, 2012 9:44:21 PM >> Subject: [Pacemaker] setting up NFS resources on s

[Pacemaker] setting up NFS resources on systemd based Linux distributions

2012-09-26 Thread Lonni J Friedman
I'm trying to set up NFS resources on Fedora16, and it's not working. After googling, I stumbled across the following discussion from about 8 months ago: http://www.gossamer-threads.com/lists/linuxha/pacemaker/77404 Has anything changed since then, or is systemd still not supported? thanks

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-04 Thread Brian J. Murrell
On 12-07-04 04:27 AM, Andreas Kurz wrote: > > beside increasing the batch limit to a higher value ... did you also > tune corosync totem timings? Not yet. But a closer look at the logs reveals a bunch of these: Jun 28 14:56:56 node-2 corosync[30497]: [pcmk ] ERROR: send_cluster_msg_raw: Chi

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-04 Thread Brian J. Murrell
On 12-07-04 02:12 AM, Andrew Beekhof wrote: > On Wed, Jul 4, 2012 at 10:06 AM, Brian J. Murrell > wrote: >> >> Just because I reduced the number of nodes doesn't mean that I reduced >> the parallelism any. > > Yes. You did. You reduced the number of "che

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-03 Thread Brian J. Murrell
On 12-07-03 04:26 PM, David Vossel wrote: > > This is not a definite. Perhaps you are experiencing this given the > pacemaker version you are running Yes, that is absolutely possible and it certainly has been under consideration throughout this process. I did also recognize however, that I am

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-03 Thread Brian J. Murrell
On 12-07-03 06:17 PM, Andrew Beekhof wrote: > > Even adding passive nodes multiplies the number of probe operations > that need to be performed and loaded into the cib. So it seems. I just would not have thought they would be such a load since, from a simplistic perspective, they are not trying t

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-07-03 Thread Brian J. Murrell
On 12-06-27 11:30 PM, Andrew Beekhof wrote: > > The updates from you aren't the problem. It's the number of resource > operations (that need to be stored in the CIB) that result from your > changes that might be causing the problem. Just to follow this up for anyone currently following or anyone

Re: [Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-06-27 Thread Brian J. Murrell
On 12-06-26 09:54 PM, Andrew Beekhof wrote: > > The DC, possibly you didn't have one at that moment in time. It was the DC in fact. I restarted corosync on that node and the timeouts went away. But note I "re"started, not started. It was running at the time, just not properly, apparently. > W

[Pacemaker] Call cib_query failed (-41): Remote node did not respond

2012-06-26 Thread Brian J. Murrell
So, I have an 18 node cluster here (so a small haystack, indeed, but still a haystack in which to try to find a needle) where a certain set of (yet unknown, figuring that out is part of this process) operations are pooching pacemaker. The symptom is that on one or more nodes I get the following ki

Re: [Pacemaker] manually failing back resources when set sticky

2012-03-30 Thread Brian J. Murrell
On 12-03-30 02:35 PM, Florian Haas wrote: > > crm configure rsc_defaults resource-stickiness=0 > > ... and then when resources have moved back, set it to 1000 again. > It's really that simple. :) That sounds racy. I am changing a parameter which has the potential to affect the stickiness of all

[Pacemaker] manually failing back resources when set sticky

2012-03-30 Thread Brian J. Murrell
In my cluster configuration, each resource can be run on one of two nodes and I designate a "primary" and a "secondary" using location constraints such as: location FOO-primary FOO 20: bar1 location FOO-secondary FOO 10: bar2 And I also set a default stickiness to prevent auto-fail-back (i.e. to p
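How the location scores above interact with stickiness can be sketched with a deliberately simplified model: the resource goes to the node with the highest total score, and the stickiness value is added only to the node where the resource is currently running. `choose_node` is a hypothetical helper, the stickiness values in the usage note are illustrative assumptions, and the real policy engine weighs many more factors than this.

```python
def choose_node(location_scores, current_node, stickiness):
    """Pick the node with the highest total score.

    Stickiness is added only to the node currently running the resource.
    """
    def total(node):
        bonus = stickiness if node == current_node else 0
        return location_scores[node] + bonus

    return max(location_scores, key=total)
```

With scores of 20 for bar1 and 10 for bar2, a resource that has failed over to bar2 stays there as long as 10 plus the stickiness exceeds 20 (for example with a stickiness of 1000), and moves back to bar1 once the stickiness is dropped to 0.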

Re: [Pacemaker] resources show as running on all nodes right after adding them

2012-03-28 Thread Brian J. Murrell
On 12-03-28 10:39 AM, Florian Haas wrote: > > Probably because your resource agent reports OCF_SUCCESS on a probe > operation To be clear, is this the "status" $OP in the agent? Cheers, b.

[Pacemaker] resources show as running on all nodes right after adding them

2012-03-28 Thread Brian J. Murrell
We seem to have occasion where we find crm_resource reporting that a resource is running on more (usually all!) nodes when we query right after adding it: # crm_resource --resource chalkfs-OST_3 --locate resource chalkfs-OST_3 is running on: chalk02 resource chalkfs-OST_3 is running on

Re: [Pacemaker] running a resource on any node in an asymmetric cluster

2011-10-26 Thread Brian J. Murrell
On 11-10-26 10:19 AM, Brian J. Murrell wrote: > > # cat /tmp/foo.xml > > ^^^ I figured it out. This "integer" has to be quoted. I'm thinking too much like a programmer. :-/ Cheers, b.

[Pacemaker] running a resource on any node in an asymmetric cluster

2011-10-26 Thread Brian J. Murrell
I want to be able to run a resource on any node in an asymmetric cluster so I tried creating a rule to run it on any node not named "foo" since there are no nodes named foo in my cluster: # cat /tmp/foo.xml for the resource bar: primitive bar stonith:fence_virsh \ params ipa

[Pacemaker] cloning primatives with differing params

2011-10-25 Thread Brian J. Murrell
I want to create a stonith primitive and clone it for each node in my cluster. I'm using the fence-agents virsh agent as my stonith primitive. Currently for a single node it looks like: primitive st-pm-node1 stonith:fence_virsh \ params ipaddr="192.168.122.1" login="xxx" passwd="xxx" por

Re: [Pacemaker] stonith configured but not happening

2011-10-18 Thread Brian J. Murrell
On 11-10-18 09:40 AM, Andreas Kurz wrote: > Hello, Hi, > I'd expect this to be the problem ... if you insist on using an > unsymmetric cluster you must add a location score for each resource you > want to be up on a node ... so add a location constraint for the fencing > clone for each node ... o

[Pacemaker] stonith configured but not happening

2011-10-18 Thread Brian J. Murrell
I have a pacemaker 1.0.10 installation on rhel5 but I can't seem to manage to get a working stonith configuration. I have tested my stonith device manually using the stonith command and it works fine. What doesn't seem to be happening is pacemaker/stonithd actually asking for a stonith. In my lo

[Pacemaker] concurrent uses of cibadmin: Signon to CIB failed: connection failed

2011-09-29 Thread Brian J. Murrell
So, in another thread there was a discussion of using cibadmin to mitigate possible concurrency issues with the crm shell. I have written a test program to test that theory and unfortunately cibadmin also falls down in the face of heavy concurrency, with errors such as: Signon to CIB failed: connection f

Re: [Pacemaker] Concurrent runs of 'crm configure primitive' interfering

2011-09-28 Thread Brian J. Murrell
On 11-09-28 10:20 AM, Dejan Muhamedagic wrote: > Hi, Hi, > I'm really not sure. Need to investigate this area more. Well, I am experimenting with cibadmin. It's certainly not as nice and shiny as crm shell though. :-) > cibadmin talks to the cib (the process) and cib should allow > only one w

Re: [Pacemaker] Concurrent runs of 'crm configure primitive' interfering

2011-09-28 Thread Brian J. Murrell
On 11-09-16 11:14 AM, Dejan Muhamedagic wrote: > On Thu, Sep 08, 2011 at 03:41:42PM +0100, John Spray wrote: > >> * Is there another way of adding resources which would be safe when >> run concurrently? > > cibadmin. But doesn't crm use cibadmin itself and if so, shouldn't whatever benefits of

Re: [Pacemaker] Call cib_modify failed (-22): The object/attribute does not exist

2011-09-26 Thread Brian J. Murrell
On 11-09-25 09:21 PM, Andrew Beekhof wrote: > > As the error says, the resource R_10.10.10.101 doesn't exist yet. > Put it in a tag or use -C instead of -U Thanks much. I already replied to Tim, but the summary is that the manpage is incorrect in two places. One is specifying the attributes ta

Re: [Pacemaker] Call cib_modify failed (-22): The object/attribute does not exist

2011-09-26 Thread Brian J. Murrell
On 11-09-26 03:44 AM, Tim Serong wrote: > > Because: > > 1) You need to run "cibadmin -o resources -C -x test.xml" to create the >resource (-C creates, -U updates an existing resource). That's what I thought/wondered but the EXAMPLES section in the manpage is quite clear that it's asking one
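The distinction Tim describes (-C creates, -U updates an existing resource) can be sketched as a small dry-run helper. This is only an illustration: cib_flag and cib_command are hypothetical names, the yes/no argument is supplied by the caller rather than looked up in the CIB, and the script only prints the command line instead of running cibadmin.

```shell
#!/bin/sh
# Pick the cibadmin operation flag.
# $1 is "yes" if the resource already exists in the CIB (caller-supplied assumption),
# anything else means it does not exist yet.
cib_flag() {
    if [ "$1" = "yes" ]; then
        echo "-U"   # update an existing object
    else
        echo "-C"   # create a new object
    fi
}

# Build (but do not execute) the command line for loading test.xml
# into the resources section, as in the thread.
cib_command() {
    echo "cibadmin -o resources $(cib_flag "$1") -x test.xml"
}

cib_command no    # resource not present yet: use -C
cib_command yes   # resource already present: use -U
```

For the case in this thread, where R_10.10.10.101 does not exist yet, the first form is the one that applies.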

[Pacemaker] Call cib_modify failed (-22): The object/attribute does not exist

2011-09-24 Thread Brian J. Murrell
Using pacemaker-1.0.10-1.4.el5 I am trying to add the "R_10.10.10.101" IPaddr2 example resource: from the cibadmin manpage under EXAMPLES and getting: # cibadmin -o resources -U -x test.xml Call cib_modify failed (-22): The object/attribute does not exist Any ideas why? Th

Re: [Pacemaker] resource stickiness and preventing stonith on failback

2011-09-19 Thread Brian J. Murrell
On 11-09-19 11:02 PM, Andrew Beekhof wrote: > On Wed, Aug 24, 2011 at 6:56 AM, Brian J. Murrell > wrote: >> >> 2. preventing the active node from being STONITHed when the resource >> is moved back to its failed-and-restored node after a failover. >> IO

[Pacemaker] is a single node cluster possible?

2011-08-31 Thread Brian J. Murrell
I have a need to create single node clusters with pacemaker. Crazy you might say. It does seem crazy at first but there are two drivers for this: The first is testing. I want to write a single code path for controlling the starting and stopping of resources in larger, real, multi-node clusters

[Pacemaker] property default-resource-stickiness vs. rsc_defaults resource-stickiness

2011-08-25 Thread Brian J. Murrell
I've seen both a default-resource-stickiness property being set (e.g. http://www.howtoforge.com/installation-and-setup-guide-for-drbd-openais-pacemaker-xen-on-opensuse-11.1) and a rsc_defaults option with resource-stickiness (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Sc
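For reference, the two forms being compared look roughly like this in crm shell syntax. This is a dry-run sketch: stickiness_cmd is a hypothetical helper, the value 100 is an arbitrary example, and the function only prints the command rather than changing any cluster configuration.

```shell
#!/bin/sh
# Print the crm shell command for one of the two ways of setting stickiness.
# $1 selects the form ("property" or "rsc_defaults"), $2 is the stickiness value.
stickiness_cmd() {
    case "$1" in
        property)
            # cluster-wide property form
            echo "crm configure property default-resource-stickiness=$2" ;;
        rsc_defaults)
            # resource defaults form
            echo "crm configure rsc_defaults resource-stickiness=$2" ;;
        *)
            echo "unknown form: $1" >&2
            return 1 ;;
    esac
}

stickiness_cmd property 100
stickiness_cmd rsc_defaults 100
```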

[Pacemaker] resource stickiness and preventing stonith on failback

2011-08-23 Thread Brian J. Murrell
Hi All, I am trying to configure pacemaker (1.0.10) to make a single filesystem highly available by two nodes (please don't be distracted by the dangers of multiply mounted filesystems and clustering filesystems, etc., as I am absolutely clear about that -- consider that I am using a filesystem re

Re: [Pacemaker] VirtualDomain/DRBD live migration with pacemaker...

2010-06-15 Thread Dennis J.
On 06/14/2010 11:01 PM, Vadym Chepkov wrote: On Mon, Jun 14, 2010 at 4:37 PM, Erich Weiler wrote: Hi All, We have this interesting problem I was hoping someone could shed some light on. Basically, we have 2 servers acting as a pacemaker cluster for DRBD and VirtualDomain (KVM) resources under

Re: [Pacemaker] how to realize group with colocation?

2010-05-19 Thread Dennis J.
On 05/19/2010 08:59 AM, Andrew Beekhof wrote: > Which part of > > "web_start_0 failed with rc=6: Preventing web from re-starting > anywhere in the cluster" > > Is not clear to you? > > Have a look what rc=6 means: > > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explain

Re: [Pacemaker] Is it possible for ocf:heartbeat:IPaddr2 to be on different NICs?

2010-04-24 Thread Dennis J.
On 04/23/2010 06:04 PM, Dejan Muhamedagic wrote: ... Apr 23 05:11:11 gamma lrmd: [2663]: info: rsc:apache:4: probe Apr 23 05:11:11 gamma IPaddr2[2678]: ERROR: Setup problem: Couldn't find utility ip Apr 23 05:11:11 gamma crmd: [2666]: info: process_lrm_event: LRM operation ClusterIP_monitor_0 (

Re: [Pacemaker] Dropping HeartBeat Stack?

2010-03-04 Thread Dennis J.
On 03/04/2010 03:37 PM, Andrew Beekhof wrote: On Thu, Mar 4, 2010 at 2:54 PM, Dennis J. wrote: Pacemaker pulls in heartbeat and corosync as dependencies. This is what happens on a freshly installed CentOS 5.4 VM: Ah, so I just imagined making that change :-( The next round of packages won't do

Re: [Pacemaker] Dropping HeartBeat Stack?

2010-03-04 Thread Dennis J.
On 03/03/2010 08:09 PM, Andrew Beekhof wrote: On Wed, Mar 3, 2010 at 4:00 PM, Dennis J. wrote: On 03/03/2010 09:24 AM, Andrew Beekhof wrote: On Wed, Mar 3, 2010 at 1:16 AM, Angie T. Muhammad wrote: Hello list I have no technical questions at the moment, just a couple of distribution

Re: [Pacemaker] Dropping HeartBeat Stack?

2010-03-03 Thread Dennis J.
On 03/03/2010 09:24 AM, Andrew Beekhof wrote: On Wed, Mar 3, 2010 at 1:16 AM, Angie T. Muhammad wrote: Hello list I have no technical questions at the moment, just a couple of distribution-specific and backward compatibility questions.. 1- I just wonder will Pacemaker at any time in the near

[Pacemaker] Configuring LVM and Filesystem resources on top of DRBD

2010-02-05 Thread D. J. Draper
I haven't been able to find any documentation outside of the man pages to help troubleshoot this, so I've come to the experts... I'm attempting to setup the following: Services: NFS and Samba Filesystems: /mnt/media | /mnt/datusr
