[Pacemaker] monitor domain controller

2013-05-02 Thread James Harper
Currently I am using a ping resource to ensure that other Windows VMs don't start up until the domain controllers are started. This helps prevent things like Exchange not starting properly because they can't find a domain controller. On a complete network shutdown (e.g. a long power failure), win

Re: [Pacemaker] when is 'not installed' rechecked?

2013-01-02 Thread James Harper
> > On 2012-12-20T05:14:56, James Harper > wrote: > > > I have a resource that returned 'not installed' because (I think) I had > forgotten to install the required package. I've installed the package now but > I > still see the following every 15 min

Re: [Pacemaker] timed out / exec error

2012-12-20 Thread James Harper
> > > > Any cib change throws the system load up for 10-20 seconds, and then > > things start timing out, despite having set the timeouts well in excess of > > the > > time it takes for pacemaker to mark the resource as timed out. > > Hmm, unless your CIB (the configuration) is really huge, that

Re: [Pacemaker] timed out / exec error

2012-12-20 Thread James Harper
> Hi, > > On Tue, Dec 18, 2012 at 10:58:18AM +0000, James Harper wrote: > > For the following failure: > > > > Failed actions: > > p_lvm_iscsi:0_monitor_1 (node=bitvs6, call=57, rc=-2, > > status=Timed Out): unknown exec error > > > >

[Pacemaker] when is 'not installed' rechecked?

2012-12-19 Thread James Harper
I have a resource that returned 'not installed' because (I think) I had forgotten to install the required package. I've installed the package now but I still see the following every 15 minutes: Preventing ocfs2mgmt from re-starting on node1: operation monitor failed 'not installed' (rc=5) As f

Re: [Pacemaker] time synchronisation

2012-12-19 Thread James Harper
> > What is the behaviour of a cluster when the nodes are up to 10 minutes out > of sync with each other, because they've just been booted up after a crash > and the hwclocks are out of date and there is no ntp time source reachable? > Could it cause lots of sig11's and constant re-elections becau

[Pacemaker] time synchronisation

2012-12-19 Thread James Harper
What is the behaviour of a cluster when the nodes are up to 10 minutes out of sync with each other, because they've just been booted up after a crash and the hwclocks are out of date and there is no ntp time source reachable? Could it cause lots of sig11's and constant re-elections because that'

[Pacemaker] timed out / exec error

2012-12-18 Thread James Harper
For the following failure: Failed actions: p_lvm_iscsi:0_monitor_1 (node=bitvs6, call=57, rc=-2, status=Timed Out): unknown exec error Is this the ra itself returning a "Timed Out" error, or is it the cluster software determining that the ra is taking too long and so killing it and dec

[Pacemaker] switch port stonith

2012-10-26 Thread James Harper
Does anyone have an ra that allows a stonith action to disable the ports on a switch for a node? Obviously this would only allow a "shutdown" action not a "reboot" action, but in the absence of anything else it would definitely ensure that the errant node was ejected from the network. I guess re

[Pacemaker] lvm ra timeouts and vgdisplay hang

2012-10-17 Thread James Harper
I've been having a problem with the lvm ra when used in conjunction with clvm when a node dies (eg when I destroy the vm to test this particular scenario) clvm re-organises itself just fine, and comes good well within the lvm ra timeout I set (60 seconds), but if the "vgdisplay -v vg-drbd" comma

[Pacemaker] external/ssh stonith and repeated reboots

2012-10-13 Thread James Harper
I'm using external/ssh in my test cluster (a bunch of vm's), and for some reason the cluster has tried to terminate it but failed, like: Oct 14 15:54:45 ctest0 stonith-ng: [2006]: info: call_remote_stonith: Requesting that ctest2 perform op off ctest1 Oct 14 15:54:45 ctest0 stonith-ng: [2006]: in

Re: [Pacemaker] node utilization error

2012-10-13 Thread James Harper
> -Original Message- > From: James Harper [mailto:james.har...@bendigoit.com.au] > Sent: Saturday, 13 October 2012 1:48 AM > To: pacemaker@oss.clusterlabs.org > Subject: [Pacemaker] node utilization error > > When I try and set the utilization on a node I get

[Pacemaker] node utilization error

2012-10-13 Thread James Harper
When I try to set the utilization on a node I get this:
    # crm node utilization node5 set memory 3
    Error setting memory=3 (section=nodes, set=nodes-node5-utilization): Update does not conform to the configured schema/DTD
    Error performing operation: Update does not conform to the configure

Re: [Pacemaker] high cib load on config change

2012-10-11 Thread James Harper
FWIW, I'm running ocfs2 and looking through the logs a bit more my symptoms seem to match those discussed here - http://www.mentby.com/Group/linux-ha/crmd-31942-warn-decodetransitionkey-bad-uuid-crm-resource-25438-in-sscanf-result-3-for-00crm-resource-25438.html And my test cluster (built on a b

Re: [Pacemaker] high cib load on config change

2012-10-10 Thread James Harper
> > Questions: > - are you making any config changes when this behaviour is occurring? > - if so, from one node only or many? > - what version is this? 1.1.7 or 1.1.7 plus some debian patches? which > patches? > One other thing I just noticed is that ntp wasn't working on the two nodes used fo

Re: [Pacemaker] high cib load on config change

2012-10-10 Thread James Harper
> > I guess I'd first like to know if the log entries I was seeing ("Failed > > application of an update diff" and "Requesting re-sync from peer") means > > that a full resync is being done, and if that's a problem or not. > > There are occasions when its not a problem, but I don't think any of th

Re: [Pacemaker] high cib load on config change

2012-10-10 Thread James Harper
> On 10/09/2012 01:42 PM, James Harper wrote: > > As per previous post, I'm seeing very high cib load whenever I make a > > configuration change, enough load that things timeout seemingly > > instantly. I thought this was happening well before the configured > >

[Pacemaker] high cib load on config change

2012-10-09 Thread James Harper
As per previous post, I'm seeing very high cib load whenever I make a configuration change, enough load that things timeout seemingly instantly. I thought this was happening well before the configured timeout but now I'm not so sure, maybe the timeouts are actually working okay and it just seems

[Pacemaker] cloned resources show as stopped

2012-10-06 Thread James Harper
This is just a cosmetic thing, but I have a cloned resource that shows up like this:
    Clone Set: c_ping_X [p_ping_X]
        Started: [ node1 node2 ]
        Stopped: [ p_ping_X:0 p_ping_X:1 p_ping_X:4 ]
That's correct in that the location restriction on the clone set c_ping_X only allows it to run on no

Re: [Pacemaker] bug in monitor timeout?

2012-10-04 Thread James Harper
> Hi, > > On Wed, Oct 03, 2012 at 10:07:06PM +0000, James Harper wrote: > > It seems like every time I modify a resource, things start timing out. Just > > now I changed the location of where a ping resource could run and this > > happened: > > Oct 4 0

Re: [Pacemaker] emergency pppoe connection

2012-10-03 Thread James Harper
> > > > rule 0: ping_gw eq 0 > > > > rule -inf: #uname ne node1 and #uname ne node2 > > > > > > > > so the first rule says the resource can run on any location if the > > > > gateway > > > can't be pinged, and the second rule says it can't run on any node > > > except > > > node1 and node2. It's an
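The two rules quoted above can be modelled roughly as follows. This is only a sketch under a simplified assumption: every rule whose expression matches a node adds its score to that node, and a total of -inf bars the resource from the node. The function names `location_score` and `may_run` are hypothetical and are not part of Pacemaker or crm. Note that under this model the score-0 rule does not exclude any node by itself; only the -inf rule does.

```python
import math

NEG_INF = -math.inf  # stands in for a -inf (-INFINITY) rule score


def location_score(node_name, ping_gw):
    """Sum the scores of the rules that match this node (simplified model)."""
    score = 0
    # rule 0: ping_gw eq 0
    # Matches when the gateway cannot be pinged; contributes a neutral score.
    if ping_gw == 0:
        score += 0
    # rule -inf: #uname ne node1 and #uname ne node2
    # Matches every node other than node1 and node2 and bars the resource there.
    if node_name != "node1" and node_name != "node2":
        score += NEG_INF
    return score


def may_run(node_name, ping_gw):
    """A node is eligible only if its summed score is not -inf."""
    return location_score(node_name, ping_gw) > NEG_INF


for node in ("node1", "node2", "node3"):
    print(node, may_run(node, ping_gw=0))
```

The sketch only reflects how the scores combine; how the real cluster evaluates the ping attribute is not shown here.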

Re: [Pacemaker] examine ping result

2012-10-03 Thread James Harper
> - Original Message - > > From: "James Harper" > > To: pacemaker@oss.clusterlabs.org > > Sent: Wednesday, October 3, 2012 5:04:53 PM > > Subject: [Pacemaker] examine ping result > > > > I have a ping resource defined like:

[Pacemaker] bug in monitor timeout?

2012-10-03 Thread James Harper
It seems like every time I modify a resource, things start timing out. Just now I changed the location of where a ping resource could run and this happened: Oct 4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in operation lis

[Pacemaker] examine ping result

2012-10-03 Thread James Harper
I have a ping resource defined like:
    primitive p_ping_test ocf:pacemaker:ping \
        params name="ping_test" host_list="192.168.200.253" \
        op monitor interval="10s" timeout="60s" \
        op start interval="0" timeout="60s"
How can I examine the result of "ping_test"? Thanks James

Re: [Pacemaker] migration only between nodes with identical hardware

2012-10-03 Thread James Harper
> > Just a thought: You could derive your own RA from the VirtualDomain-RA > > and implement your own migration logic: If target- and > > source-platforms match, do the live migration, if they don't match > > make migrate_to() call > > stop() and migrate_from() call start(). > > We're still accep
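The suggestion quoted above (live-migrate when the platforms match, otherwise fall back to stop and start) amounts to a small decision table. A minimal sketch of that decision follows, assuming each node's platform can be reduced to a comparable string. The function `plan_migration` and the action names `live_migrate` and `confirm_migration` are hypothetical labels for illustration, not functions of the VirtualDomain RA.

```python
def plan_migration(source_platform, target_platform):
    """Pick what migrate_to and migrate_from should do for a pair of nodes."""
    if source_platform == target_platform:
        # Identical hardware: perform a normal live migration.
        return {"migrate_to": "live_migrate", "migrate_from": "confirm_migration"}
    # Different hardware: migrate_to behaves like stop(), migrate_from like start().
    return {"migrate_to": "stop", "migrate_from": "start"}


print(plan_migration("platform-a", "platform-a"))
print(plan_migration("platform-a", "platform-b"))
```

A real agent would need some way to learn the platform of both nodes; that part is left out here.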

Re: [Pacemaker] migration only between nodes with identical hardware

2012-10-02 Thread James Harper
> > On Mon, Oct 1, 2012 at 5:49 PM, James Harper > wrote: > >> > >> On 2012-09-28 16:24, James Harper wrote: > >> > I have two nodes running identical hardware which run Xen VM's, and > >> > want > >> to add a third node to the clu

[Pacemaker] fencing when stonith-enabled="false"

2012-10-01 Thread James Harper
I am trying to configure stonith. ipmilan was a complete disaster as it kept passing port 623 as the hostname, so now I'm trying external/ipmi. The stonith resource is running as expected, but I still have stonith-enabled="false" in the cluster configuration. To test before I turn stonith on p

Re: [Pacemaker] emergency pppoe connection

2012-10-01 Thread James Harper
> > Hi, > > On Sun, Sep 30, 2012 at 05:52:58AM +, James Harper wrote: > > > > > > rule ping_gw eq 0 and (#uname eq bitvs5 or #uname eq bitvs6) > > > > > > and it tells me it doesn't like "and" and "or" in the same

Re: [Pacemaker] migration only between nodes with identical hardware

2012-10-01 Thread James Harper
> > On 2012-09-28 16:24, James Harper wrote: > > I have two nodes running identical hardware which run Xen VM's, and want > to add a third node to the cluster which can access the same clvm and iscsi > resources, but it will not be identical hardware. The non-identical h

[Pacemaker] won't join cluster

2012-09-30 Thread James Harper
I just tried to join two machines to a cluster that were previously members of another cluster. One worked fine, but the other seems to have some problems. "crm status" says everything is offline, but the other nodes all see it as online and resources actually seem to start. It seems to say this

Re: [Pacemaker] emergency pppoe connection

2012-09-29 Thread James Harper
> > rule ping_gw eq 0 and (#uname eq bitvs5 or #uname eq bitvs6) > > and it tells me it doesn't like "and" and "or" in the same expression (or I > think > that's what it's telling me) > > any suggestions? > I think I've figured it out, I need to do it like: rule 0: ping_gw eq 0 rule -inf: #u

[Pacemaker] emergency pppoe connection

2012-09-29 Thread James Harper
My main router is a Xen VM managed by pacemaker. This causes problems when working from home - just lately the Xen resource agent has been a bit temperamental (see other post) and also a new node I'm adding causes clvm to freeze and the network connection is subsequently dropped requiring a driv

[Pacemaker] monitor interval too short

2012-09-29 Thread James Harper
I have a Xen resource with ops configured like:
    op stop interval="0" timeout="300s" \
    op migrate_from interval="0" timeout="300s" \
    op migrate_to interval="0" timeout="300s" \
    op monitor interval="10s" timeout="90s" \
but it seems like every time I make any change

[Pacemaker] migration only between nodes with identical hardware

2012-09-28 Thread James Harper
I have two nodes running identical hardware which run Xen VM's, and want to add a third node to the cluster which can access the same clvm and iscsi resources, but it will not be identical hardware. The non-identical hardware means that to move a VM to this third node it must be stopped then

[Pacemaker] ordering such that at least one resource started

2012-09-06 Thread James Harper
Further to my last email, I now want to create an ordering rule that says "exchange server should only start when at least one domain controller is already started", but I can't see how to do this using crm. There is a syntax using braces, eg: order dc_then_exch 0: (dc1 dc2) exch but I think t

Re: [Pacemaker] staggered startup

2012-09-06 Thread James Harper
> > On 09/05/2012 04:08 PM, James Harper wrote: > > A power failure tonight indicated that my clustered resources (xen vm's) > have a dependency requirement like "make sure at least one domain > controller VM is fully up and running before starting any other windo

Re: [Pacemaker] staggered startup

2012-09-05 Thread James Harper
> > On Thu, Sep 6, 2012 at 12:08 AM, James Harper > wrote: > > A power failure tonight indicated that my clustered resources (xen vm's) > have a dependency requirement like "make sure at least one domain > controller VM is fully up and running before start

[Pacemaker] staggered startup

2012-09-05 Thread James Harper
A power failure tonight indicated that my clustered resources (xen vm's) have a dependency requirement like "make sure at least one domain controller VM is fully up and running before starting any other windows servers". Determining a status of "fully up and running" is probably complex so as a

Re: [Pacemaker] Join my network on LinkedIn

2012-05-31 Thread James Harper
Attn listadmins - LinkedIn are more than happy to remove mailing list addresses from their auto-invite-everyone-in-my-contacts thing, you just need to email them and tell them what addresses to exclude. > -Original Message- > From: Kavan Smith via LinkedIn [mailto:mem...@linkedin.com] > Se

Re: [Pacemaker] Invitation to connect on LinkedIn

2012-03-18 Thread James Harper
LinkedIn have a facility to exclude mailing list addresses from the "email everyone in my contacts and tell them about linkedin" function, you just need to tell them what those addresses are. Sounds like a job for the list admins... James > -Original Message- > From: Marcio Ribeiro [mai

[Pacemaker] dependency on start but not on running

2011-12-07 Thread James Harper
I have a pair of servers running Xen, with the xen config files stored on an ocfs2 share mounted on an iscsi volume. A problem has developed where ocfs2 seems to get stuck (in the monitor script I think), and because I have a dependency of xen vm's depending on the volume where the config files are

Re: [Pacemaker] are stopped resources monitored?

2011-11-29 Thread James Harper
> > > > That thread goes around in circles and completely contradicts what I'm > > seeing. What I'm seeing is that unmanaged resources are never monitored. > > would be strange and how do you verify this? A look at your config may also > help to shed some light on this ... > The relevant portion

Re: [Pacemaker] are stopped resources monitored?

2011-11-29 Thread James Harper
> > On 11/29/2011 11:30 AM, James Harper wrote: > > Is pacemaker expected to monitor stopped and/or unmanaged resources? > > I'm guessing not because that's what I've observed but I can't find > > the behaviour documented anywhere so maybe there is a c

[Pacemaker] are stopped resources monitored?

2011-11-29 Thread James Harper
Is pacemaker expected to monitor stopped and/or unmanaged resources? I'm guessing not because that's what I've observed but I can't find the behaviour documented anywhere so maybe there is a config option I can tweak? Thanks James