Re: [Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-19 Thread Vladislav Bogdanov
20.04.2012 03:09, Andrew Beekhof wrote: > On Thu, Apr 19, 2012 at 11:51 PM, Dan Frincu wrote: >> Hi, >> >> On Thu, Apr 19, 2012 at 3:56 PM, Parshvi wrote: >>> 1) What is the use of ssh without pass key between cluster nodes in >>> pacemaker ? >>> a. Use case: >>>i. Two nodes in a cluster (C

[Pacemaker] Filesystem RA would cause "stop timeout" if it mounts shared storage

2012-04-19 Thread Junko IKEDA
Hi, I observed the following sequence of events: 1) Mount shared storage drive using Filesystem RA. 2) Unlink the Fibre Channel, so the RA cannot access the shared storage for now. 3) RA detects the monitor failure and calls Filesystem_stop(). 4) Filesystem_stop() goes into a timeout error. The current Filesystem

Re: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread Andrew Beekhof
On Fri, Apr 20, 2012 at 7:41 AM, Vladislav Bogdanov wrote: > 19.04.2012 20:48, David Vossel wrote: >> - Original Message - >>> From: "Alan Robertson" >>> To: pacemaker@oss.clusterlabs.org, "Andrew Beekhof" >>> Cc: "Dejan Muhamedagic" >>> Sent: Thursday, April 19, 2012 10:22:48 AM >>> Su

Re: [Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-19 Thread Andrew Beekhof
On Thu, Apr 19, 2012 at 11:51 PM, Dan Frincu wrote: > Hi, > > On Thu, Apr 19, 2012 at 3:56 PM, Parshvi wrote: >> 1) What is the use of ssh without pass key between cluster nodes in >> pacemaker ? >>  a. Use case: >>    i. Two nodes in a cluster (Call them Node-1 and Node-2) >>    ii. One interfa

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Andrew Beekhof
Neither the cluster manager nor the RA can know that the error is temporary. You can only know that with the benefit of hindsight. So what you're asking for is that the cluster ignores the first N errors... which doesn't sound very "HA". The better approach is to write the RA in such a way that it do

Re: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread Vladislav Bogdanov
19.04.2012 20:48, David Vossel wrote: > - Original Message - >> From: "Alan Robertson" >> To: pacemaker@oss.clusterlabs.org, "Andrew Beekhof" >> Cc: "Dejan Muhamedagic" >> Sent: Thursday, April 19, 2012 10:22:48 AM >> Subject: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered

Re: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread Alan Robertson
That's a very interesting idea. Since I'm generating the configuration from a script, that would be easy to add. Also, I could have a variety of different - even overlapping - pseudo-groups implemented in this way. On 04/19/2012 11:48 AM, David Vossel wrote: - Original Message -

Re: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread Rasto Levrinc
On Thu, Apr 19, 2012 at 5:22 PM, Alan Robertson wrote: > Hi Andrew, > > I'm currently working on a fairly large cluster with lots of resources > related to attached hardware. There are 59 of these things and 24 of those > things and so on and each of them has its own resource to deal with the

Re: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread David Vossel
- Original Message - > From: "Alan Robertson" > To: pacemaker@oss.clusterlabs.org, "Andrew Beekhof" > Cc: "Dejan Muhamedagic" > Sent: Thursday, April 19, 2012 10:22:48 AM > Subject: [Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups > (was Re: Is 'resource_set' still >

Re: [Pacemaker] new user with a question

2012-04-19 Thread Sean Roe
Would it be possible to see your resource agent script? Thanks, Sean On Thu, Apr 19, 2012 at 1:17 AM, Andreas Kurz wrote: > On 04/19/2012 12:38 AM, Sean Roe wrote: > > I was planning on running the bacula-sd daemon on the openfiler pair. > > That is why I was asking about setting up the bacula-

Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-19 Thread David Vossel
- Original Message - > From: "Vladislav Bogdanov" > To: pacemaker@oss.clusterlabs.org > Sent: Thursday, April 19, 2012 4:06:33 AM > Subject: Re: [Pacemaker] Periodically appear non-existent nodes > > 19.04.2012 11:24, Andreas Kurz wrote: > > On 04/18/2012 11:46 PM, ruslan usifov wrote: >

Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-19 Thread David Vossel
- Original Message - > From: "ruslan usifov" > To: "The Pacemaker cluster resource manager" > Sent: Tuesday, April 17, 2012 6:46:00 AM > Subject: Re: [Pacemaker] Periodically appear non-existent nodes > > > 2012/4/17 Andreas Kurz < andr...@hastexo.com > > > > > > On 04/14/2012 11:14

[Pacemaker] Convenience Groups - WAS Re: [Linux-HA] Unordered groups (was Re: Is 'resource_set' still experimental?)

2012-04-19 Thread Alan Robertson
Hi Andrew, I'm currently working on a fairly large cluster with lots of resources related to attached hardware. There are 59 of these things and 24 of those things and so on and each of them has its own resource to deal with the "things". They are not clones, and can't easily be made cl

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Kulovits Christian - OS ITSC
>You want pacemaker to ignore monitor errors on all unknown return values >and go on with monitoring until a resource "heals" itself? Definitely not. I do not want pacemaker to ignore all unknown return values. I always thought that pacemaker is a tool for HA. > please rethink ... it is a r

Re: [Pacemaker] start/stop operations fail to happen in parallel on resources

2012-04-19 Thread David Vossel
- Original Message - > From: "Parshvi" > To: pacema...@clusterlabs.org > Sent: Thursday, April 19, 2012 6:22:01 AM > Subject: [Pacemaker] start/stop operations fail to happen in parallel on > resources > > Observations: > max-children=30 > total no. of resources=18 > > 1) At a defa

Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Dan Frincu
On Thu, Apr 19, 2012 at 4:14 PM, Parshvi wrote: > Dan Frincu writes: > >> >> Hi, >> >> On Thu, Apr 19, 2012 at 2:11 PM, Parshvi wrote: >> > Major issues: >> > 1) Corosync reaching over 100% cpu usage. >> > 2) Corosync unable to stop gracefully. >> > 3) Virtual IP of a resource being

Re: [Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-19 Thread Dan Frincu
Hi, On Thu, Apr 19, 2012 at 3:56 PM, Parshvi wrote: > 1) What is the use of ssh without pass key between cluster nodes in pacemaker > ? >  a. Use case: >    i. Two nodes in a cluster (Call them Node-1 and Node-2) >    ii. One interface configured in corosync.conf for its heartbeat or > messaging

Re: [Pacemaker] start/stop operations fail to happen in parallel on resources

2012-04-19 Thread Parshvi
Dan Frincu writes: > > Hi, > > >  c. SOLUTION: the max-children of lrmd was raised to 30. > >  d. ISSUES STILL OBSERVED: while 2-3 resources are stuck in start operation, > > if a rsc is issued an explicit start command `crm resource start rcs1`, > > then the > > start op on this rsc is delay

Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Parshvi
Parshvi writes: > > Dan Frincu writes: > > Can you pastebin.com your crm configure show? > Please follow the link below for the output of `crm configure show`: http://pastebin.com/gd5ccALs ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Parshvi
Dan Frincu writes: > > Hi, > > On Thu, Apr 19, 2012 at 2:11 PM, Parshvi wrote: > > Major issues: > > 1) Corosync reaching over 100% cpu usage. > > 2) Corosync unable to stop gracefully. > > 3) Virtual IP of a resource being assigned as the primary IP on an interface, > > after a ca

Re: [Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Dan Frincu
Hi, On Thu, Apr 19, 2012 at 2:11 PM, Parshvi wrote: > Major issues: > 1) Corosync reaching over 100% cpu usage. > 2) Corosync unable to stop gracefully. > 3) Virtual IP of a resource being assigned as the primary IP on an interface, > after a cable disconnect/reconnect on that interface. The stat

Re: [Pacemaker] start/stop operations fail to happen in parallel on resources

2012-04-19 Thread Dan Frincu
Hi, On Thu, Apr 19, 2012 at 2:22 PM, Parshvi wrote: > Observations: > max-children=30 > total no. of resources=18 > > 1) With max-children at its default value of 4, the following logs were observed > that led to monitor op timeouts for some resources (a total of 18 rscs): > a. “max_child_count (4) reache

[Pacemaker] crm_mon on Node-2 shows both Node-1 & Node-2 as online but crm_mon on Node-1 shows Node-2 as offline

2012-04-19 Thread Parshvi
1) What is the use of ssh without pass key between cluster nodes in pacemaker ? a. Use case: i. Two nodes in a cluster (Call them Node-1 and Node-2) ii. One interface configured in corosync.conf for its heartbeat or messaging. Eg. Bind net addr : 192.168.10.0 iii. Another interface c

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Andreas Kurz
On 04/19/2012 01:59 PM, Kulovits Christian - OS ITSC wrote: > Hi Andreas, > This is exactly what I want pacemaker to do when my RA is not able to > determine the resource's state. But without running into timeout and restart. > It's the method to display the resource's state that is unavailable no

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Kulovits Christian - OS ITSC
Hi Andreas, This is exactly what I want pacemaker to do when my RA is not able to determine the resource's state. But without running into timeout and restart. It's the method to display the resource's state that is unavailable, not the resource itself. This typical approach must be coded in eve

Re: [Pacemaker] Are there known issues of having two cluster services in the same network (pacemaker and ocfs2)?

2012-04-19 Thread Parshvi
Parshvi writes: Issues observed at our end: > 1) CIB (all files under /var/lib/heartbeat/crm) getting deleted/shadowed, > when the communication link between the two nodes is broken. This is not always reproducible and we have observed it 3-4 times.

Re: [Pacemaker] Are there known issues of having two cluster services in the same network (pacemaker and ocfs2)?

2012-04-19 Thread Parshvi
Parshvi writes: > Issues observed at our end: > 1) CIB (all files under /var/lib/heartbeat/crm) getting deleted/shadowed, > when the communication link between the two nodes is broken. This is not always reproducible and we have observed it 3-4 times.

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Andreas Kurz
Hi Christian, On 04/19/2012 01:38 PM, Kulovits Christian - OS ITSC wrote: > Hi, Andreas > > What if the RA gets a response from an external command in the form: "display > currently unavailable, try later". The RA has 3 possible states available, > "Running", "Not Running", "Failed". But in thi

[Pacemaker] "init: Id crm: respawning too fast: disabled for 5 minutes"

2012-04-19 Thread Parshvi
“init: Id crm : respawning too fast: disabled for 5 minutes” The following entry has been made in /etc/inittab for snmp: crm:2345:respawn:/usr/sbin/crm_mon --daemonize -S 192.168.127.1 Query: Why and when is this log observed?

[Pacemaker] Are there known issues of having two cluster services in the same network (pacemaker and ocfs2)?

2012-04-19 Thread Parshvi
Query: Are there known issues of having two cluster services in the same network ? Details of use case environment: 1) Ocfs2 and pacemaker being used as two cluster services in our system. 2) Ocfs2 is not configured under pacemaker. Although services configured in pacemaker use the shared storag

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Kulovits Christian - OS ITSC
Hi, Andreas What if the RA gets a response from an external command in the form: "display currently unavailable, try later". The RA has 3 possible states available, "Running", "Not Running", "Failed". But in this situation it would say "don't know". When I set "on-fail=ignore" this error will b
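
A minimal sketch of one way an agent could treat the "try later" answer, assuming the state query is cheap enough to repeat a few times inside the monitor timeout. It is written in Python only for illustration (resource agents are normally shell scripts). `probe`, the state strings, and the `retries`/`delay` parameters are hypothetical names, not part of any existing RA; the return codes are the standard OCF values.

```python
import time

# Standard OCF return codes used by monitor actions.
OCF_SUCCESS = 0        # resource is running
OCF_ERR_GENERIC = 1    # generic failure
OCF_NOT_RUNNING = 7    # resource is cleanly stopped


def monitor(probe, retries=3, delay=1.0, sleep=time.sleep):
    """Map a state query onto an OCF return code, retrying on "don't know".

    probe is a hypothetical callable that returns "running", "stopped",
    or "unavailable" (the "display currently unavailable, try later" case).
    """
    for attempt in range(retries):
        state = probe()
        if state == "running":
            return OCF_SUCCESS
        if state == "stopped":
            return OCF_NOT_RUNNING
        # State could not be determined: wait and ask again,
        # but do not sleep after the final attempt.
        if attempt < retries - 1:
            sleep(delay)
    # Still unknown after all attempts: report a failure to the cluster.
    return OCF_ERR_GENERIC
```

With `retries=3` and `delay=1.0` the monitor waits about two seconds at most before giving up, so the retry budget has to stay well below the timeout configured for the monitor operation.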

[Pacemaker] Are there known issues of having two cluster services in the same network (pacemaker and ocfs2)?

2012-04-19 Thread Parshvi
Query: Are there known issues of having two cluster services in the same network ? Details of use case environment: 1) Ocfs2 and pacemaker being used as two cluster services in our system. 2) Ocfs2 is not configured under pacemaker. Although services configured in pacemaker use the shared storage

[Pacemaker] a virtualDomain cannot be stopped

2012-04-19 Thread cherish
First I define a virtualdomain named test1: crm(live)configure# primitive test1 ocf:heartbeat:VirtualDomain \ > params config=/mnt/nfs/pacemaker_test/test1.xml migration_transport=tcp \ > op migrate_from interval=0 timeout=240s \ > op migrate_to interval=0 timeout=240s \ > op start interval=0 time

[Pacemaker] start/stop operations fail to happen in parallel on resources

2012-04-19 Thread Parshvi
Observations: max-children=30 total no. of resources=18 1) At a default value 4 of max-children, following logs were observed that led to monitor op’s timeout for some resources (a total of 18 rscs): a. “max_child_count (4) reached, postponing execution of operation monitor” b. “WARN: perform

[Pacemaker] Corosync service taking 100% cpu and is unable to stop gracefully

2012-04-19 Thread Parshvi
Major issues: 1) Corosync reaching over 100% cpu usage. 2) Corosync unable to stop gracefully. 3) Virtual IP of a resource being assigned as the primary IP on an interface, after a cable disconnect/reconnect on that interface. The static IP on the interface is shown as a global secondary IP. Use case

[Pacemaker] Subscription to pacemaker

2012-04-19 Thread Parshvi Srivastava
parsh...@gmail.com ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: htt

Re: [Pacemaker] LVM restarts after SLES upgrade

2012-04-19 Thread Frank Meier
Hi, I already opened a ticket at Novell, but there has been no response for 3 days. Kind regards, Frank Meier UNIX-Basis Hamm Reno Group GmbH Industriegebiet West | D-66987 Thaleischweiler-Fröschen T.+49(0)6334 444-8322 | F.+49(0)6334 444-8190 frank.me...@hr-group.de | www.reno.de

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Andreas Kurz
On 04/19/2012 11:35 AM, emmanuel segura wrote: > on-fail attribute Well, if you ignore a monitor failure you might as well disable monitoring completely. The correct way to deal with that problem is to fix the RA ... patches are always welcome ;-) Regards, Andreas -- Need help with Pacemaker? h

Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-19 Thread Andreas Kurz
On 04/19/2012 11:06 AM, Vladislav Bogdanov wrote: > 19.04.2012 11:24, Andreas Kurz wrote: >> On 04/18/2012 11:46 PM, ruslan usifov wrote: >>> >>> >>> 2012/4/18 Andreas Kurz <andr...@hastexo.com> >>> >>> On 04/17/2012 09:31 PM, ruslan usifov wrote: >>> > >>> > >>> > 2012/4/17

Re: [Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread emmanuel segura
on-fail attribute On 19 April 2012 11:29, Kulovits Christian - OS ITSC <christian.kulov...@austrian.com> wrote: > Hi, > > During a monitor activity for a SRDF Resource a temporary error occurred > and the resource agent cannot determine the state of the resource and > returned

[Pacemaker] OCF Resource agent monitor activity failed due to temporary error

2012-04-19 Thread Kulovits Christian - OS ITSC
Hi, During a monitor activity for a SRDF Resource a temporary error occurred and the resource agent could not determine the state of the resource and returned OCF_ERR_GENERIC. The cluster restarted the resource and all dependent resources as designed. Is there a way to say that this failed monitor

Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-19 Thread Vladislav Bogdanov
19.04.2012 11:24, Andreas Kurz wrote: > On 04/18/2012 11:46 PM, ruslan usifov wrote: >> >> >> 2012/4/18 Andreas Kurz <andr...@hastexo.com> >> >> On 04/17/2012 09:31 PM, ruslan usifov wrote: >> > >> > >> > 2012/4/17 Proskurin Kirill > >>

Re: [Pacemaker] LVM restarts after SLES upgrade

2012-04-19 Thread Lars Marowsky-Bree
On 2012-04-19T08:29:54, Frank Meier wrote: > Hi, > > I've installed a 2-Node Xen-Cluster with SLES 11 SP1. > > After an upgrade to SLES11 SP2 the cluster won't work as the old one. Can you report this to SUSE's support channel please? > Apr 15 22:01:42 xencluster2 lrmd: [7675]: WARN: clvm-xen

Re: [Pacemaker] Pacemaker Digest, Vol 53, Issue 42

2012-04-19 Thread Frank Meier
01:40 xencluster2 crmd: [7678]: info: do_pe_invoke: Query >> 984: >>>>>> Requesting the current CIB: S_POLICY_ENGINE >>>>>> Apr 15 22:01:40 xencluster2 corosync[7666]: [TOTEM ] Retransmit >> List: >>>>>> 2196 2197 >>>>>

Re: [Pacemaker] Periodically appear non-existent nodes

2012-04-19 Thread Andreas Kurz
On 04/18/2012 11:46 PM, ruslan usifov wrote: > > > 2012/4/18 Andreas Kurz <andr...@hastexo.com> > > On 04/17/2012 09:31 PM, ruslan usifov wrote: > > > > > > 2012/4/17 Proskurin Kirill > >

Re: [Pacemaker] new user with a question

2012-04-19 Thread Andreas Kurz
On 04/19/2012 12:38 AM, Sean Roe wrote: > I was planning on running the bacula-sd daemon on the openfiler pair. > That is why I was asking about setting up the bacula-sd daemon under > pacemaker. > > Our current setup has the two NFS servers set up in an active-backup > cluster, where there is a

Re: [Pacemaker] Pacemaker Digest, Vol 53, Issue 40

2012-04-19 Thread Frank Meier
o_last_failure_0 on xencluster1: unknown >>>> error (1) >>>> Apr 15 22:01:40 xencluster2 pengine: [7677]: notice: RecurringOp: Start >>>> recurring monitor (10s) for clvm-xenvg:0 on xenc

Re: [Pacemaker] Pacemaker Digest, Vol 53, Issue 38

2012-04-19 Thread Frank Meier
>> Initiating action 90: stop vm-virenscanner_stop_0 on xencluster1 >> Apr 15 22:01:40 xencluster2 crmd: [7678]: info: te_rsc_command: >> Initiating action 92: stop vm-deprepo_stop_0 on xencluster1 >> Apr 15 22:01:40 xencluster2 c