Re: [Pacemaker] Enable remote monitoring

2012-11-07 Thread Gao,Yan
Hi Andrew, On 11/08/12 13:09, Andrew Beekhof wrote: > On Tue, Nov 6, 2012 at 10:30 PM, Gao,Yan wrote: >> Hi, >> >> Currently, we can manage VMs via the VM agents. But the services running >> within VMs are not very easy to be monitored. If we could use >> nagios/icinga probes from the host to the

Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-07 Thread Tim Serong
On 11/08/2012 12:11 PM, Andrew Beekhof wrote: > On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor wrote: >> Follow-up and additional info: >> >> System is Ubuntu 12.04. Not sure where killproc is supposed to be derived >> from, or if there is an assumption for it to be a standalone binary or >> sc

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Andrew Martin
= ==5453== For counts of detected and suppressed errors, rerun with: -v ==5453== Use --track-origins=yes to see where uninitialised values come from ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2) Bus error (core dumped) I was also able to capture non-truncated fdata: htt

Re: [Pacemaker] Enable remote monitoring

2012-11-07 Thread Andrew Beekhof
On Tue, Nov 6, 2012 at 11:59 PM, Lars Marowsky-Bree wrote: > On 2012-11-06T19:30:20, "Gao,Yan" wrote: > > Hi Yan, > > thanks for proposing this. > > Let me try to add - > > The proposal has essentially three parts. > > First, like Yan said, a new resource agent class so that we can wrap > around

Re: [Pacemaker] Enable remote monitoring

2012-11-07 Thread Andrew Beekhof
On Tue, Nov 6, 2012 at 10:30 PM, Gao,Yan wrote: > Hi, > > Currently, we can manage VMs via the VM agents. But the services running > within VMs are not very easy to be monitored. If we could use > nagios/icinga probes from the host to the guest, that would allow us to > achieve this. > > Lars, Dej

Re: [Pacemaker] node is offline; can't bring online

2012-11-07 Thread Andrew Beekhof
On Thu, Nov 8, 2012 at 2:56 PM, Paul Archer wrote: > I don't really know when the trouble started. > I ended up restarting pacemaker on all nodes, and it cleared things > up. I'm not sure why, though. You /may/ have been experiencing a known membership issue in older versions of pacemaker and cor

Re: [Pacemaker] node is offline; can't bring online

2012-11-07 Thread Paul Archer
I don't really know when the trouble started. I ended up restarting pacemaker on all nodes, and it cleared things up. I'm not sure why, though. If I have the same issue come up, I'll run the crm_report and open a bug. Thanks, Paul On Wed, Nov 7, 2012 at 9:22 PM, Andrew Beekhof wrote: > On Thu,

Re: [Pacemaker] Build dlm_controld for pacemaker stack (dlm_controld.pcmk)

2012-11-07 Thread Andrew Beekhof
On Mon, Nov 5, 2012 at 5:33 PM, Vladislav Bogdanov wrote: > 05.11.2012 08:40, Andrew Beekhof wrote: >> On Fri, Nov 2, 2012 at 6:22 PM, Vladislav Bogdanov >> wrote: >>> 02.11.2012 02:05, Andrew Beekhof wrote: On Thu, Nov 1, 2012 at 5:09 PM, Vladislav Bogdanov wrote: > 01.11.2012 0

Re: [Pacemaker] node is offline; can't bring online

2012-11-07 Thread Andrew Beekhof
On Thu, Nov 8, 2012 at 1:55 PM, Paul Archer wrote: > I'm fairly new to pacemaker, and this is hurting my head. > I have a four-node cluster, and one of my nodes (for no reason that I > can discern) has gone offline, and I can't get it to come back online. > > Offline node: > root@vmhost2:/var/lib/

Re: [Pacemaker] After update from 1.1.7 to 1.1.8 I get - pacemakerd: get_cluster_type Pacemaker does not support the 'heartbeat' cluster infra..

2012-11-07 Thread Andrew Beekhof
On Thu, Nov 8, 2012 at 2:06 PM, Andrew Beekhof wrote: > The service directive that loads the pacemaker plugin appears to have been > lost. > Therefore Pacemaker is confused and thinks that you must be trying to > start it with Heartbeat. I've just made a change that will help Pacemaker log someth

Re: [Pacemaker] After update from 1.1.7 to 1.1.8 I get - pacemakerd: get_cluster_type Pacemaker does not support the 'heartbeat' cluster infra..

2012-11-07 Thread Andrew Beekhof
The service directive that loads the pacemaker plugin appears to have been lost. Therefore Pacemaker is confused and thinks that you must be trying to start it with Heartbeat. On Thu, Nov 8, 2012 at 1:38 PM, Jeff Johnson wrote: > Greetings, > > I had a running corosync/pacemaker two node configura

[Pacemaker] node is offline; can't bring online

2012-11-07 Thread Paul Archer
I'm fairly new to pacemaker, and this is hurting my head. I have a four-node cluster, and one of my nodes (for no reason that I can discern) has gone offline, and I can't get it to come back online. Offline node: root@vmhost2:/var/lib/heartbeat# crm_mon -1 Last updated: Wed Nov 7 20:

[Pacemaker] After update from 1.1.7 to 1.1.8 I get - pacemakerd: get_cluster_type Pacemaker does not support the 'heartbeat' cluster infra..

2012-11-07 Thread Jeff Johnson
Greetings, I had a running corosync/pacemaker two node configuration doing simple filesystem failover and stonith fencing. I decided to update to 1.1.8-4 after seeing some odd behavior and someone suggested a bug in 1.1.7-6 not playing nice with corosync 1.4.1-7. After updating my cluster will not

Re: [Pacemaker] RFC: Any interesting in 2.0.0 betas?

2012-11-07 Thread Andrew Beekhof
On Mon, Nov 5, 2012 at 7:26 PM, Vladislav Bogdanov wrote: > 05.11.2012 09:28, Andrew Beekhof wrote: > ... >>> But you can guess it, as admins usually name nodes the same way. If not >>> - that is problem of admins. >> >> No, its the problem of developers that get yelled at by admins :) > > :) > >>

Re: [Pacemaker] crm_resource command ignored?

2012-11-07 Thread Andrew Beekhof
On Thu, Nov 8, 2012 at 3:00 AM, King, Christopher wrote: > > No other crm_resource commands are being run at this time. Other cib > altering commands, using the "crm configure" interface are run in > sequence with the crm_resource command, but they succeed. > Hmmm. Can you try running with -

Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-07 Thread Andrew Beekhof
On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor wrote: > Follow-up and additional info: > > System is Ubuntu 12.04. Not sure where killproc is supposed to be derived > from, or if there is an assumption for it to be a standalone binary or > script. I did find it defined in /lib/lsb/init-functio

Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-07 Thread Matthew O'Connor
Follow-up and additional info: System is Ubuntu 12.04. Not sure where killproc is supposed to be derived from, or if there is an assumption for it to be a standalone binary or script. I did find it defined in /lib/lsb/init-functions. Adding a ". /lib/lsb/init-functions" to the start of the /usr

[Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-07 Thread Matthew O'Connor
Hi, Trying to get a new cluster running, and hitting a snag when bringing nodes offline. FWIW, this very well could be nothing, but for sake of making everything run "correctly..." I pulled the following out of the syslog: Nov 7 16:59:23 hv02 o2cb[6088]: INFO: Stopping p_o2cb:0 Nov 7 16:59:23

Re: [Pacemaker] pacemaker service start failed.

2012-11-07 Thread Andrew Beekhof
You'd have to ask the resource-agents maintainers. crm_master is already setup to use it. On Wed, Nov 7, 2012 at 10:14 PM, Yuusuke Iida wrote: >> You can use crm_node --name to get the same name that Pacemaker is using. > > I want "crm_node --name" to replace "uname -n" using in RA. > Is there th

Re: [Pacemaker] crm_resource command ignored?

2012-11-07 Thread King, Christopher
Date: Wed, 7 Nov 2012 16:08:37 +1100 From: Andrew Beekhof To: The Pacemaker cluster resource manager Subject: Re: [Pacemaker] crm_resource command ignored? Message-ID: Content-Type: text/plain; charset=windows-1252 On Tue, Nov 6, 2012 at 6:15 AM, King, Christopher wrote: >> He

Re: [Pacemaker] pacemaker service start failed.

2012-11-07 Thread Yuusuke Iida
Hi, Andrew (2012/11/05 14:33), Andrew Beekhof wrote: Because I do not know much about FQDNs, I do not have a good idea. > >I am worried about a different problem from this one. > >When the name that I get from "uname -n" is different from the name that I get >from name resolution, >A thing trea

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
Andrew, Andrew Martin napsal(a): > A bit more data on this problem: I was doing some maintenance and had to > briefly disconnect storagequorum's connection to the STONITH network > (ethernet cable #7 in this diagram): > http://sources.xes-inc.com/downloads/storagecluster.png > > > Since coro

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
Andrew, Andrew Martin napsal(a): > Hi Angus, > > > I recompiled corosync with the changes you suggested in exec/main.c to > generate fdata when SIGBUS is triggered. Here's the corresponding coredump > and fdata files: > http://sources.xes-inc.com/downloads/core.13027 > http://sources.xes-i