Re: [Pacemaker] Building a Corosync 1.4.1 RPM package for SLES11 SP1

2011-09-02 Thread Sebastian Kaps
Hi, On Thu, 01 Sep 2011 09:42:11 -0700, Steven Dake wrote: Thanks for pointing out this problem with the build tools for corosync. nss should be conditionalized. This would allow rpmbuild --with-nss or rpmbuild --without-nss from the default rpm builds. I would send a patch to the openais …
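
A conditional of this kind is usually expressed with the %bcond idiom in the spec file. A minimal sketch, not the actual patch that was sent; the configure flag name is an assumption:

  # in corosync.spec: nss on by default, "rpmbuild --without nss" disables it
  %bcond_without nss

  %if %{with nss}
  BuildRequires: nss-devel
  %endif

  # flag name assumed; match whatever corosync's configure actually accepts
  %configure %{?with_nss:--enable-nss}%{!?with_nss:--disable-nss}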

[Pacemaker] Building a Corosync 1.4.1 RPM package for SLES11 SP1

2011-09-01 Thread Sebastian Kaps
Hi, I'm trying to compile Corosync v1.4.1 from source[1] and create an RPM x86_64 package for SLES11 SP1. When running "make rpm", the build process complains about a broken dependency on the nss-devel package. That package is not installed on the system; mozilla-nss (non-devel), however, is.
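
On SLES the NSS headers ship in a package named after mozilla-nss rather than nss. Assuming the SP1 repositories follow that naming (package name not confirmed in the thread), installing the devel package should satisfy the build dependency:

  # package name assumed from SLES naming conventions (SDK repository)
  zypper install mozilla-nss-devel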

[Pacemaker] crm resource status and HAWK display differ after manually mounting filesystem resource

2011-08-28 Thread Sebastian Kaps
Hi, on our two-node cluster (SLES11-SP1+HAE; corosync 1.3.1, pacemaker 1.1.5) we have defined the following FS resource and its corresponding clone:

  primitive p_fs_wwwdata ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" \
    directory="/mnt/wwwdata" fstype="ocfs2" \
    op …
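
For context, a definition of this shape typically ends with a monitor operation and a clone so the OCFS2 filesystem is mounted on both nodes. Everything past "op" below is assumed, not taken from the original post:

  primitive p_fs_wwwdata ocf:heartbeat:Filesystem \
    params device="/dev/drbd1" \
    directory="/mnt/wwwdata" fstype="ocfs2" \
    op monitor interval="20" timeout="40"
  # clone name and meta attributes assumed
  clone cl_fs_wwwdata p_fs_wwwdata \
    meta interleave="true"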

Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-13 Thread Sebastian Kaps
Hi, On 12.08.2011, at 12:19, Vladislav Bogdanov wrote: >>> http://marc.info/?l=openais&m=130989380207300&w=2 > > I have not seen any corosync pauses since applying it (right after it was posted). Although I was on vacation for two weeks, all the other time I … Thanks for the info! I hope it will fix …

Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-12 Thread Sebastian Kaps
Hi Steven, On 12.08.2011, at 02:11, Steven Dake wrote: >> We've had another one of these this morning: >> "Process pause detected for 11763 ms, flushing membership messages." >> According to the graphs that are generated from Nagios data, the load on that system >> jumped from 1.0 to 5.1 ca. …

Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-11 Thread Sebastian Kaps
Hi, On 04.08.2011, at 18:21, Steven Dake wrote: >> Jul 31 03:51:02 node01 corosync[5870]: [TOTEM ] Process pause detected >> for 11149 ms, flushing membership messages. > > This process pause message indicates the scheduler doesn't schedule > corosync for 11 seconds, which is greater than the failure detection timeouts …
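
The failure detection timeout referred to here is the totem token timeout in corosync.conf. Raising it gives the scheduler more slack, though an 11-second stall would defeat any sane value; a sketch with assumed numbers, not a setting recommended in the thread:

  totem {
      version: 2
      # declare token loss only after 10 s instead of the default;
      # value assumed for illustration - an 11 s pause still exceeds it
      token: 10000
      token_retransmits_before_loss_const: 10
  }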

Re: [Pacemaker] Backup ring is marked faulty

2011-08-08 Thread Sebastian Kaps
Hi Steven, On 07.08.2011, at 18:44, Steven Dake wrote: > If a ring is marked faulty, it is no longer operational and there is no > longer a redundant network. Ok, but compared to a seemingly operational backup ring that apparently ultimately causes the cluster nodes to shoot each other for no …

Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps
Hi Steven, On 04.08.2011, at 20:59, Steven Dake wrote: > meaning the corosync community doesn't investigate redundant ring issues > prior to corosync version 1.4.1. Sadly, we need to use the SLES version for support reasons. I'll try to convince them to supply us with a fix for this problem. I …

Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps
Hi Steven, On 04.08.2011, at 18:27, Steven Dake wrote: > redundant ring is only supported upstream in corosync 1.4.1 or later. What does "supported" mean in this context, exactly? I'm asking because we've been having serious issues with these systems since they went into production (the testing p…

Re: [Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Sebastian Kaps
Hi Steven, thanks for looking into this! > This process pause message indicates the scheduler doesn't schedule > corosync for 11 seconds, which is greater than the failure detection > timeouts. What does your config file look like? What load are you running? The load at that point in time was around …

Re: [Pacemaker] Backup ring is marked faulty

2011-08-04 Thread Sebastian Kaps
Hello Martin, On Thu, 4 Aug 2011 08:31:07 +0200, Tegtmeier.Martin wrote: In my case it is always the slower ring that fails (the 100 Mbit network). Does rrp_mode passive expect both rings to have the same speed? Sebastian, can you confirm that in your environment the slower ring also fails? I c…

[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

2011-08-04 Thread Sebastian Kaps
Hello, here's another problem we're having:

  Jul 31 03:51:02 node01 corosync[5870]: [TOTEM ] Process pause detected for 11149 ms, flushing membership messages.
  Jul 31 03:51:11 node01 corosync[5870]: [CLM ] CLM CONFIGURATION CHANGE
  Jul 31 03:51:11 node01 corosync[5870]: [CLM ] New Config …

Re: [Pacemaker] Backup ring is marked faulty

2011-08-03 Thread Sebastian Kaps
Hi Steven! On Tue, 02 Aug 2011 17:45:46 -0700, Steven Dake wrote: Which version of corosync?

  # corosync -v
  Corosync Cluster Engine, version '1.3.1'
  Copyright (c) 2006-2009 Red Hat, Inc.

It's the version that comes with SLES11-SP1-HA. -- Sebastian

[Pacemaker] Backup ring is marked faulty

2011-08-02 Thread Sebastian Kaps
Hi, we're running a two-node cluster with redundant rings. Ring 0 is a 10 Gbit direct connection; ring 1 consists of two 1 Gbit interfaces that are bonded in active-backup mode and routed through two independent switches for each node. The ring 1 network is our "normal" 1 Gbit LAN and should only be used …
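
A two-ring layout like this maps onto two interface blocks in corosync.conf. A sketch with placeholder addresses; the rrp_mode is taken from the passive-mode discussion later in this thread, and nothing below is quoted from the original post:

  totem {
      version: 2
      rrp_mode: passive                # passive mode discussed in the follow-ups
      interface {
          ringnumber: 0
          bindnetaddr: 10.0.0.0        # 10 Gbit direct link (placeholder)
          mcastaddr: 239.255.1.1
          mcastport: 5405
      }
      interface {
          ringnumber: 1
          bindnetaddr: 192.168.1.0     # bonded 1 Gbit LAN (placeholder)
          mcastaddr: 239.255.2.1
          mcastport: 5405
      }
  }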

Re: [Pacemaker] DRBD monitor time out in high I/O situations

2011-07-16 Thread Sebastian Kaps
Hi! On 12.07.2011, at 12:05, Lars Marowsky-Bree wrote: [unexplained, sporadic monitor timeouts] drbd's monitor operation is not that heavy-weight; I can't immediately see why the I/O load on the file system it hosts should affect it so badly. Contrary to my first assumption, the problem does …

[Pacemaker] Resource Agent for Cron Jobs

2011-07-15 Thread Sebastian Kaps
Hi! I'm looking for a way to run a certain subset of cron jobs only on the active node of our cluster. It seems the best way to achieve this would be to use different crontab files and to switch them depending on the node's state (i.e. active/standby). I've found an announcement for a "cronjob …
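
One lightweight way to get this behavior without a dedicated agent is a small init-style script that the cluster manages like any other resource: it drops a crontab fragment into /etc/cron.d on start and removes it on stop. All paths and names below are made up for illustration:

  #!/bin/sh
  # cluster-cron: enable a set of cron jobs only where this "resource" runs.
  # cron rescans /etc/cron.d on its own, so copying/removing the file suffices.
  CRON_SRC=/etc/ha.d/cluster-cronjobs    # hypothetical master copy (cron.d syntax)
  CRON_DST=/etc/cron.d/cluster-cronjobs

  case "$1" in
      start)  cp "$CRON_SRC" "$CRON_DST" ;;
      stop)   rm -f "$CRON_DST" ;;
      status) [ -f "$CRON_DST" ] && exit 0 || exit 3 ;;   # LSB: 0=running, 3=stopped
      *)      echo "Usage: $0 {start|stop|status}" >&2; exit 1 ;;
  esac

Added as an lsb: resource and grouped or colocated with the active node's resources, the cron jobs then follow failovers automatically.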

[Pacemaker] DRBD monitor time out in high I/O situations

2011-07-12 Thread Sebastian Kaps
…onitor_2, magic=2:-2;15:11:8:6f0304c9-522b-4582-a26b-cffe24afe9e2, cib=0.349.10) : Old event
Jul 11 11:07:37 node01 crmd: [25014]: WARN: update_failcount: Updating failcount for p_drbd_wwwdata:0 on node01 after failed monitor: rc=-2 (update=value++, time=1310375257) - snip
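
If the timeouts only show up under heavy I/O, a common first mitigation is giving the monitor operation more headroom on the DRBD resource. A crm sketch: the resource name comes from the log above, but every parameter and value is assumed:

  # raise the monitor timeouts on the DRBD primitive (all values assumed)
  primitive p_drbd_wwwdata ocf:linbit:drbd \
    params drbd_resource="wwwdata" \
    op monitor interval="15" role="Master" timeout="60" \
    op monitor interval="30" role="Slave" timeout="60"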