On Fri, May 07, 2010 at 03:03:39PM +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Fri, May 07, 2010 at 12:35:59PM +0200, Fabian Ruff wrote:
> > Hi,
> >
> > I'm currently testing a 2-node HA-Firewall with pacemaker+corosync
> > on Debian Lenny.
> > I used the latest package from the madkiss repo for the setup
> > (corosync 1.2.0, pacemaker 1.0.8).
> >
> > I will spare you all the verbose config for now and just give you an
> > overview of the resource configuration:
> >
> > >gwa:~# crm_mon -1
> > >============
> > >Last updated: Fri May 7 12:10:19 2010
> > >Stack: openais
> > >Current DC: gwb - partition with quorum
> > >Version: 1.0.8-2c98138c2f070fcb6ddeab1084154cffbf44ba75
> > >2 Nodes configured, 2 expected votes
> > >6 Resources configured.
> > >============
> > >
> > >Online: [ gwa gwb ]
> > >
> > > Master/Slave Set: drbd_disk
> > >     Masters: [ gwa ]
> > >     Slaves: [ gwb ]
> > > Clone Set: connectivity
> > >     Started: [ gwb gwa ]
> > > fencing_gwa (stonith:external/ipmi): Started gwb
> > > fencing_gwb (stonith:external/ipmi): Started gwa
> > > Resource Group: ips
> > >     ip_outside (ocf::heartbeat:IPaddr2): Started gwa
> > >     ip_backup (ocf::heartbeat:IPaddr2): Started gwa
> > >     ip_secure (ocf::heartbeat:IPaddr2): Started gwa
> > >     ip_inside (ocf::heartbeat:IPaddr2): Started gwa
> > >     ip_staging (ocf::heartbeat:IPaddr2): Started gwa
> > >     firewall (lsb:firewall): Started gwa
> > > Resource Group: services
> > >     filesystem (ocf::heartbeat:Filesystem): Started gwa
> > >     openvpn (lsb:openvpn-cluster): Started gwa
> > >     dnsmasq (lsb:dnsmasq): Started gwa
> >
> >
> > The cluster was running fairly stable for the past couple of weeks.
> >
> > But then yesterday, without any user interaction and while idle, the
> > active node (gwa) failed and was subsequently stonithed by the
> > passive one (gwb) due to a strange error (at least to me) on almost
> > all resource agents:
> >
> > >gwa:~# grep -i error /var/log/syslog-20100507
> > >May 6 14:13:23 gwa lrmd: [27931]: ERROR: (raexecocf.c:execra:178) execl
> > >failed for /usr/lib/ocf/resource.d//heartbeat/IPaddr2: Argument list too
> > >long

man execve:
    E2BIG  The total number of bytes in the environment (envp) and
           argument list (argv) is too large.

The line (raexecocf.c:execra:178) is
    execl(ra_pathname, ra_pathname, op_type, (const char *)NULL);

So it is NOT the argument list, even though perror seems to think
that's the more likely cause for this error,
unless "op_type" happens to be an unterminated multi-kB string somehow
(we know what ra_pathname is from the perror message).

Does lrmd accumulate setenv() calls somehow?
Or did crmd send too many parameters?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
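
The E2BIG reasoning in this message can be checked with a small experiment. What follows is only a minimal sketch, not lrmd code: it assumes a POSIX system whose combined argv+envp limit is well below 16 MiB (which should hold on typical Linux setups), and the helper name exec_errno and the HYPOTHETICAL_VAR_* variables are made up for illustration. The argv stays short and identical in both calls; only the environment grows.

```python
import errno
import os
import subprocess
import sys


def exec_errno(env):
    """Exec a trivial child with a short, fixed argv; return 0 or the exec errno."""
    argv = [sys.executable, "-c", "pass"]  # argv never changes between calls
    try:
        subprocess.run(argv, env=env)
    except OSError as exc:
        # An exec failure in the child is re-raised as OSError in the parent.
        return exc.errno
    return 0


# Baseline: a copy of the current environment, known to be small enough.
baseline_env = dict(os.environ)

# Hypothetical bloat: 4096 variables of 4 KiB each, roughly 16 MiB in total.
bloated_env = dict(baseline_env)
for i in range(4096):
    bloated_env[f"HYPOTHETICAL_VAR_{i}"] = "x" * 4096

if __name__ == "__main__":
    print("baseline:", exec_errno(baseline_env))
    rc = exec_errno(bloated_env)
    print("bloated: ", rc, errno.errorcode.get(rc, "OK"))
```

If the assumption about the limit holds, the second call should fail with E2BIG ("Argument list too long") even though its argument list is the same three short strings as in the first call. That would be consistent with an oversized environment, rather than argv, producing the message seen in the lrmd log.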