I have two nodes running heartbeat 3.0.5 and pacemaker 1.1.6 (both from the
linux-ha lucid PPA). They run 11 groups, each comprising an
ocf:heartbeat:IPaddr2, an ocf:heartbeat:SendArp and an ocf:heartbeat:MailTo
resource. There is also a MailTo resource configured for the overall cluster.
Despite all of this, every notification I receive looks identical:
    Heartbeat status change: Migrating resource away at Mon Aug 5 13:01:49 UTC 2013 from proxy2
    Command line was:
    /usr/lib/ocf/resource.d//heartbeat/MailTo stop
One major omission here is that it doesn't tell me which resource it migrated.
Is there some way of configuring the cluster itself to send notifications so
that I can remove the individual mailto resources?
Coincidentally (?), I've just started to get this problem:
    Aug 5 11:13:50 proxy1 heartbeat: [2958]: ERROR: glib: ucast_write: Unable to send HBcomm packet eth0 192.168.1.10:694 len=78903 [-1]: Message too long
    Aug 5 11:13:50 proxy1 heartbeat: [2958]: ERROR: write_child: write failure on ucast eth0.: Message too long
This (or at least I assume it is this) is causing resources to disappear,
start and stop at random and flip-flop between nodes, nodes to be marked
offline, and more fun things to keep us awake at night.
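For what it's worth, the len=78903 in the log lines above is larger than the
biggest payload a single IPv4 UDP datagram can carry (65535 bytes minus the
20-byte IP header and the 8-byte UDP header), and "Message too long" is the
usual text for EMSGSIZE. That would be consistent with heartbeat trying to
push one oversized message through its ucast socket, though that reading is
my assumption and not something the logs state. A minimal Python sketch of
the arithmetic follows; the function names are made up for illustration, and
try_send only talks to the local discard port:

```python
import errno
import socket

# The IPv4 total-length field is 16 bits wide; subtract the minimum IP header
# and the UDP header to get the largest possible UDP payload.
MAX_IPV4_PACKET = 65535
IPV4_HEADER = 20
UDP_HEADER = 8
MAX_UDP_PAYLOAD = MAX_IPV4_PACKET - IPV4_HEADER - UDP_HEADER  # 65507


def fits_in_udp_datagram(length: int) -> bool:
    """Return True if a payload of `length` bytes fits in one IPv4 UDP datagram."""
    return 0 <= length <= MAX_UDP_PAYLOAD


def try_send(length: int, host: str = "127.0.0.1", port: int = 9) -> int:
    """Try to send `length` zero bytes over UDP.

    Returns 0 on success, or the errno reported by the operating system.
    The host and port defaults are placeholders, not cluster addresses.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(b"\0" * length, (host, port))
        except OSError as exc:
            return exc.errno
    return 0


if __name__ == "__main__":
    reported = 78903  # the len= value from the heartbeat log
    print("fits in one datagram:", fits_in_udp_datagram(reported))
    print("bytes over the limit:", reported - MAX_UDP_PAYLOAD)
    err = try_send(reported)
    print("send result:", errno.errorcode.get(err, err))
```

If that reading is right, anything that gets the message back under 65507
bytes (compression, or a smaller cib) should make the error go away.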
The only explanation I've found for this is here:
http://comments.gmane.org/gmane.linux.highavailability.pacemaker/10765
The suggested solutions are to alter the compression settings (we were not
using compression before), to migrate to corosync, and/or to make the CIB
smaller, hence the idea of removing the individual MailTo resources.
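To judge how much either compression or a smaller cib might help, it can be
useful to measure the cib dump itself. The Python sketch below is only a
rough gauge: it assumes `cibadmin --query` prints the cib as XML on stdout,
it uses Python's own zlib and bz2 modules as stand-ins for whatever
heartbeat's compression plugins do, and it ignores any framing heartbeat
adds around the message. The function names and the 65507-byte figure (the
largest IPv4 UDP payload) are my additions, not anything taken from the
cluster configuration.

```python
import bz2
import subprocess
import zlib

# Largest payload of a single IPv4 UDP datagram: 65535 - 20 (IP) - 8 (UDP).
MAX_UDP_PAYLOAD = 65507


def size_report(xml_text: str) -> dict:
    """Return raw and compressed sizes (in bytes) of a cib XML dump."""
    raw = xml_text.encode("utf-8")
    return {
        "raw": len(raw),
        "zlib": len(zlib.compress(raw)),
        "bz2": len(bz2.compress(raw)),
        "raw_fits": len(raw) <= MAX_UDP_PAYLOAD,
    }


def read_cib() -> str:
    """Dump the live cib; assumes cibadmin is on PATH on a cluster node."""
    result = subprocess.run(
        ["cibadmin", "--query"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    for name, value in size_report(read_cib()).items():
        print(f"{name}: {value}")
```

Running it before and after removing a group's MailTo primitive from a
saved copy of the cib would show how many bytes each one actually costs.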
I've run hb_report, but it doesn't say anything useful: more or less just
"it doesn't work".
I'd like to migrate to corosync if it's better, but I'm extremely wary of
touching anything in the cluster.
Marcus
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems