I've spent all day working on this, even going so far as to completely
build my own set of packages from the Debian-available ones (which
appear to be different from the Ubuntu-available ones). It didn't have
any effect on the issue at all: the cluster still freaks out and
becomes a split-brain after a single SIGQUIT.
The Debian packages that also demonstrate this behavior were the
following versions:
cluster-glue_1.0+hg20090915-1~bpo50+1_i386.deb
corosync_1.0.0-5~bpo50+1_i386.deb
libcorosync4_1.0.0-5~bpo50+1_i386.deb
libopenais3_1.0.0-4~bpo50+1_i386.deb
openais_1.0.0-4~bpo50+1_i386.deb
pacemaker-openais_1.0.5+hg20090915-1~bpo50+1_i386.deb
These packages were rebuilt (under Ubuntu Hardy Heron LTS) from the
*.diff.gz, *.dsc, and *.orig.tar.gz files available at
http://people.debian.org/~madkiss/ha-corosync, and as I said the
symptoms remain exactly the same, both under the configuration that I
list below and under the sample configuration that came with these
packages. I also attempted the same with a single IP address resource
associated with the cluster, just to be sure it wasn't an edge case for
a cluster with no resources, but again that had no effect.
Basically I'm still exactly at the point that I was at yesterday
morning at about 0900.
Remi Broemeling wrote:
I posted this to the OpenAIS Mailing List
(open...@lists.linux-foundation.org) yesterday, but haven't received a
response and upon further reflection I think that maybe I chose the
wrong list to post it to. That list seems to be far less about user
support and far more about developer communication. Therefore
re-trying here, as the archives show it to be somewhat more
user-focused.
The problem is that I'm having an issue with corosync refusing to shut
down in response to a QUIT signal. Given the below cluster (output of
crm_mon):
============
Last updated: Wed Sep 23 15:56:24 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 boot2 ]
If I go onto the host 'boot2', and issue the command "killall -QUIT
corosync", the anticipated result would be that boot2 would go offline
(out of the cluster), and all of the cluster processes
(corosync/stonithd/cib/lrmd/attrd/pengine/crmd) would shut down.
However, this is not occurring, and I don't really have any idea why.
After logging into boot2, and issuing the command "killall -QUIT
corosync", the result is a split-brain:
From boot1's viewpoint:
============
Last updated: Wed Sep 23 15:58:27 2009
Stack: openais
Current DC: boot1 - partition WITHOUT quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 ]
OFFLINE: [ boot2 ]
From boot2's viewpoint:
============
Last updated: Wed Sep 23 15:58:35 2009
Stack: openais
Current DC: boot1 - partition with quorum
Version: 1.0.5-3840e6b5a305ccb803d29b468556739e75532d56
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Online: [ boot1 boot2 ]
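To make the disagreement between the two views explicit, here is a minimal Python sketch that parses crm_mon text in the format shown above and compares the membership each node reports. The helper names `parse_crm_mon` and `views_disagree` are hypothetical (they are not part of Pacemaker), and the parsing assumes exactly the "Current DC:", "Online:" and "OFFLINE:" line formats seen in this output.

```python
import re


def parse_crm_mon(text):
    """Extract quorum state and node lists from crm_mon text output.

    Assumes the line formats shown in this post; other crm_mon
    versions may print something different.
    """
    view = {"quorum": None, "online": set(), "offline": set()}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Current DC:"):
            # "partition WITHOUT quorum" vs. "partition with quorum"
            view["quorum"] = "WITHOUT quorum" not in line
        match = re.match(r"(Online|OFFLINE): \[(.*)\]", line)
        if match:
            view[match.group(1).lower()] = set(match.group(2).split())
    return view


def views_disagree(view_a, view_b):
    """True when two nodes report different online membership."""
    return view_a["online"] != view_b["online"]


# Abbreviated copies of the two outputs above.
boot1_view = parse_crm_mon(
    "Current DC: boot1 - partition WITHOUT quorum\n"
    "Online: [ boot1 ]\n"
    "OFFLINE: [ boot2 ]\n"
)
boot2_view = parse_crm_mon(
    "Current DC: boot1 - partition with quorum\n"
    "Online: [ boot1 boot2 ]\n"
)
print(views_disagree(boot1_view, boot2_view))
```

With the two outputs above, boot1 reports only itself online while boot2 still reports both nodes, so the views disagree.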
At this point the status quo holds until ANOTHER QUIT signal is sent to
corosync (i.e. the command "killall -QUIT corosync" is executed on boot2
again). Then boot2 shuts down properly and everything appears to be
kosher. Basically, what I expect to happen after a single QUIT signal is
instead taking two QUIT signals to occur, and that summarizes my
question: why does it take two QUIT signals to force corosync to
actually shut down? Is that desired behavior? From
everything online that I have read it seems to be very strange, and it
makes me think that I have a problem in my configuration(s), but I've
no idea what that would be even after playing with things and
investigating for the day.
I would be very grateful for any guidance that could be provided, as at
the moment I seem to be at an impasse.
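For anyone who wants to reproduce the "two signals" behavior in a repeatable way, the following is a rough Python sketch that sends SIGQUIT to a process and counts how many signals it takes before the process is gone. The function names `is_running` and `quits_needed`, the 10-second wait and the limit of three signals are all assumptions made for illustration; nothing here comes from corosync itself. Note that a process started by the same script stays visible as a zombie until it is reaped, so this is only meaningful for a daemon such as corosync that was started elsewhere.

```python
import os
import signal
import time


def is_running(pid):
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def quits_needed(pid, max_signals=3, wait_seconds=10.0):
    """Send SIGQUIT up to max_signals times; return how many were needed.

    Returns None if the process is still alive after the last wait.
    """
    for count in range(1, max_signals + 1):
        try:
            os.kill(pid, signal.SIGQUIT)
        except ProcessLookupError:
            # The process vanished just before this signal was sent.
            return count - 1
        deadline = time.monotonic() + wait_seconds
        while time.monotonic() < deadline:
            if not is_running(pid):
                return count
            time.sleep(0.2)
    return None
```

Called as root with the PID of corosync on boot2, a return value of 1 would be the behavior I expect, while 2 would match what I am actually seeing.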
Log files, with debugging set to 'on', can be found at the following
pastebin locations:
After first QUIT signal issued on boot2:
boot1:/var/log/syslog: http://pastebin.com/m7f9a61fd
boot2:/var/log/syslog: http://pastebin.com/d26fdfee
After second QUIT signal issued on boot2:
boot1:/var/log/syslog: http://pastebin.com/m755fb989
boot2:/var/log/syslog: http://pastebin.com/m22dcef45
OS, Software Packages, and Versions:
* two nodes, each running Ubuntu Hardy Heron LTS
* ubuntu-ha packages, as downloaded from
  http://ppa.launchpad.net/ubuntu-ha-maintainers/ppa/ubuntu/:
    * pacemaker-openais package version
      1.0.5+hg20090813-0ubuntu2~hardy1
    * openais package version 1.0.0-3ubuntu1~hardy1
    * corosync package version 1.0.0-4ubuntu1~hardy2
    * heartbeat-common package version
      heartbeat-common_2.99.2+sles11r9-5ubuntu1~hardy1
Network Setup:
* boot1
    * eth0 is 192.168.10.192
    * eth1 is 172.16.1.1
* boot2
    * eth0 is 192.168.10.193
    * eth1 is 172.16.1.2
* boot1:eth0 and boot2:eth0 both connect to the same switch.
* boot1:eth1 and boot2:eth1 are connected directly to each other
  via a cross-over cable.
* no firewalls are involved, and tcpdump shows the multicast and
  UDP traffic flowing correctly over these links.
* I attempted a broadcast (rather than multicast) configuration, to
  see if that would fix the problem. It did not.
`crm configure show` output:
node boot1
node boot2
property $id="cib-bootstrap-options" \
    dc-version="1.0.5-3840e6b5a305ccb803d29b468556739e75532d56" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
Contents of /etc/corosync/corosync.conf:
# Please read the corosync.conf.5 manual page
compatibility: whitetank
totem {
    clear_node_high_bit: yes
    version: 2
    secauth: on
    threads: 1
    heartbeat_failures_allowed: 3
    interface {
        ringnumber: 0
        bindnetaddr: 172.16.1.0
        mcastaddr: 239.42.0.1
        mcastport: 5505
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.10.0
        mcastaddr: 239.42.0.2
        mcastport: 6606
    }
    rrp_mode: passive
}
amf {
    mode: disabled
}
service {
    name: pacemaker
    ver: 0
}
aisexec {
    user: root
    group: root
}
logging {
    debug: on
    fileline: off
    function_name: off
    to_logfile: no
    to_stderr: no
    to_syslog: yes
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
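As a sanity check on the two interface blocks in the totem section, the bindnetaddr values should be the network addresses of the interfaces listed under Network Setup. The sketch below derives them with Python's standard ipaddress module. The /24 prefix length is an assumption (the netmasks are not stated anywhere in this post), and `bindnetaddr` is just a hypothetical helper name.

```python
import ipaddress


def bindnetaddr(ip, prefix_len=24):
    """Return the network address for an interface IP.

    prefix_len=24 is an assumed netmask, not taken from the hosts.
    """
    iface = ipaddress.ip_interface(f"{ip}/{prefix_len}")
    return str(iface.network.network_address)


# Interface addresses from the Network Setup list.
for host, ip in [
    ("boot1:eth1", "172.16.1.1"),
    ("boot2:eth1", "172.16.1.2"),
    ("boot1:eth0", "192.168.10.192"),
    ("boot2:eth0", "192.168.10.193"),
]:
    print(host, bindnetaddr(ip))
```

Under that assumption, both eth1 addresses map to 172.16.1.0 (ring 0) and both eth0 addresses map to 192.168.10.0 (ring 1), which matches the configuration.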
--
Remi Broemeling
Sr System Administrator
Nexopia.com Inc.
On going to war over religion: "You're basically
killing each other to see who's got the better imaginary friend."
Rich Jeni