On 06/02/2014 05:52, Vladislav Bogdanov wrote:
Hi,
I bet your problem comes from the LSB clvmd init script.
Here is what it does:
===========
...
clustered_vgs() {
    ${lvm_vgdisplay} 2>/dev/null | \
        awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
}

clustered_active_lvs() {
    for i in $(clustered_vgs); do
        ${lvm_lvdisplay} $i 2>/dev/null | \
            awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
    done
}

rh_status() {
    status $DAEMON
}
...
case "$1" in
...
  status)
    rh_status
    rtrn=$?
    if [ $rtrn = 0 ]; then
        cvgs="$(clustered_vgs)"
        echo Clustered Volume Groups: ${cvgs:-"(none)"}
        clvs="$(clustered_active_lvs)"
        echo Active clustered Logical Volumes: ${clvs:-"(none)"}
    fi
...
esac
exit $rtrn
=========
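The awk idiom above is the key to the failure mode: `RS="VG Name"` splits the whole vgdisplay output into one record per volume group, and any record containing "Clustered" has the VG's name as its first field. A minimal sketch of just that parsing step, using hand-written sample text (the field values are illustrative, not real vgdisplay output; multi-character RS is a gawk/mawk extension the init script relies on):

```shell
# Hand-written vgdisplay-style sample: two VGs, only one clustered.
sample='  --- Volume group ---
  VG Name               vg_local
  Format                lvm2
  VG Status             resizable
  --- Volume group ---
  VG Name               vg_shared
  Format                lvm2
  VG Status             resizable
  Clustered             yes'

# Same idiom as the init script: one record per VG; if a record
# mentions "Clustered", its first field is the VG name.
clustered=$(printf '%s\n' "$sample" \
    | awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}')
echo "$clustered"
```

The important point is that this pipeline only works if `vgdisplay` itself returns, which it will not while DLM is blocked on fencing.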
So, it not only checks the status of the daemon itself, but also tries to
list volume groups. That operation is blocked because fencing is still in
progress, and the whole cLVM stack (as well as DLM itself and all other
dependent services) is frozen. So your resource times out in the monitor
operation, and then pacemaker asks it to stop (unless you have
on-fail=fence). Either way, there is a big chance that the stop will fail
too, which again leads to fencing. cLVM is very fragile in my opinion
(although newer versions running on the corosync2 stack seem to be much
better). And it probably still doesn't work well when managed by
pacemaker in CMAN-based clusters, because it blocks globally if any node
in the whole cluster is online at the cman layer but doesn't run clvmd
(I last checked with .99). That was the same for all stacks, until it was
fixed for the corosync (only 2?) stack recently. The problem with that is
that you cannot just stop pacemaker on one node (e.g. for maintenance);
you must immediately stop cman as well (or run clvmd the cman way), or
cLVM freezes on the other nodes. This should be easy to fix in the clvmd
code, but nobody cares.
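As a sketch of the shutdown ordering this implies (service names assumed for a CMAN-based CentOS/RHEL 6 node; a sketch, not a vetted procedure):

```shell
# Never leave a node cman-joined but clvmd-less, or cLVM freezes
# cluster-wide. Take the whole stack down together, top to bottom.
service pacemaker stop   # stop resource management first
service clvmd stop       # stop clvmd if it runs outside pacemaker
service cman stop        # then leave the cman membership right away
```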
Thanks for the explanation, this is interesting for me, as I need a
volume manager in the cluster to manage the shared file systems in case
I need to resize for some reason. I think I may be running into
something similar now that I am testing cman outside of the cluster.
Even though I have cman/clvmd enabled outside pacemaker, the clvmd
daemon still hangs even after the 2nd node has been rebooted by a fence
operation. When it (node 2) reboots, cman and clvmd start, and I can see
both nodes as members using cman_tool, but clvmd still seems to have an
issue: it just hangs. I can't see off-hand whether dlm still thinks
pacemaker is in the fence operation (or whether it has already returned
true for a successful fence). I am still gathering logs and will post
back to this thread once I have all my logs from yesterday and this
morning.
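For what it's worth, a few commands that may show where things are stuck at the DLM/cman layer (tool names from the cman/dlm_controld userland; exact output varies by version, so treat this as a sketch):

```shell
cman_tool nodes             # cman-level membership, as already checked
cman_tool status            # quorum/state at the cman layer
dlm_tool ls                 # lockspaces; a stuck one shows a pending change
dlm_tool dump | tail -n 50  # recent dlm_controld events, incl. fencing waits
```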
I don't suppose anyone is aware of another cluster-aware volume manager
that is available?
Increasing the timeout for the LSB clvmd resource probably won't help
you, because LVM operations blocked by DLM waiting on fencing IIRC
never finish.
You may want to search for the clvmd OCF resource agent; it is available
for SUSE, I think. Although it is not perfect, it should work much
better for you.
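If you do find that agent, the usual pattern is to clone it across the cluster. A hypothetical crm-shell sketch (the agent name `ocf:lvm2:clvmd` and the timeouts are assumptions; check `crm ra list ocf` on your distribution for what is actually installed):

```shell
crm configure primitive p-clvmd ocf:lvm2:clvmd \
    op monitor interval="30s" timeout="90s"
crm configure clone cl-clvmd p-clvmd meta interleave="true"
```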
I will have a look around for this clvmd OCF agent and see what is
involved in getting it to work on CentOS 6.5, if I don't have any
success with the current recommendation of running it outside of
pacemaker's control.
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org