On 16.08.2010 13:29, Dejan Muhamedagic wrote:
On Sat, Aug 14, 2010 at 06:26:58AM +0200, Cnut Jansen wrote:
On 12.08.2010 18:46, Dejan Muhamedagic wrote:
The migration-threshold shouldn't in any way influence resources
which don't depend on the resource which fails over. Couldn't
reproduce it here with our example RAs.

So it seems that - for whatever reason - those constrained
resources are treated just as if they were in a resource group:
they all move to wherever they can all run, instead of the
"take it or leave it" relation of the dependent resource (mysql) to the
underlying resource (mount) that I had expected from the
constraints as I set them... shouldn't I?! o_O
Yes, those two constraints are equivalent to a group.
So in fact migration-threshold actually does influence resources that are neither grouped with nor dependent on the failing resource, when the failing resource depends on them?!

Of course I already knew that from groups, and there it - imho - also makes sense, since defining a group is like saying "I want all these resources to run together on one node, no matter how and where". But when setting constraints, i.e. defining dependencies, I at least understand "dependency" as one-sided, not mutual: the underlying resource is independent of its dependent, so it can do whatever it wants and doesn't have to care about its dependent at all, while the dependent shall only start when and where the underlying resource it depends on is started. So did I understand you right that, for Pacemaker, it is the intended behaviour for both groups and constraints to be mutual dependencies?

And if so: Is there also any possibility to define one-sided dependencies/influences?

And - concerning the failure-timeout - quite a while later, without
having reset mysql's failure counter or having done anything else
in the meantime:

4) alpha: FC(mysql)=3, crm_resource -F -r mysql -H alpha
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_asyncmon_0 (call=59, rc=1, cib-update=592,
confirmed=false) unknown error
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=60, rc=0, cib-update=596,
confirmed=true) ok
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_stop_0 (call=61, rc=0, cib-update=597,
confirmed=true) ok
beta: FC(mysql)=0
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM
operation mount_start_0 (call=40, rc=0, cib-update=96,
confirmed=true) ok
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM
operation mysql_start_0 (call=41, rc=0, cib-update=97,
confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=42, rc=0, cib-update=98,
confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM
operation mount_stop_0 (call=43, rc=0, cib-update=99,
confirmed=true) ok
alpha: FC(mysql)=4
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_start_0 (call=62, rc=0, cib-update=599,
confirmed=true) ok
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_start_0 (call=63, rc=0, cib-update=600,
confirmed=true) ok
This worked as expected, i.e. after the 150s cluster-recheck
interval the resources were started at alpha.
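
For reference, the gap between the two batches of log entries above can be checked with a few lines of Python. This is only arithmetic on the timestamps quoted in the logs; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the log excerpts above (same day, Aug 14).
fmt = "%H:%M:%S"
started_on_beta = datetime.strptime("04:44:47", fmt)
moved_back_to_alpha = datetime.strptime("04:47:17", fmt)

# Difference between the move to beta and the move back to alpha, in seconds.
gap_seconds = int((moved_back_to_alpha - started_on_beta).total_seconds())
print(gap_seconds)  # 150, i.e. exactly one cluster-recheck-interval
```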
Is it really "as expected" that many(!) minutes - and even cluster-rechecks - after the last picking-on, and with a failure-timeout of 45 seconds, the failure counter not only still shows a count of 3 but obviously really is 3 (not 0, as after a reset), so that the resource now migrates already on the very next picking-on?!

BTW, what's the point of cloneMountMysql? If it can run only
where drbd is master, then it can run on one node only:

colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
It's a dual-primary DRBD configuration, so there are actually - when
everything is ok (-; - 2 masters of each DRBD multistate resource...
even though I admit that at least the dual primary (i.e. dual
master) for msDrbdMysql is currently (quite) redundant, since in the
current cluster configuration there's only one primitive
MySQL resource and thus there's no strict need for MySQL's
data dir to be mounted on both nodes all the time.
But since it's not harmful to have it mounted on the other node too,
since msDrbdOpencms and msDrbdShared need to be mounted on both
nodes, and since I put the complete installation and configuration of
the cluster into flexibly configurable shell scripts, it's easier,
i.e. less typing, to just put all DRBD and
mount resources' configuration into one common loop. (-;
OK. It did cross my mind that it may be a dual-master drbd.

Your configuration is large. If you are going to run that in
production and don't really need a dual-master, then it'd be
good to get rid of the ocfs2 bits to make maintenance easier.
Well, there are 3 DRBD resources, and the other 2 besides the one for MySQL's datadir must already be dual-primary now, since they need to be mounted on all nodes for the Apache/Tomcat/Opencms teams. Therefore it's indeed easier for maintenance to just keep all 3 DRBDs' configurations in sync, at the cost of only one little extra line for cloning mountMysql. (-;

d) I also have the impression that fail-counters don't get reset
after their failure-timeout, because when migration-threshold=3 is
set, those issues occur upon every(!) following picking-on, even
when I've waited for nearly 5 minutes (with failure-timeout=90)
without touching the cluster at all
That seems to be a bug though I couldn't reproduce it with a
simple configuration.
I just tested this once again: it seems that
failure-timeout only sets scores back from -inf to around 0
(wherever they would normally be), allowing the resources to
return to the node. I tested with a location constraint
set for the underlying resource (see configuration): after the
failure-timeout has elapsed, on the next cluster-recheck (and
only then!) the underlying resource and its dependents return to the
underlying resource's preferred location, as you can see in the logs above.
The count gets reset, but the cluster acts on it only after the
cluster-recheck-interval, unless something else makes the cluster
calculate new scores.
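
The timing described here can be sketched as follows. This is a toy model, not Pacemaker code: it assumes that rechecks fire at fixed multiples of cluster-recheck-interval counted from an arbitrary t=0 and that nothing else triggers a recalculation in between. The function name is hypothetical.

```python
import math

def first_recheck_after_expiry(failure_time, failure_timeout, recheck_interval):
    """Return the time (in seconds) of the first periodic recheck that
    happens at or after the moment the failure expires.

    Toy model: rechecks happen at t = recheck_interval, 2 * recheck_interval, ...
    """
    expiry = failure_time + failure_timeout
    ticks = max(1, math.ceil(expiry / recheck_interval))
    return ticks * recheck_interval

# Failure at t=0 s, failure-timeout=40 s, cluster-recheck-interval=150 s:
print(first_recheck_after_expiry(0, 40, 150))    # 150
# Failure at t=130 s: it expires at 170 s and is acted on at the 300 s recheck.
print(first_recheck_after_expiry(130, 40, 150))  # 300
```

In this model the failure-timeout only decides when a failure becomes eligible to be forgotten; the recheck interval decides when the cluster notices.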
See above, picking-on #4: More than 26 minutes after the last picking-on, with migration-threshold=3, failure-timeout=40 and cluster-recheck-interval=150, the resources already get migrated upon the first picking-on (and the displayed failure counter rises to 4). To me that doesn't look like the failure counter being reset to 0 after failure-timeout, but just like the scores being reset. Actually - except maybe by tricks/force - it shouldn't be possible at all to get the resource running again on the node it failed on for as long as its failure counter there has reached the migration-threshold limit, right? How, then, can the failure counter ever climb beyond the migration-threshold limit at all (ok, I could still imagine reasons for that), and especially why does migration-threshold from then on behave on every failure as if it were set to 1, even when it's set to e.g. 3?
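
One possible reading of these observations, offered as an assumption and not as a statement about Pacemaker's implementation: if expired failures are merely ignored when scores are computed, while the stored counter keeps its value, then the next failure would bump the counter from 3 to 4 and make it current again, putting it over migration-threshold at once. The toy model below only illustrates that hypothesis; the class and its methods are invented for this sketch.

```python
class FailCounter:
    """Toy model of one resource's fail count on one node (hypothetical)."""

    def __init__(self, migration_threshold, failure_timeout):
        self.migration_threshold = migration_threshold
        self.failure_timeout = failure_timeout
        self.count = 0
        self.last_failure = None

    def fail(self, now):
        # Assumption: the stored count is incremented and never cleared on expiry.
        self.count += 1
        self.last_failure = now

    def banned(self, now):
        # Assumption: an expired count is ignored; a fresh one is compared
        # against migration-threshold.
        if self.last_failure is None:
            return False
        expired = now - self.last_failure >= self.failure_timeout
        return (not expired) and self.count >= self.migration_threshold


fc = FailCounter(migration_threshold=3, failure_timeout=40)
for t in (0, 10, 20):          # three failures in quick succession
    fc.fail(t)
print(fc.banned(20))           # True: threshold reached, resource moves away
print(fc.banned(26 * 60))      # False: failures expired, resource may return
fc.fail(26 * 60 + 1)           # a single new failure much later
print(fc.count, fc.banned(26 * 60 + 1))  # 4 True: moves on the first failure
```

Under these assumptions the threshold would look as if it were 1 after the first migration, which matches what the logs show, but whether Pacemaker actually works this way is exactly the open question.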



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
