On 19.08.2013 16:10, Jake Smith wrote:
-----Original Message-----
From: Elmar Marschke [mailto:elmar.marsc...@schenker.at]
Sent: Friday, August 16, 2013 10:31 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
On 16.08.2013 15:46, Jake Smith wrote:
-----Original Message-----
From: Elmar Marschke [mailto:elmar.marsc...@schenker.at]
Sent: Friday, August 16, 2013 9:05 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
Hi all,
I'm working on a two-node pacemaker cluster with dual-primary drbd and ocfs2.
Dual-primary drbd and ocfs2 WITHOUT pacemaker work fine (mounting, reading, writing, everything...).
When I try to make this work under pacemaker, there seems to be a problem starting the o2cb resource.

My (already simplified) configuration is:
-----------------------------------------
node poc1 \
    attributes standby="off"
node poc2 \
    attributes standby="off"
primitive res_dlm ocf:pacemaker:controld \
    op monitor interval="120"
primitive res_drbd ocf:linbit:drbd \
    params drbd_resource="r0" \
    op stop interval="0" timeout="100" \
    op start interval="0" timeout="240" \
    op promote interval="0" timeout="90" \
    op demote interval="0" timeout="90" \
    op notify interval="0" timeout="90" \
    op monitor interval="40" role="Slave" timeout="20" \
    op monitor interval="20" role="Master" timeout="20"
primitive res_o2cb ocf:pacemaker:o2cb \
    op monitor interval="60"
ms ms_drbd res_drbd \
    meta notify="true" master-max="2" master-node-max="1" target-role="Started"
property $id="cib-bootstrap-options" \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    last-lrm-refresh="1376574860"
Looks like you are missing clone, ordering, and colocation statements (or a group, which makes the config shorter since a group gives you order and colocation in one statement). The resources *must* start in a particular order, they must run on the same node, and there must be an instance of each resource on each node.
More here for DRBD 8.4:
http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html
Or DRBD 8.3:
http://www.drbd.org/users-guide-8.3/s-ocfs2-pacemaker.html
Basically add:

group grp_dlm_o2cb res_dlm res_o2cb
clone cl_dlm_o2cb grp_dlm_o2cb meta interleave="true"
order ord_drbd_then_dlm_o2cb inf: ms_drbd:promote cl_dlm_o2cb:start
colocation col_dlm_o2cb_with_drbdmaster inf: cl_dlm_o2cb ms_drbd:Master

(Note that the order and colocation constraints reference the master/slave resource ms_drbd, not the underlying res_drbd primitive.)
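With both nodes online and DRBD promoted on both, crm_mon should then show the clone running everywhere, along these lines (illustrative output, not taken from your cluster):

 Master/Slave Set: ms_drbd [res_drbd]
     Masters: [ poc1 poc2 ]
 Clone Set: cl_dlm_o2cb [grp_dlm_o2cb]
     Started: [ poc1 poc2 ]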
HTH
Jake
Hello Jake,

thanks for your reply. I already had res_dlm and res_o2cb grouped together and cloned as in your advice; indeed, that was my initial configuration. But the problem showed up anyway, so I tried to simplify the configuration to reduce the possible sources of error.
But now it seems I have found a solution, or at least a workaround: I simply use the LSB resource agent lsb:o2cb. This one works! The resource starts without a problem on both nodes and, as far as I can see right now, everything is fine (tried with and without the additional group and clone resources).

I don't know whether this will bring drawbacks in the future, but for the moment my problem seems to be solved.
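For reference, the working primitive is now roughly this (a sketch with the same resource name and monitor interval as before, not copied verbatim from my config):

primitive res_o2cb lsb:o2cb \
    op monitor interval="60"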
Not sure either - usually OCF resource agents are more robust than plain LSB scripts. I would also verify that the o2cb LSB init script is fully LSB compliant, or your cluster will have issues.
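A quick way to check by hand, using the exit codes the LSB spec (and Pacemaker) expects - adjust the script path if yours differs:

/etc/init.d/o2cb start;  echo "start:  $?"   # expect 0
/etc/init.d/o2cb status; echo "status: $?"   # expect 0 while running
/etc/init.d/o2cb stop;   echo "stop:   $?"   # expect 0
/etc/init.d/o2cb status; echo "status: $?"   # expect 3 once stopped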
Currently it seems to me that there's a subtle problem with the
ocf:pacemaker:o2cb resource agent; at least on my system.
Maybe, maybe not - if you take a look at the o2cb resource agent, the error message you were getting is logged after it has tried to start /usr/sbin/ocfs2_controld.pcmk for 10 seconds without success... I would time how long o2cb takes to start. It might be as simple as allowing more time for the daemon to come up.
I haven't set up ocfs2 in a while, but I believe you can extend that timeout in the parameters of the primitive without having to touch the actual resource agent.
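Untested, but something along these lines should do it; daemon_timeout is an instance parameter of the agent (its default is 10 seconds, the value 30 here is just an example):

primitive res_o2cb ocf:pacemaker:o2cb \
    params daemon_timeout="30" \
    op monitor interval="60"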
Jake
Hello Jake,

yes, I think that's possible, as you wrote. This is what "ra meta ocf:pacemaker:o2cb" says:

daemon_timeout (string, [10]): Daemon Timeout
    Number of seconds to allow the control daemon to come up

Thanks for the hint. I'll check that out when possible and see if it changes the behaviour. Currently I'm fine with lsb:o2cb...
regards
Anyway, thanks a lot for your answer..!
Best regards
elmar
First error message in corosync.log, as far as I can identify it:
----------------------------------------------------------------
lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk: no process found
[ other stuff ]
lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk: no process found
[ other stuff ]
lrmd: [5547]: info: RA output: (res_o2cb:start:stderr) 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
(You can find the whole corosync logfile, from starting corosync on node 1 until after the resources start, at: http://www.marschke.info/corosync_drei.log)
syslog shows:
-------------
ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not
exist
Output of crm_mon:
------------------
============
Stack: openais
Current DC: poc1 - partition WITHOUT quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============
Online: [ poc1 ]
OFFLINE: [ poc2 ]
 Master/Slave Set: ms_drbd [res_drbd]
     Masters: [ poc1 ]
     Stopped: [ res_drbd:1 ]
 res_dlm (ocf::pacemaker:controld): Started poc1
Migration summary:
* Node poc1:
res_o2cb: migration-threshold=1000000 fail-count=1000000
Failed actions:
    res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete): unknown error
---------------------------------------------------------------------
This is the situation after a reboot of node poc1. To keep things simple I left pacemaker / corosync stopped on the second node, and I had already removed the group and clone resources that dlm and o2cb used to be in (the errors showed up with those as well).
Is my configuration of the resource agents correct? I checked using "ra meta ...", and as far as I can tell everything is OK.
Is some piece of software missing? dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are available, and I even created additional links in /usr/sbin:
root@poc1:~# which ocfs2_controld.pcmk
/usr/sbin/ocfs2_controld.pcmk
root@poc1:~# which dlm_controld.pcmk
/usr/sbin/dlm_controld.pcmk
root@poc1:~#
I already googled but couldn't find anything useful. Thanks for any hints... :)
kind regards
elmar
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org