On 18/06/14 11:42 PM, Digimer wrote:
On 18/06/14 12:47 AM, Andrew Beekhof wrote:

On 18 Jun 2014, at 2:03 pm, Digimer <li...@alteeve.ca> wrote:

Hi all,

  I am trying to set up a basic pacemaker 1.1.10 cluster on RHEL 6.5 with DRBD
8.3.16.

  I've set up DRBD and configured one clustered LVM volume group using
that drbd resource as the PV. With DRBD configured alone, I can
stop/start pacemaker repeatedly without issue. However, when I add
the LVM VG using ocf:heartbeat:LVM and set up a constraint, subsequent
restarts of pacemaker almost always end up with a fence. I have to
think then that I am messing up my constraints...
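
The constraints in question are roughly of this shape (reconstructed here
for illustration only; 'lvm_vg0' is a placeholder, not the real resource
id):

pcs constraint order promote drbd_r0_Clone then start lvm_vg0
pcs constraint colocation add lvm_vg0 with master drbd_r0_Clone INFINITY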

find out who is calling stonith_admin:

Jun 17 23:56:06 an-a04n01 kernel: block drbd0: helper command:
/sbin/drbdadm fence-peer minor-0
Jun 17 23:56:07 an-a04n01 kernel: block drbd0: Handshake successful:
Agreed network protocol version 97
Jun 17 23:56:07 an-a04n01 stonith_admin[28637]:   notice:
crm_log_args: Invoked: stonith_admin --fence an-a04n02.alteeve.ca
Jun 17 23:56:07 an-a04n01 stonith-ng[28356]:   notice: handle_request:
Client stonith_admin.28637.6ed13ba6 wants to fence (off)
'an-a04n02.alteeve.ca' with device '(any)'

Double check fence_pcmk includes "--tag cman" as an argument to
stonith_admin (since that will rule it out as a source).
Could drbd be initiating it?
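
A quick way to see exactly how stonith_admin was invoked each time
(including whether "--tag cman" was passed) is to pull its crm_log_args
lines out of syslog, e.g.:

grep stonith_admin /var/log/messages | grep crm_log_args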

Following up on #linux-ha discussion...

DRBD was triggering this. When I used
/usr/lib/drbd/stonith_admin-fence-peer.sh, I saw both nodes sit at
'WFConnection' before it fenced, which normally tells me there is a
network issue. However, starting DRBD manually never had a problem, and
after the fenced node comes back up, starting pacemaker brings it up
fine.

So I decided to try '/usr/lib/drbd/crm-fence-peer.sh' instead. Now,
instead of fencing, an-a04n02 (node 2) fails to promote. However, if I
try running:

pcs resource debug-start drbd_r0

I get:

Operation start for drbd_r0:0 (ocf:linbit:drbd) returned 0
  >  stdout:         allow-two-primaries;
  >  stdout:
  >  stdout:
  >  stderr: WARNING: You may be disappointed: This RA is intended for
pacemaker 1.0 or better!
  >  stderr: DEBUG: r0: Calling drbdadm -c /etc/drbd.conf adjust r0
  >  stderr: DEBUG: r0: Exit code 0
  >  stderr: DEBUG: r0: Command output:
  >  stderr: DEBUG: r0: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
  >  stderr: DEBUG: r0: Exit code 0
  >  stderr: DEBUG: r0: Command output:
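
As a sanity check, DRBD itself can be asked what it thinks at that point;
on 8.3, either of these (run on both nodes) shows the connection state and
the Primary/Secondary roles:

cat /proc/drbd
drbdadm role r0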

The debug-start output suggests the agent thinks everything should work.
However, the cluster is left at:

Cluster name: an-anvil-04
Last updated: Wed Jun 18 23:37:43 2014
Last change: Wed Jun 18 23:19:24 2014 via cibadmin on an-a04n02.alteeve.ca
Stack: cman
Current DC: an-a04n01.alteeve.ca - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
2 Nodes configured
4 Resources configured


Online: [ an-a04n01.alteeve.ca an-a04n02.alteeve.ca ]

Full list of resources:

  fence_n01_ipmi    (stonith:fence_ipmilan):    Started
an-a04n01.alteeve.ca
  fence_n02_ipmi    (stonith:fence_ipmilan):    Started
an-a04n02.alteeve.ca
  Master/Slave Set: drbd_r0_Clone [drbd_r0]
      Masters: [ an-a04n01.alteeve.ca ]
      Slaves: [ an-a04n02.alteeve.ca ]

When I then check the constraints, I see that this one has been created:

Location Constraints:
   Resource: drbd_r0_Clone
     Constraint: drbd-fence-by-handler-r0-drbd_r0_Clone
       Rule: score=-INFINITY role=Master
(id:drbd-fence-by-handler-r0-rule-drbd_r0_Clone)
         Expression: #uname ne an-a04n01.alteeve.ca
(id:drbd-fence-by-handler-r0-expr-drbd_r0_Clone)
Ordering Constraints:
Colocation Constraints:

If I delete the constraint, node 2 suddenly promotes properly.
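
For the record, deleting it is just a matter of referencing the constraint
id shown above, something like:

pcs constraint remove drbd-fence-by-handler-r0-drbd_r0_Clone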

So I have to conclude that, for some reason, the way I am using
ocf:linbit:drbd is wrong, or I've not configured
'/etc/drbd.d/global_common.conf' properly.

Speaking of which, that is:

# /etc/drbd.conf
common {
     protocol               C;
     net {
         allow-two-primaries;
         after-sb-0pri    discard-zero-changes;
         after-sb-1pri    discard-secondary;
         after-sb-2pri    disconnect;
     }
     disk {
         fencing          resource-and-stonith;
     }
     syncer {
         rate             40M;
     }
     handlers {
         fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
     }
}

# resource r0 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r0 {
     on an-a04n01.alteeve.ca {
         device           /dev/drbd0 minor 0;
         disk             /dev/sda5;
         address          ipv4 10.10.40.1:7788;
         meta-disk        internal;
     }
     on an-a04n02.alteeve.ca {
         device           /dev/drbd0 minor 0;
         disk             /dev/sda5;
         address          ipv4 10.10.40.2:7788;
         meta-disk        internal;
     }
}

# resource r1 on an-a04n01.alteeve.ca: not ignored, not stacked
resource r1 {
     on an-a04n01.alteeve.ca {
         device           /dev/drbd1 minor 1;
         disk             /dev/sda6;
         address          ipv4 10.10.40.1:7789;
         meta-disk        internal;
     }
     on an-a04n02.alteeve.ca {
         device           /dev/drbd1 minor 1;
         disk             /dev/sda6;
         address          ipv4 10.10.40.2:7789;
         meta-disk        internal;
     }
}

Note that, for the time being, I've not configured r1 in pacemaker to
simplify the config while debugging.
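
For completeness, the drbd_r0_Clone resource was created along these lines
(paraphrased; the exact options are in the attached crm_report, so treat
the values here as approximations):

pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s
pcs resource master drbd_r0_Clone drbd_r0 master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true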

Attached is the crm_report; hopefully it sheds some light on what I am
doing wrong.

Thanks!

digimer

After sending this, I found that adding:

handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

allowed the constraint to be removed, so node 2 (an-a04n02) eventually promoted, but not before going into the failed state shown above.

A subsequent stop -> start of pacemaker on both nodes came up cleanly, with no fence action reported in /var/log/messages. I noticed that the drbd module was already loaded this time; I'm not sure if that made a difference.
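
A quick way to confirm the unfence handler really cleaned up after the resync is to check that no drbd-fence-by-handler constraint is left behind, e.g.:

pcs constraint --full | grep drbd-fence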

Will keep testing... Any insight is much appreciated.
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org