[Ubuntu-ha] [Bug 1368737] Re: Pacemaker can seg fault on crm node online/standby

Peter Parzer Thu, 08 Jan 2015 08:24:04 -0800

The cluster consists of two HP ProLiant DL120 G7 rack servers used as
file servers with DRBD and Samba. I ran the same configuration on 12.04
for 2 years without any problems.


The cluster configuration:

node $id="167772161" kjp02 \
        attributes standby="off"
node $id="167772162" kjp03 \
        attributes standby="off"
primitive drbd ocf:linbit:drbd \
        params drbd_resource="srv" \
        op monitor interval="29" role="Master" \
        op monitor interval="31" role="Slave"
primitive ip ocf:heartbeat:IPaddr2 \
        params ip="161.42.184.40" \
        op monitor interval="30" \
        meta target-role="Started"
primitive mail ocf:heartbeat:MailTo \
        params email="root" \
        meta target-role="Started"
primitive nmb upstart:nmbd \
        op monitor interval="60" \
        meta target-role="Started"
primitive quota lsb:quota \
        op monitor interval="60" \
        op start timeout="300" interval="0" \
        meta target-role="Started"
primitive smb upstart:smbd \
        op monitor interval="60" \
        meta target-role="Started"
primitive srv ocf:heartbeat:Filesystem \
        op monitor interval="60" \
        params device="/dev/drbd0" directory="/srv" fstype="ext4" options="noatime,acl,usrquota,user_xattr" \
        meta target-role="Started"
primitive st_kjp02 stonith:external/ipmi \
        params hostname="kjp02" ipaddr="161.42.184.42" userid="Administrator" passwd="***" interface="lanplus" \
        op monitor interval="120"
primitive st_kjp03 stonith:external/ipmi \
        params hostname="kjp03" ipaddr="161.42.184.44" userid="Administrator" passwd="***" interface="lanplus" \
        op monitor interval="120"
primitive winbind upstart:winbind \
        op monitor interval="60" \
        meta target-role="Started"
ms drbd_ms drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location st_kjp02_loc st_kjp02 -inf: kjp02
location st_kjp03_loc st_kjp03 -inf: kjp03
colocation ip_srv inf: ip srv
colocation mail_ip inf: mail ip
colocation nmb_ip inf: nmb ip
colocation quota_srv inf: quota srv
colocation smb_winbind inf: smb winbind
colocation srv_drbd inf: srv drbd_ms:Master
colocation winbind_ip inf: winbind ip
order drbd_srv inf: drbd_ms:promote srv:start
order ip_mail inf: ip mail
order ip_nmb inf: ip nmb
order ip_winbind inf: ip winbind
order srv_ip inf: srv:start ip
order srv_quota inf: srv:start quota
order winbind_smb inf: winbind smb
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-42f2063" \
        cluster-infrastructure="corosync" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1416995137"


dpkg versions are attached.

Peter


** Attachment added: "dpkg-versions"
   https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+attachment/4294035/+files/dpkg-versions

-- 
You received this bug notification because you are a member of Ubuntu
High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1368737

Title:
  Pacemaker can seg fault on crm node online/standby

Status in pacemaker package in Ubuntu:
  Fix Released
Status in pacemaker source package in Trusty:
  Fix Committed
Status in pacemaker source package in Utopic:
  Fix Committed
Status in pacemaker source package in Vivid:
  Fix Released

Bug description:
  [IMPACT]

    - Pacemaker seg faults on repeated crm node online/standby because:
        - Newer glib versions use a hash table to find GSources
        - Glib can assert on a source being removed multiple times

  [TEST CASE]

    - Using the same configuration as the attached cib.xml:

          #!/bin/bash

          while true; do
              crm node standby clustertrusty01
              sleep 7
              crm node online clustertrusty01
              sleep 7
              crm node standby clustertrusty02
              sleep 7
              crm node online clustertrusty02
              sleep 7
              crm node standby clustertrusty03
              sleep 7
              crm node online clustertrusty03
              sleep 7
          done
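
  The loop above can also be written as a small function, which makes it
  easier to run a bounded number of cycles or to change the node list. This
  is only a sketch: the names toggle_nodes, CRM_CMD and PAUSE are
  hypothetical and not part of the original test case, which calls crm
  directly, sleeps 7 seconds, and runs until interrupted.

```shell
#!/bin/bash
# Hypothetical helper, not from the bug report: put each node into standby
# and back online, for a fixed number of cycles.
# CRM_CMD and PAUSE are assumptions added for illustration; the defaults
# match the original loop (the crm command, 7 second pause).
CRM_CMD="${CRM_CMD:-crm}"
PAUSE="${PAUSE:-7}"

# Usage: toggle_nodes <cycles> <node> [<node> ...]
# The function body runs in a subshell so its variables stay local.
toggle_nodes() (
    cycles="$1"
    shift
    i=0
    while [ "$i" -lt "$cycles" ]; do
        for node in "$@"; do
            "$CRM_CMD" node standby "$node"
            sleep "$PAUSE"
            "$CRM_CMD" node online "$node"
            sleep "$PAUSE"
        done
        i=$((i + 1))
    done
)
```

  For example, "toggle_nodes 100 clustertrusty01 clustertrusty02
  clustertrusty03" would run 100 passes over the three nodes used above.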

  [REGRESSION POTENTIAL]

    - Based on upstream commit 568e41d
    - Test case ran for more than 7 hours with no problems

  [OTHER INFO]

  The following situation was brought to my attention:

  """
  [Issue]

  lrmd process crashed when repeating "crm node standby" and "crm node
  online"

  ----------------
  # grep pacemakerd ha-log.k1pm101 | grep core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
  Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
  Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
  Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
  Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
  Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
  Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
  Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
  Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
  ----------------

  ----------------
  # grep pacemakerd ha-log.k1pm102 | grep core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
  Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
  Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
  Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
  """

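  To get a quick per-host tally of how often lrmd dumped core, log lines
  like the ones quoted above can be piped through a small awk filter. This
  is a sketch only: count_core_dumps is a hypothetical helper name, and it
  assumes each log entry is on a single line with the host name in the
  fourth whitespace-separated field, as in the excerpts above.

```shell
#!/bin/bash
# Hypothetical helper, not from the bug report: read syslog-style lines on
# stdin and print "<host> <count>" for every host whose pacemakerd logged a
# "child_waitpid ... dumped core" message.
# Assumption: field 4 of each line is the host name (e.g. k1pm101).
count_core_dumps() {
    awk '/pacemakerd.*child_waitpid.*dumped core/ { n[$4]++ }
         END { for (h in n) print h, n[h] }' | sort
}
```

  It could be used as "cat ha-log.k1pm101 ha-log.k1pm102 |
  count_core_dumps", with the file names taken from the grep commands
  above.
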
  Analyzing the core file with dbgsyms, I could see that:

  #0  0x00007f7184a45983 in services_action_sync (op=0x7f7185b605d0) at services.c:434
  434           crm_trace(" >  stdout: %s", op->stdout_data);

  is responsible for the core.

  I've checked upstream code and there might be 2 important commits that
  could be cherry-picked to fix this behavior:

  commit f2a637cc553cb7aec59bdcf05c5e1d077173419f
  Author: Andrew Beekhof <[email protected]>
  Date:   Fri Sep 20 12:20:36 2013 +1000

      Fix: services: Prevent use-of-NULL when executing service actions

  commit 11473a5a8c88eb17d5e8d6cd1d99dc497e817aac
  Author: Gao,Yan <[email protected]>
  Date:   Sun Sep 29 12:40:18 2013 +0800

      Fix: services: Fix the executing of synchronous actions

  The core can be caused by things such as this missing code:

  if (op == NULL) {
      crm_trace("No operation to execute");
      return FALSE;
  }

  at the beginning of the "services_action_sync(svc_action_t * op)"
  function.

  This is further improved by commit 11473a5.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1368737/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp
