[Pacemaker] node is unable to join cluster after upgrade (crmd dies)

Krause, Markus Tue, 17 Jun 2014 10:25:31 -0700

Hi all, 

I am using pacemaker/openais and drbd to have a high-available MySQL server 
which worked „for years“ (sind SLES 11 SP 0) without configuration chances. 
Just two days ago I did an update (zypper up) which now leads to the issue 
described below.
Although the subject of my email is quite similar to many others on this list I 
could not find an equivalent (so it seems) problem or a solution. I tried to 
provide as much information as seemed to be reasonable to me so I apologize in 
advance for the many lines!


Two servers with identical configuration (hard and software) running SLES 11 
SP3 (with HAE) have the following crm config:

———————
sql01a:~ # crm configure show
node sql01a
node sql01b
primitive drbd_drbd0 ocf:linbit:drbd \
        params drbd_resource="drbd0" \
        op monitor interval="14" role="Master" \
        op monitor interval="16" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="240"
primitive ha_ip ocf:heartbeat:IPaddr2 \
        op monitor interval="5" timeout="20" \
        params ip="10.4.11.5" cidr_netmask="16" iflabel="ha" nic="eth0"
primitive mount_drbd0 ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" fstype="ext3" directory="/var/drbd0" \
        meta target-role="Started" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="300"
primitive mysqld ocf:heartbeat:mysql \
        params binary="/usr/sbin/mysqld" config="/etc/my.cnf" 
datadir="/var/drbd0/mysql/data" user="mysql" group="mysql" 
log="/var/log/mysql.log" pid="/var/lib/mysql/mysqld.pid" 
socket="/var/lib/mysql/mysql.sock" test_table="ha_check.mysql_check" 
test_user=„XXX" test_passwd=„XXX" enable_creation="no" \
        op monitor interval="10" timeout="60" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="240"
primitive ping_gateway ocf:pacemaker:ping \
        params dampen="10" multiplier="100" host_list="10.4.0.1" \
        op start interval="0" timeout="60" \
        op monitor interval="5" timeout="60"
group group_mysqld mount_drbd0 mysqld ha_ip
ms ms_drbd_drbd0 drbd_drbd0 \
        meta master-max="1" master-node-max="1" clone-max="2" 
clone-node-max="1" notify="true"
clone ping_clone ping_gateway
location loc_drbdmaster_drbd0 ms_drbd_drbd0 \
        rule $id="loc_drbdmaster_ping-rule" $role="Master" pingd: defined pingd
colocation col_ms_drbd_drbd0_mount_drbd0 inf: group_mysqld ms_drbd_drbd0:Master
order ord_ms_drbd_drbd0_mount_drbd0 inf: ms_drbd_drbd0:promote 
group_mysqld:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-f3eeaf4" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false“
———————

this worked since the first setup long ago.
after the update (zypper up) two days ago only sql01b is running, sql01a is 
unable to join the cluster.

———————
sql01a:~ # crm status:
Last updated: Tue Jun 17 18:20:29 2014
Last change: Tue Jun 17 17:26:20 2014 by root via crm_attribute on sql01b
Stack: classic openais (with plugin)
Current DC: sql01b - partition with quorum
Version: 1.1.10-f3eeaf4
2 Nodes configured, 2 expected votes
7 Resources configured


Online: [ sql01b ]
OFFLINE: [ sql01a ]

 Clone Set: ping_clone [ping_gateway]
     Started: [ sql01b ]
     Stopped: [ sql01a ]
 Master/Slave Set: ms_drbd_drbd0 [drbd_drbd0]
     Masters: [ sql01b ]
     Stopped: [ sql01a ]
 Resource Group: group_mysqld
     mount_drbd0        (ocf::heartbeat:Filesystem):    Started sql01b 
     mysqld     (ocf::heartbeat:mysql): Started sql01b 
     ha_ip      (ocf::heartbeat:IPaddr2):       Started sql01b
———————

Starting the cluster tools on sql01a („rcopenais start“) leads to the following 
lines (snipped the „OK-looking“ lines at top as I enabled debugging in 
corosync.conf; if necessary i can of course post the whole output, but that’s a 
lot of lines ...) in /var/log/messages:

——————— 
[snip]
Jun 17 18:20:18 sql01a corosync[14051]:   [MAIN  ] main.c:278 Completed service 
synchronization, ready to provide service.
Jun 17 18:20:18 sql01a corosync[14051]:   [TOTEM ] totempg.c:292 
waiting_trans_ack changed to 0
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.cib failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.cib failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a mgmtd: [14062]: debug: main: run the loop...
Jun 17 18:20:18 sql01a mgmtd: [14062]: info: Started.
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: 
pcmk_wait_dispatch: Child process cib exited (pid=14056, rc=100)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: 
pcmk_wait_dispatch: Child process cib no longer wishes to be respawned
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: 
update_member: Node sql01a now has process list: 
00000000000000000000000000151212 (1380882)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: 
pcmk_wait_dispatch: Child process crmd exited (pid=14061, rc=100)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: 
pcmk_wait_dispatch: Child process crmd no longer wishes to be respawned
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: 
update_member: Node sql01a now has process list: 
00000000000000000000000000151012 (1380370)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: 
pcmk_wait_dispatch: Child process attrd exited (pid=14059, rc=100)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: 
pcmk_wait_dispatch: Child process attrd no longer wishes to be respawned
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: 
update_member: Node sql01a now has process list: 
00000000000000000000000000150012 (1376274)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: 
send_cluster_id: Local update: id=17499146, born=1452, seq=1452
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: 
pcmk_wait_dispatch: Child process pengine exited (pid=14060, rc=100)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: 
pcmk_wait_dispatch: Child process pengine no longer wishes to be respawned
[snip]
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1365 info: 
send_member_notification: Sending membership update 1452 to 1 children
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:962 debug: 
pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:962 debug: 
pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.cib failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.crmd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.attrd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.attrd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.attrd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: 
route_ais_message: Sending message to local.attrd failed: ipc delivery failed 
(rc=-2)
Jun 17 18:20:22 sql01a sshd[14068]: Connection closed by 10.4.1.60 [preauth]
Jun 17 18:20:23 sql01a corosync[14051]:   [CONFDB] confdb.c:405 lib_init_fn: 
conn=0x663570
Jun 17 18:20:23 sql01a corosync[14051]:   [CONFDB] confdb.c:411 exit_fn for 
conn=0x663570
Jun 17 18:20:27 sql01a stonith-ng[14057]:    error: setup_cib: Could not 
connect to the CIB service: Transport endpoint is not connected (-107)
Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node sql01a[17499146] - state is now member (was 
(null))
Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: get_node_name: Could not 
obtain a node name for classic openais (with plugin) nodeid 34276362
Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: crm_update_peer_state: 
plugin_handle_membership: Node (null)[34276362] - state is now member (was 
(null))
Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: plugin_handle_membership: 
Membership 1452: quorum acquired
[snip]
———————

some more info:
———————
sql01a:~ # corosync-cfgtool -s
Printing ring status.
Local node ID 17499146
RING ID 0
        id      = 10.4.11.1
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.255.4.1
        status  = ring 1 active with no faults
sql01a:~ # corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.17499146.ip=r(0) ip(10.4.11.1) r(1) 
ip(10.255.4.1) 
runtime.totem.pg.mrp.srp.members.17499146.join_count=1
runtime.totem.pg.mrp.srp.members.17499146.status=joined
runtime.totem.pg.mrp.srp.members.34276362.ip=r(0) ip(10.4.11.2) r(1) 
ip(10.255.4.2) 
runtime.totem.pg.mrp.srp.members.34276362.join_count=1
runtime.totem.pg.mrp.srp.members.34276362.status=joined
———————
This is similar to the other (working) host.

Interesting is:
———————
sql01a:~ # rcopenais status
Running
sql01a:~ # ps ax | grep -E "pace|heart|coro" | grep -v grep
14051 ?        Ssl    0:00 /usr/sbin/corosync
14057 ?        S<     0:00 /usr/lib64/pacemaker/stonithd
14058 ?        S<     0:00 /usr/lib64/pacemaker/lrmd
14062 ?        S<     0:00 /usr/lib64/heartbeat/mgmtd
sql01a:~ # 
———————


For digging further what’s going wrong I then started the missing processes 
(cib, attrd, pengine and crmd) from the command line (using the -V switch). 
cib, attrd and pengine are starting fine (even crm now shows a result, although 
„wrong“) but crmd fails:

———————
sql01a:~ # /usr/lib64/pacemaker/crmd -V
  notice: main:         CRM Git Version: f3eeaf4
    info: do_log:       FSA: Input I_STARTUP from crmd_init() received in state 
S_STARTING
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_clm' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_evt' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_ckpt' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_amf_v2' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_msg' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_lck' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'openais_tmr' for option: name
    info: config_find_next:     Processing additional service options...
    info: get_config_opt:       Found 'pacemaker' for option: name
    info: get_config_opt:       Found '0' for option: ver
    info: get_cluster_type:     Detected an active 'classic openais (with 
plugin)' cluster
    info: do_cib_control:       CIB connection established
  notice: crm_cluster_connect:  Connecting to cluster infrastructure: classic 
openais (with plugin)
    info: init_cs_connection_classic:   Creating connection to our Corosync 
plugin
    info: init_cs_connection_classic:   AIS connection established
  notice: get_node_name:        Defaulting to uname -n for the local classic 
openais (with plugin) node name
    info: plugin_get_details:   Server details: id=17499146 uname=sql01a 
cname=pcmk
    info: crm_get_peer:         Created entry 
6002e998-b8ba-4daa-9c5d-1082bd1ab73a/0x65b8a0 for node sql01a/17499146 (1 total)
    info: crm_get_peer:         Node 17499146 is now known as sql01a
    info: peer_update_callback:         sql01a is now (null)
    info: crm_get_peer:         Node 17499146 has uuid sql01a
    info: crm_update_peer_proc:         init_cs_connection_classic: Node 
sql01a[17499146] - unknown is now online
    info: peer_update_callback:         Client sql01a/peer now has status 
[online] (DC=<null>)
    info: init_cs_connection_once:      Connection to 'classic openais (with 
plugin)': established
  notice: get_node_name:        Defaulting to uname -n for the local classic 
openais (with plugin) node name
    info: do_ha_control:        Connected to the cluster
    info: lrmd_ipc_connect:     Connecting to lrmd
    info: do_lrm_control:       LRM connection established
    info: do_started:   Delaying start, no membership data (0000000000100000)
    info: do_started:   Delaying start, no membership data (0000000000100000)
    info: do_started:   Delaying start, no membership data (0000000000100000)
  notice: plugin_handle_membership:     Membership 1460: quorum acquired
    info: crm_update_peer_proc:         plugin_handle_membership: Node 
sql01a[17499146] - unknown is now member
    info: peer_update_callback:         Client sql01a/peer now has status 
[offline] (DC=<null>)
  notice: peer_update_callback:         Our peer connection failed
  notice: crm_update_peer_state:        plugin_handle_membership: Node 
sql01a[17499146] - state is now member (was (null))
    info: peer_update_callback:         sql01a is now member (was (null))
    info: crm_update_peer:      plugin_handle_membership: Node sql01a: 
id=17499146 state=member addr=r(0) ip(10.4.11.1) r(1) ip(10.255.4.1)  (new) 
votes=1 (new) born=1460 seen=1460 proc=00000000000000000000000000140012
    info: crm_get_peer:         Created entry 
8925fb86-2b70-4c91-af03-a8251a9182fc/0x65ecf0 for node sql01b/34276362 (2 total)
    info: crm_get_peer:         Node 34276362 is now known as sql01b
    info: peer_update_callback:         sql01b is now (null)
    info: crm_get_peer:         Node 34276362 has uuid sql01b
    info: crm_update_peer_proc:         plugin_handle_membership: Node 
sql01b[34276362] - unknown is now member
    info: peer_update_callback:         Client sql01b/peer now has status 
[online] (DC=<null>)
  notice: crm_update_peer_state:        plugin_handle_membership: Node 
sql01b[34276362] - state is now member (was (null))
    info: peer_update_callback:         sql01b is now member (was (null))
    info: crm_update_peer:      plugin_handle_membership: Node sql01b: 
id=34276362 state=member addr=r(0) ip(10.4.11.2) r(1) ip(10.255.4.2)  (new) 
votes=1 (new) born=1436 seen=1460 proc=00000000000000000000000000151312
   error: reap_dead_nodes:      We're not part of the cluster anymore
    info: plugin_handle_membership:     Membership 1460: quorum retained
   error: do_log:       FSA: Input I_ERROR from peer_update_callback() received 
in state S_STARTING
  notice: do_state_transition:  State transition S_STARTING -> S_RECOVERY [ 
input=I_ERROR cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
 warning: do_recover:   Fast-tracking shutdown in response to errors
   error: do_started:   Start cancelled... S_RECOVERY
   error: do_log:       FSA: Input I_ERROR from reap_dead_nodes() received in 
state S_RECOVERY
    info: do_dc_release:        DC role released
    info: do_te_control:        Transitioner is now inactive
   error: do_log:       FSA: Input I_TERMINATE from do_recover() received in 
state S_RECOVERY
    info: do_state_transition:  State transition S_RECOVERY -> S_TERMINATE [ 
input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
  notice: lrm_state_verify_stopped:     Stopped 0 recurring operations at 
shutdown (0 ops remaining)
    info: do_lrm_control:       Disconnecting from the LRM
    info: lrmd_api_disconnect:  Disconnecting from lrmd service
    info: lrmd_ipc_connection_destroy:  IPC connection destroyed
    info: lrm_connection_destroy:       LRM Connection disconnected
    info: lrmd_api_disconnect:  Disconnecting from lrmd service
  notice: do_lrm_control:       Disconnected from the LRM
    info: crm_cluster_disconnect:       Disconnecting from cluster 
infrastructure: classic openais (with plugin)
  notice: terminate_cs_connection:      Disconnecting from Corosync
    info: cluster_disconnect_cpg:       No CPG connection
    info: crm_cluster_disconnect:       Disconnected from classic openais (with 
plugin)
    info: do_ha_control:        Disconnected from the cluster
    info: do_cib_control:       Disconnecting CIB
    info: crmd_cib_connection_destroy:  Connection to the CIB terminated...
    info: do_exit:      Performing A_EXIT_0 - gracefully exiting the CRMd
    info: do_exit:      [crmd] stopped (0)
    info: crmd_exit:    Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE 
cause=C_FSA_INTERNAL origin=do_dc_release ]
    info: crmd_exit:    Dropping I_TERMINATE: [ state=S_TERMINATE 
cause=C_FSA_INTERNAL origin=do_stop ]
    info: crmd_cs_destroy:      connection closed
    info: crmd_init:    13510 stopped: OK (0)
   error: crmd_fast_exit:       Could not recover from internal error
    info: crm_xml_cleanup:      Cleaning up memory from libxml2
sql01a:~ # 
———————

what I tried up to now:
- stop sql01a, remove all cib.xml and crm.xml in their folders and restart 
sql01a (I know this is quite harsh but what did I have to loose …) -> the 
config on sql01a has been taken from sql01b but the crmd always dies as before 
(same error messages)
- I removed all cluster related packages (pacemaker, corosync, drbd, …) from 
sql01a and deleted the remaining dirs and config files, reinstalled all with 
copies from sql01b -> same result.
- I removed the whole crm config from sql01b (the working cluster host) using 
cibadmin -Q —force and re-added the config (using crm „commands") -> same result
- of course rebooted/restarted both servers several times -> nothing changes …


As deleting the config from sql01a (several times) I thought the error must be 
in the config/communication on/with sql01b but I also removed and added the 
config there, in addition there is now firewall between them and there is 
communication.

Does anybody know/understand what is going on here?
How can I solve this issue and get my cluster back to work as it did for quite 
a long time recently?


thanks in advance for any hints!

best regards,
   Markus




-- .- .-. -.- ..- ... / -.- .-. .- ..- ... .
Markus Krause
Max Planck Institute of Biochemistry
- Computing Center (core services, networking, OS X)
Am Klopferspitz 18
D-82152 Martinsried, Germany
Phone: +49(89)8578-2825

smime.p7s
Description: S/MIME cryptographic signature

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

[Pacemaker] node is unable to join cluster after upgrade (crmd dies)

Reply via email to