On 18 Jun 2014, at 2:59 am, Krause, Markus <kra...@biochem.mpg.de> wrote:
> Hi all,
>
> I am using pacemaker/openais and drbd to have a highly available MySQL server
> which worked "for years" (since SLES 11 SP 0) without configuration changes.
> Just two days ago I did an update (zypper up) which now leads to the issue
> described below.
> Although the subject of my email is quite similar to many others on this list,
> I could not find an equivalent problem (so it seems) or a solution. I tried
> to provide as much information as seemed reasonable to me, so I apologize in
> advance for the many lines!

This looks ominous:

   error: reap_dead_nodes: We're not part of the cluster anymore

It could be a software bug, but I'd start by checking the firewall.
Upgrades will sometimes affect those. (A quick way to check is sketched
further down, after your config.)

>
> Two servers with identical configuration (hardware and software) running
> SLES 11 SP3 (with HAE) have the following crm config:
>
> ———————
> sql01a:~ # crm configure show
> node sql01a
> node sql01b
> primitive drbd_drbd0 ocf:linbit:drbd \
>         params drbd_resource="drbd0" \
>         op monitor interval="14" role="Master" \
>         op monitor interval="16" role="Slave" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="240"
> primitive ha_ip ocf:heartbeat:IPaddr2 \
>         op monitor interval="5" timeout="20" \
>         params ip="10.4.11.5" cidr_netmask="16" iflabel="ha" nic="eth0"
> primitive mount_drbd0 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" fstype="ext3" directory="/var/drbd0" \
>         meta target-role="Started" \
>         op start interval="0" timeout="60" \
>         op stop interval="0" timeout="300"
> primitive mysqld ocf:heartbeat:mysql \
>         params binary="/usr/sbin/mysqld" config="/etc/my.cnf" datadir="/var/drbd0/mysql/data" user="mysql" group="mysql" log="/var/log/mysql.log" pid="/var/lib/mysql/mysqld.pid" socket="/var/lib/mysql/mysql.sock" test_table="ha_check.mysql_check" test_user="XXX" test_passwd="XXX" enable_creation="no" \
>         op monitor interval="10" timeout="60" \
>         op start interval="0" timeout="240" \
>         op stop interval="0" timeout="240"
> primitive ping_gateway ocf:pacemaker:ping \
>         params dampen="10" multiplier="100" host_list="10.4.0.1" \
>         op start interval="0" timeout="60" \
>         op monitor interval="5" timeout="60"
> group group_mysqld mount_drbd0 mysqld ha_ip
> ms ms_drbd_drbd0 drbd_drbd0 \
>         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
> clone ping_clone ping_gateway
> location loc_drbdmaster_drbd0 ms_drbd_drbd0 \
>         rule $id="loc_drbdmaster_ping-rule" $role="Master" pingd: defined pingd
> colocation col_ms_drbd_drbd0_mount_drbd0 inf: group_mysqld ms_drbd_drbd0:Master
> order ord_ms_drbd_drbd0_mount_drbd0 inf: ms_drbd_drbd0:promote group_mysqld:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.10-f3eeaf4" \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false"
> ———————
>
> This worked since the first setup long ago.
> After the update (zypper up) two days ago only sql01b is running; sql01a is
> unable to join the cluster.
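Something like the following should rule the firewall in or out quickly. This
is only a rough sketch: it assumes SuSEfirewall2 on SLES 11 and the default
corosync multicast port, so substitute whatever mcastport and interfaces your
corosync.conf actually uses, and run it on both nodes:

  # is a packet filter active at all? (assumes SuSEfirewall2)
  rcSuSEfirewall2 status
  iptables -L -n -v          # look for DROP/REJECT rules on the ring networks

  # which addresses/ports corosync actually uses
  grep -E "bindnetaddr|mcastaddr|mcastport" /etc/corosync/corosync.conf

  # watch the totem traffic itself (eth0 and 5405 are assumptions, adjust)
  tcpdump -ni eth0 udp port 5405

That said, your corosync-cfgtool/corosync-objctl output further down shows
both nodes joined on both rings, so if the packets are flowing this is
probably not the whole story.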
>
> ———————
> sql01a:~ # crm status
> Last updated: Tue Jun 17 18:20:29 2014
> Last change: Tue Jun 17 17:26:20 2014 by root via crm_attribute on sql01b
> Stack: classic openais (with plugin)
> Current DC: sql01b - partition with quorum
> Version: 1.1.10-f3eeaf4
> 2 Nodes configured, 2 expected votes
> 7 Resources configured
>
>
> Online: [ sql01b ]
> OFFLINE: [ sql01a ]
>
> Clone Set: ping_clone [ping_gateway]
>     Started: [ sql01b ]
>     Stopped: [ sql01a ]
> Master/Slave Set: ms_drbd_drbd0 [drbd_drbd0]
>     Masters: [ sql01b ]
>     Stopped: [ sql01a ]
> Resource Group: group_mysqld
>     mount_drbd0   (ocf::heartbeat:Filesystem):   Started sql01b
>     mysqld        (ocf::heartbeat:mysql):        Started sql01b
>     ha_ip         (ocf::heartbeat:IPaddr2):      Started sql01b
> ———————
>
> Starting the cluster tools on sql01a ("rcopenais start") leads to the
> following lines in /var/log/messages (I snipped the "OK-looking" lines at the
> top, as I enabled debugging in corosync.conf; if necessary I can of course
> post the whole output, but that's a lot of lines ...):
>
> ———————
> [snip]
> Jun 17 18:20:18 sql01a corosync[14051]: [MAIN ] main.c:278 Completed service synchronization, ready to provide service.
> Jun 17 18:20:18 sql01a corosync[14051]: [TOTEM ] totempg.c:292 waiting_trans_ack changed to 0
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a mgmtd: [14062]: debug: main: run the loop...
> Jun 17 18:20:18 sql01a mgmtd: [14062]: info: Started.
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process cib exited (pid=14056, rc=100)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:490 notice: pcmk_wait_dispatch: Child process cib no longer wishes to be respawned
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000151212 (1380882)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=14061, rc=100)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:490 notice: pcmk_wait_dispatch: Child process crmd no longer wishes to be respawned
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000151012 (1380370)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process attrd exited (pid=14059, rc=100)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:490 notice: pcmk_wait_dispatch: Child process attrd no longer wishes to be respawned
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000150012 (1376274)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process pengine exited (pid=14060, rc=100)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:490 notice: pcmk_wait_dispatch: Child process pengine no longer wishes to be respawned
> [snip]
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1365 info: send_member_notification: Sending membership update 1452 to 1 children
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:962 debug: pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:962 debug: pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:18 sql01a corosync[14051]: [pcmk ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
> Jun 17 18:20:22 sql01a sshd[14068]: Connection closed by 10.4.1.60 [preauth]
> Jun 17 18:20:23 sql01a corosync[14051]: [CONFDB] confdb.c:405 lib_init_fn: conn=0x663570
> Jun 17 18:20:23 sql01a corosync[14051]: [CONFDB] confdb.c:411 exit_fn for conn=0x663570
> Jun 17 18:20:27 sql01a stonith-ng[14057]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
> Jun 17 18:20:27 sql01a stonith-ng[14057]: notice: crm_update_peer_state: plugin_handle_membership: Node sql01a[17499146] - state is now member (was (null))
> Jun 17 18:20:27 sql01a stonith-ng[14057]: notice: get_node_name: Could not obtain a node name for classic openais (with plugin) nodeid 34276362
> Jun 17 18:20:27 sql01a stonith-ng[14057]: notice: crm_update_peer_state: plugin_handle_membership: Node (null)[34276362] - state is now member (was (null))
> Jun 17 18:20:27 sql01a stonith-ng[14057]: notice: plugin_handle_membership: Membership 1452: quorum acquired
> [snip]
> ———————
>
> Some more info:
> ———————
> sql01a:~ # corosync-cfgtool -s
> Printing ring status.
> Local node ID 17499146
> RING ID 0
>         id      = 10.4.11.1
>         status  = ring 0 active with no faults
> RING ID 1
>         id      = 10.255.4.1
>         status  = ring 1 active with no faults
> sql01a:~ # corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.17499146.ip=r(0) ip(10.4.11.1) r(1) ip(10.255.4.1)
> runtime.totem.pg.mrp.srp.members.17499146.join_count=1
> runtime.totem.pg.mrp.srp.members.17499146.status=joined
> runtime.totem.pg.mrp.srp.members.34276362.ip=r(0) ip(10.4.11.2) r(1) ip(10.255.4.2)
> runtime.totem.pg.mrp.srp.members.34276362.join_count=1
> runtime.totem.pg.mrp.srp.members.34276362.status=joined
> ———————
> This is similar to the output on the other (working) host.
>
> What is interesting:
> ———————
> sql01a:~ # rcopenais status
> Running
> sql01a:~ # ps ax | grep -E "pace|heart|coro" | grep -v grep
> 14051 ?        Ssl    0:00 /usr/sbin/corosync
> 14057 ?        S<     0:00 /usr/lib64/pacemaker/stonithd
> 14058 ?        S<     0:00 /usr/lib64/pacemaker/lrmd
> 14062 ?        S<     0:00 /usr/lib64/heartbeat/mgmtd
> sql01a:~ #
> ———————
>
>
> To dig further into what's going wrong, I then started the missing processes
> (cib, attrd, pengine and crmd) from the command line (using the -V switch).
> cib, attrd and pengine start fine (even crm now shows a result, although a
> "wrong" one), but crmd fails:
>
> ———————
> sql01a:~ # /usr/lib64/pacemaker/crmd -V
> notice: main: CRM Git Version: f3eeaf4
> info: do_log: FSA: Input I_STARTUP from crmd_init() received in state S_STARTING
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_clm' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_evt' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_ckpt' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_amf_v2' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_msg' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_lck' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'openais_tmr' for option: name
> info: config_find_next: Processing additional service options...
> info: get_config_opt: Found 'pacemaker' for option: name
> info: get_config_opt: Found '0' for option: ver
> info: get_cluster_type: Detected an active 'classic openais (with plugin)' cluster
> info: do_cib_control: CIB connection established
> notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> info: init_cs_connection_classic: Creating connection to our Corosync plugin
> info: init_cs_connection_classic: AIS connection established
> notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
> info: plugin_get_details: Server details: id=17499146 uname=sql01a cname=pcmk
> info: crm_get_peer: Created entry 6002e998-b8ba-4daa-9c5d-1082bd1ab73a/0x65b8a0 for node sql01a/17499146 (1 total)
> info: crm_get_peer: Node 17499146 is now known as sql01a
> info: peer_update_callback: sql01a is now (null)
> info: crm_get_peer: Node 17499146 has uuid sql01a
> info: crm_update_peer_proc: init_cs_connection_classic: Node sql01a[17499146] - unknown is now online
> info: peer_update_callback: Client sql01a/peer now has status [online] (DC=<null>)
> info: init_cs_connection_once: Connection to 'classic openais (with plugin)': established
> notice: get_node_name: Defaulting to uname -n for the local classic openais (with plugin) node name
> info: do_ha_control: Connected to the cluster
> info: lrmd_ipc_connect: Connecting to lrmd
> info: do_lrm_control: LRM connection established
> info: do_started: Delaying start, no membership data (0000000000100000)
> info: do_started: Delaying start, no membership data (0000000000100000)
> info: do_started: Delaying start, no membership data (0000000000100000)
> notice: plugin_handle_membership: Membership 1460: quorum acquired
> info: crm_update_peer_proc: plugin_handle_membership: Node sql01a[17499146] - unknown is now member
> info: peer_update_callback: Client sql01a/peer now has status [offline] (DC=<null>)
> notice: peer_update_callback: Our peer connection failed
> notice: crm_update_peer_state: plugin_handle_membership: Node sql01a[17499146] - state is now member (was (null))
> info: peer_update_callback: sql01a is now member (was (null))
> info: crm_update_peer: plugin_handle_membership: Node sql01a: id=17499146 state=member addr=r(0) ip(10.4.11.1) r(1) ip(10.255.4.1) (new) votes=1 (new) born=1460 seen=1460 proc=00000000000000000000000000140012
> info: crm_get_peer: Created entry 8925fb86-2b70-4c91-af03-a8251a9182fc/0x65ecf0 for node sql01b/34276362 (2 total)
> info: crm_get_peer: Node 34276362 is now known as sql01b
> info: peer_update_callback: sql01b is now (null)
> info: crm_get_peer: Node 34276362 has uuid sql01b
> info: crm_update_peer_proc: plugin_handle_membership: Node sql01b[34276362] - unknown is now member
> info: peer_update_callback: Client sql01b/peer now has status [online] (DC=<null>)
> notice: crm_update_peer_state: plugin_handle_membership: Node sql01b[34276362] - state is now member (was (null))
> info: peer_update_callback: sql01b is now member (was (null))
> info: crm_update_peer: plugin_handle_membership: Node sql01b: id=34276362 state=member addr=r(0) ip(10.4.11.2) r(1) ip(10.255.4.2) (new) votes=1 (new) born=1436 seen=1460 proc=00000000000000000000000000151312
> error: reap_dead_nodes: We're not part of the cluster anymore
> info: plugin_handle_membership: Membership 1460: quorum retained
> error: do_log: FSA: Input I_ERROR from peer_update_callback() received in state S_STARTING
> notice: do_state_transition: State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> warning: do_recover: Fast-tracking shutdown in response to errors
> error: do_started: Start cancelled... S_RECOVERY
> error: do_log: FSA: Input I_ERROR from reap_dead_nodes() received in state S_RECOVERY
> info: do_dc_release: DC role released
> info: do_te_control: Transitioner is now inactive
> error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
> info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
> notice: lrm_state_verify_stopped: Stopped 0 recurring operations at shutdown (0 ops remaining)
> info: do_lrm_control: Disconnecting from the LRM
> info: lrmd_api_disconnect: Disconnecting from lrmd service
> info: lrmd_ipc_connection_destroy: IPC connection destroyed
> info: lrm_connection_destroy: LRM Connection disconnected
> info: lrmd_api_disconnect: Disconnecting from lrmd service
> notice: do_lrm_control: Disconnected from the LRM
> info: crm_cluster_disconnect: Disconnecting from cluster infrastructure: classic openais (with plugin)
> notice: terminate_cs_connection: Disconnecting from Corosync
> info: cluster_disconnect_cpg: No CPG connection
> info: crm_cluster_disconnect: Disconnected from classic openais (with plugin)
> info: do_ha_control: Disconnected from the cluster
> info: do_cib_control: Disconnecting CIB
> info: crmd_cib_connection_destroy: Connection to the CIB terminated...
> info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
> info: do_exit: [crmd] stopped (0)
> info: crmd_exit: Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
> info: crmd_exit: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
> info: crmd_cs_destroy: connection closed
> info: crmd_init: 13510 stopped: OK (0)
> error: crmd_fast_exit: Could not recover from internal error
> info: crm_xml_cleanup: Cleaning up memory from libxml2
> sql01a:~ #
> ———————
>
> What I tried up to now:
> - Stopped sql01a, removed all cib.xml and crm.xml files in their folders and
>   restarted sql01a (I know this is quite harsh, but what did I have to
>   lose ...) -> the config on sql01a was taken over from sql01b again, but crmd
>   always dies as before (same error messages).
> - Removed all cluster-related packages (pacemaker, corosync, drbd, ...) from
>   sql01a, deleted the remaining dirs and config files, and reinstalled
>   everything with copies from sql01b -> same result.
> - Removed the whole crm config from sql01b (the working cluster host) using
>   cibadmin -Q --force and re-added the config (using crm "commands") -> same
>   result.
> - Of course rebooted/restarted both servers several times -> nothing changes ...
>
>
> Since deleting the config from sql01a (several times) did not help, I thought
> the error must be in the config on, or the communication with, sql01b, but I
> also removed and re-added the config there. In addition, there is no firewall
> between the two nodes and they can communicate.
>
> Does anybody know/understand what is going on here?
> How can I solve this issue and get my cluster back to working as it has for
> quite a long time until recently?
>
>
> Thanks in advance for any hints!
>
> Best regards,
> Markus
>
>
> -- .- .-. -.- ..- ... / -.- .-. .- ..- ... .
> Markus Krause
> Max Planck Institute of Biochemistry
>  - Computing Center (core services, networking, OS X)
> Am Klopferspitz 18
> D-82152 Martinsried, Germany
> Phone: +49(89)8578-2825
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
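Since this all started with a "zypper up", it is also worth confirming that
both nodes really ended up on identical cluster stack versions; a partially
applied update on one node would explain a lot. A quick comparison (the
package name pattern below is only a guess, adjust it to what is actually
installed):

  # run on both nodes and diff the results
  rpm -qa | grep -Ei "pacemaker|corosync|openais|libqb|cluster-glue|resource-agents" | sort

If the lists differ, I'd complete the update on both nodes and restart the
stack on both before digging any deeper.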
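The other detail that stands out is that cib, attrd and pengine exit with
rc=100 as soon as the plugin spawns them, yet start happily when you run them
by hand (presumably as root). If I remember correctly, the plugin starts those
daemons as the hacluster user, so an ownership or permission problem under the
state directories would fit that pattern. A quick check (the paths below are
assumptions; depending on the package build the CIB lives in
/var/lib/pacemaker/cib or, on older builds, /var/lib/heartbeat/crm):

  id hacluster
  ls -ld /var/lib/pacemaker /var/lib/pacemaker/cib /var/lib/heartbeat/crm 2>/dev/null
  ls -l /var/lib/pacemaker/cib/ /var/lib/heartbeat/crm/ 2>/dev/null
  # the CIB files are normally owned by hacluster:haclient

It would also help to see what cib itself logs right before it exits; the
corosync messages above only show the exit code.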