Hello Andrew and cluster folks! Over the last month we have been experiencing a weird problem with the cib process on one of our nodes ('gandalf') in a 4-node cluster. Brief description:
For some reason we still can't determine, the cib process begins looping infinitely and consuming 100% CPU. After that, the crm_mon command can't connect to pacemaker and returns this output:

    Could not establish cib_ro connection: Resource temporarily unavailable (11)
    Connection to cluster failed: Transport endpoint is not connected

The pacemaker processes look like this:

    [DB1] gandalf # psx | grep pacemaker
    root  25966  0.0 0.0  80480  2840 ? S  May23    0:15 pacemakerd
    496   25972 83.8 0.0 111632 27888 ? Rs May23 4045:07  \_ /usr/libexec/pacemaker/cib
    root  25973  0.0 0.0 101716 12424 ? Ss May23    0:19  \_ /usr/libexec/pacemaker/stonithd
    root  25974  0.0 0.0  76644  3552 ? Ss May23    0:30  \_ /usr/libexec/pacemaker/lrmd
    496   25975  0.0 0.0  89624  3368 ? Ss May23    0:15  \_ /usr/libexec/pacemaker/attrd
    496   25976  0.0 0.0  81172  2568 ? Ss May23    0:14  \_ /usr/libexec/pacemaker/pengine
    root  25977  0.0 0.0 107700  7116 ? Ss May23    0:17  \_ /usr/libexec/pacemaker/crmd

The cluster is still operating normally with all resources running, and this node is reported alive by the other 3 members:

    [VM2] lorien # crm_mon -1
    Last updated: Mon May 26 16:38:18 2014
    Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
    Stack: cman
    Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
    Version: 1.1.10-14.el6_5.2-368c726
    4 Nodes configured
    87 Resources configured

    Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]

The node is in that state right now. I don't want to move resources because I don't know how the cluster will react. If you want me to run tests or collect logs, I'll leave the node as it is so we can run any test you want. Logging stopped just before the cib process started looping.
The last messages are:

    May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_compress_string: Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
    May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_client_destroy: Destroying 0 events
    May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop cib: info: crm_client_new: Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd

Finally, a cib process in this state can't be killed, not even with "-9". We have to reboot the node to clean up pacemaker and start again.

The system is CentOS 6 with official packages. We had pacemaker-1.1.10-14.el6_5.2.x86_64. After the last reboot we upgraded to pacemaker-1.1.10-14.el6_5.3.x86_64 and the problem still exists.

Have any of you experienced something like this? Thanks in advance for any help!

Cheers

--
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org