Hello Andrew and cluster folks!

For the last month we have been experiencing a weird problem with the cib
process on one of our nodes ('gandalf') in a 4-node cluster. Brief description:

For some reason we still can't determine, the cib process begins looping
indefinitely and consuming 100% CPU. After that, the crm_mon command can't
connect to pacemaker and returns this output:

Could not establish cib_ro connection: Resource temporarily unavailable (11)

Connection to cluster failed: Transport endpoint is not connected

The Pacemaker processes look like this:

[DB1] gandalf # psx | grep pacemaker
root     25966  0.0  0.0  80480  2840 ?        S    May23   0:15 pacemakerd
496      25972 83.8  0.0 111632 27888 ?        Rs   May23 4045:07  \_ /usr/libexec/pacemaker/cib
root     25973  0.0  0.0 101716 12424 ?        Ss   May23   0:19  \_ /usr/libexec/pacemaker/stonithd
root     25974  0.0  0.0  76644  3552 ?        Ss   May23   0:30  \_ /usr/libexec/pacemaker/lrmd
496      25975  0.0  0.0  89624  3368 ?        Ss   May23   0:15  \_ /usr/libexec/pacemaker/attrd
496      25976  0.0  0.0  81172  2568 ?        Ss   May23   0:14  \_ /usr/libexec/pacemaker/pengine
root     25977  0.0  0.0 107700  7116 ?        Ss   May23   0:17  \_ /usr/libexec/pacemaker/crmd

The cluster is still operating normally with all resources running, and this
node is reported as online by the other 3 members:

[VM2] lorien # crm_mon -1
Last updated: Mon May 26 16:38:18 2014
Last change: Fri May 23 08:02:17 2014 via cibadmin on lorien.san01.cooperativaobrera.coop
Stack: cman
Current DC: lorien.san01.cooperativaobrera.coop - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
4 Nodes configured
87 Resources configured

Online: [ gandalf.san01.cooperativaobrera.coop isildur.san01.cooperativaobrera.coop lorien.san01.cooperativaobrera.coop mordor.san01.cooperativaobrera.coop ]

The node is in that state right now. I don't want to move resources because I
don't know how the cluster will react. If you want me to run any tests or
collect logs, I'll leave the node as it is so we can do that.

Logging stopped just before the cib process started looping. The last messages
are:

May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_compress_string: Compressed 258760 bytes into 14072 (ratio 18:1) in 67ms
May 23 21:02:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_destroy: Destroying 0 events
May 23 21:03:01 [25972] gandalf.cooperativaobrera.coop        cib:     info: crm_client_new: Connecting 0x2ddec30 for uid=0 gid=0 pid=17759 id=2ce51a5a-a70e-4b24-8726-b97b6f9013fd

Finally, the cib process can't be killed in this state, not even with
"kill -9". We have to reboot the node to clean up pacemaker and start again.
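
In case it helps anyone compare notes before a reboot, here is a minimal
sketch for recording the kernel's state letter for a process. It assumes the
standard Linux /proc/<pid>/stat layout; the function names are made up for
illustration and are not part of pacemaker. The ps output above shows cib as
"Rs" (running), whereas a process that ignores SIGKILL is often found in "D"
(uninterruptible sleep), so the state seemed worth capturing.

```python
from pathlib import Path


def parse_stat_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    # The first whitespace-separated field after the name is the state,
    # e.g. 'R' (running), 'S' (sleeping), 'D' (uninterruptible), 'Z' (zombie).
    return after_comm.split()[0]


def read_state(pid: int) -> str:
    """Read the state of a live process (hypothetical helper, Linux only)."""
    return parse_stat_state(Path(f"/proc/{pid}/stat").read_text())


# Example call on the affected node (PID taken from the ps output above):
#   read_state(25972)
```

Splitting on the last ")" rather than on whitespace is deliberate, since a
process name containing spaces would otherwise shift the fields.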

The system is CentOS 6 with official packages. We were running
pacemaker-1.1.10-14.el6_5.2.x86_64. After the last reboot we upgraded to
pacemaker-1.1.10-14.el6_5.3.x86_64, and the problem still occurs.

Have any of you experienced something like this?

Thanks in advance for any help!

Cheers

-- 
Lic. Gabriel Gomiz - Jefe de Sistemas / Administrador
ggo...@cooperativaobrera.coop
Gerencia de Sistemas - Cooperativa Obrera Ltda.
Tel: (0291) 403-9700



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
