Dear all,

I'm having a bit of a peculiar problem with Heartbeat. I'm installing a new HA
cluster with Heartbeat and Pacemaker. Right now only one node is installed,
because I'm preparing the install image for the entire cluster.
Since the last restart of the machine, heartbeat has been stuck in what looks
like an endless loop, sending broadcasts to the network at a very high rate.
Essentially the process sends as fast as the CPU allows (around 3 kHz, i.e.
roughly 3000 broadcasts per second).
I can't rule out a misconfiguration on my side, but it looks more like a bug.
The machine is still running in that state, in case you want me to try some
debugging.

Thanks for looking at this,
  Rainer Schwemmer

Details about the configuration:

The resource configuration is just an empty cluster with one node and no 
resources running.

The versions of the programs I am using:
The server is installed with RHEL 5.5, kernel 2.6.18-128.1.6.el5 #1 SMP.

--------------------------------
These are the heartbeat versions:
heartbeat-mgmt-2.0.1-1.lhcb
heartbeat-debuginfo-3.0.3-2.3.el5
heartbeat-3.0.3-2.3.el5
heartbeat-libs-3.0.3-2.3.el5
heartbeat-mgmt-debuginfo-2.0.1-1.lhcb
heartbeat-devel-3.0.3-2.3.el5
cluster-glue-1.0.6-1.6.el5
cluster-glue-libs-1.0.6-1.6.el5
(heartbeat-mgmt is actually the Pacemaker management module.)

--------------------------------
The pacemaker packages:
pacemaker-debuginfo-1.0.10-1.4.el5
pacemaker-1.0.10-1.4.el5
pacemaker-libs-devel-1.0.10-1.4.el5
pacemaker-libs-1.0.10-1.4.el5

--------------------------------
Output of top on the machine:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9301 root      -2   0 71260  11m 8008 R 100.1  0.0   5905:36 heartbeat: master control process
10988 hacluste  15   0 73928 3036 2456 S 56.1  0.0   3853:50 /usr/lib64/heartbeat/crmd
 9319 root      -2   0 69392 9.8m 8008 S 11.6  0.0   1030:46 heartbeat: write: bcast eth2
 9320 root      -2   0 69392 9.8m 8008 S  5.9  0.0 493:14.62 heartbeat: read: bcast eth2
16594 root      15   0 13264 1712  816 R  1.0  0.0   0:00.82 top

--------------------------------
ha.cf:
debugfile /var/log/ha-debug
logfile /var/log/ha-log

keepalive                       1
warntime                        8
deadtime                        20
initdead                        40

bcast  eth2
auto_failback on
autojoin any
crm yes
apiauth ipfail uid=hacluster
apiauth ccm uid=hacluster,root
apiauth cms uid=hacluster,root
apiauth ping gid=haclient uid=hacluster,root
apiauth default gid=haclient uid=hacluster,root
apiauth mgmtd uid=root,ebonacco,rainer,hacluster
respawn         root    /usr/lib64/heartbeat/mgmtd -v
conn_logd_time 60

We are using broadcasts here because we had some trouble with multicast on our
switches.
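In case it helps rule out a timing misconfiguration: a quick sanity check of
the timing values above, based on my reading of the usual guidelines (warn
before declaring dead, and initdead at least twice deadtime) -- not
authoritative, just what I checked on my side.

```python
# Timing values from the ha.cf above (seconds)
keepalive = 1
warntime = 8
deadtime = 20
initdead = 40

# Commonly recommended relationships (my reading of the docs, not gospel):
# warn well before declaring a node dead, and give nodes extra time at boot.
assert keepalive < warntime < deadtime
assert initdead >= 2 * deadtime
print("timing relationships look sane")
```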

--------------------------------
This is what is inside the broadcasts it sends:
>>>
__name__=create_request_adv
__name__=create_request_adv
origin=te_rsc_command
t=crmd
version=3.0.1
subt=request
reference=lrm_invoke-tengine-1301706874-335
crm_task=lrm_invoke
crm_sys_to=lrmd
crm_sys_from=tengine
crm_host_to=store07.lbdaq.cern.ch
dest=store07.lbdaq.cern.ch
oseq=42e6c2c5
from_id=crmd
to_id=crmd
client_gen=4
src=store07
seq=42ec220d
hg=4d8f206f
ts=4d999a1a
ld=1.67 1.87 2.05 4/883 16221
ttl=3
auth=1 d20f08208a30acc15a0492d438a9eec6
crm_xml=<crm_xml><rsc_op id="2" operation="probe_complete" 
operation_key="probe_complete" on_node="store07.lbdaq.cern.ch" 
on_node_uuid="bd27618a-860b-43e8-93a1-6706b79fbea5" 
transition-key="2:164:0:ab8d208d-09e4-4b10-b5db-103808bad101"><attributes 
CRM_meta_op_no_wait="true" crm_feature_set="3.0.1"/></rsc_op></crm_xml>
client_gen=4
(1)destuuid=vSdhioYLQ+iToWcGt5++pQ==
(1)srcuuid=vSdhioYLQ+iToWcGt5++pQ==
<<<
.>>>
__name__=create_request_adv
__name__=create_request_adv
origin=te_rsc_command
t=crmd
version=3.0.1
subt=request
reference=lrm_invoke-tengine-1301572752-37
crm_task=lrm_invoke
crm_sys_to=lrmd
crm_sys_from=tengine
crm_host_to=store07.lbdaq.cern.ch
dest=store07.lbdaq.cern.ch
oseq=42e6c2c6
from_id=crmd
to_id=crmd
client_gen=4
src=store07
seq=42ec220e
hg=4d8f206f
ts=4d999a1a
ld=1.67 1.87 2.05 2/883 16221
ttl=3
auth=1 35e66064c82f6e9c9b0f87f9825ead32
crm_xml=<crm_xml><rsc_op id="2" operation="probe_complete" 
operation_key="probe_complete" on_node="store07.lbdaq.cern.ch" 
on_node_uuid="bd27618a-860b-43e8-93a1-6706b79fbea5" 
transition-key="2:15:0:ab8d208d-09e4-4b10-b5db-103808bad101"><attributes 
CRM_meta_op_no_wait="true" crm_feature_set="3.0.1"/></rsc_op></crm_xml>
client_gen=4
(1)destuuid=vSdhioYLQ+iToWcGt5++pQ==
(1)srcuuid=vSdhioYLQ+iToWcGt5++pQ==
<<<
.>>>
t=NS_ackmsg
dest=store07
ackseq=42ec220f
(1)destuuid=vSdhioYLQ+iToWcGt5++pQ==
src=store07
(1)srcuuid=vSdhioYLQ+iToWcGt5++pQ==
hg=4d8f206f
ts=4d999a1a
ttl=3
auth=1 30635c8380404531c17c874c0bbda22d
<<<
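One thing I noticed while staring at the dumps: the base64 destuuid and
srcuuid fields are identical, and they decode to the node's own on_node_uuid
from the crm_xml payload -- so the node is flooding the wire with messages
addressed to itself. Quick check with the values copied from the dump above:

```python
import base64
import uuid

destuuid = "vSdhioYLQ+iToWcGt5++pQ=="
srcuuid = "vSdhioYLQ+iToWcGt5++pQ=="

# Both fields carry the same 16-byte value ...
assert destuuid == srcuuid
decoded = uuid.UUID(bytes=base64.b64decode(destuuid))
print(decoded)  # bd27618a-860b-43e8-93a1-6706b79fbea5

# ... which is exactly the on_node_uuid from the crm_xml payload:
assert str(decoded) == "bd27618a-860b-43e8-93a1-6706b79fbea5"
```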

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems