[Pacemaker] Corosync 1.4.7: zombie (defunct)

Sergey Arlashin Mon, 29 Dec 2014 03:45:36 -0800

Hi!
Recently I noticed that one of my nodes was shown as OFFLINE in the 'crm status' 
output, although it was actually up: I could ssh into the node and run 'crm 
status' from its console. After some time it came back online. The same thing 
has happened several times, for no obvious reason, on other nodes as well.


Still, there are no error or fatal messages in the logs. The only warnings I 
could find in corosync.log are the following:

Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1346 -> 0.233.1347 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1347 -> 0.233.1348 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1348 -> 0.233.1349 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1349 -> 0.233.1350 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1350 -> 0.233.1351 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1351 -> 0.233.1352 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1352 -> 0.233.1353 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1353 -> 0.233.1354 not applied to 0.233.1354: current "num_updates" is greater than required
Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 491 for last-failure-Cachier=1419729443 failed: Application of an update diff failed
Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 494 for fail-count-Cachier=1 failed: Application of an update diff failed
Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 497 for probe_complete=true failed: Application of an update diff failed
Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 500 for last-failure-Cachier=1419729443 failed: Application of an update diff failed
Dec 29 10:56:34 lb-node2 attrd: [2240]: WARN: attrd_cib_callback: Update 503 for fail-count-Cachier=1 failed: Application of an update diff failed
Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1338 -> 0.233.1339 not applied to 0.233.1382: current "num_updates" is greater than required
Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1339 -> 0.233.1340 not applied to 0.233.1382: current "num_updates" is greater than required
Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1340 -> 0.233.1341 not applied to 0.233.1382: current "num_updates" is greater than required
Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1341 -> 0.233.1342 not applied to 0.233.1382: current "num_updates" is greater than required
Dec 29 10:56:37 lb-node2 cib: [2238]: WARN: cib_process_diff: Diff 0.233.1342 -> 0.233.1343 not applied to 0.233.1382: current "num_updates" is greater than required
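
To see how far behind the rejected diffs are, the warnings above can be 
summarised with a short script. This is only a rough sketch, assuming Python 3 
is available; the function name stale_diffs and the "lag" figure are my own 
illustration, not anything Pacemaker provides. It reads the three dotted 
numbers as admin_epoch.epoch.num_updates and reports how many updates the 
local CIB is ahead of each diff it refused.

```python
import re

# Matches cib_process_diff warnings such as:
#   Diff 0.233.1346 -> 0.233.1347 not applied to 0.233.1354
# \s+ tolerates line wrapping inserted by a mail client.
DIFF_RE = re.compile(
    r"Diff\s+(\d+)\.(\d+)\.(\d+)\s+->\s+(\d+)\.(\d+)\.(\d+)\s+"
    r"not\s+applied\s+to\s+(\d+)\.(\d+)\.(\d+)"
)


def stale_diffs(log_text):
    """Return (diff_target, local_version, lag) for each rejected diff.

    Versions are (admin_epoch, epoch, num_updates) tuples. "lag" is a
    hypothetical helper value: local num_updates minus the diff's target
    num_updates, or None when admin_epoch/epoch differ.
    """
    results = []
    for match in DIFF_RE.finditer(log_text):
        nums = tuple(int(g) for g in match.groups())
        target, local = nums[3:6], nums[6:9]
        if target[:2] == local[:2]:
            lag = local[2] - target[2]
        else:
            lag = None
        results.append((target, local, lag))
    return results
```

Running stale_diffs() over the contents of corosync.log gives one tuple per 
warning, which makes it easier to tell whether the node is just receiving old 
diffs late or is diverging from the rest of the cluster.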

After inspecting the corosync processes with ps, I found that all my nodes 
have zombie corosync processes like these:

root     13892  0.0  0.0      0     0 ?        Z    Dec26   0:04 [corosync] <defunct>
root     21793  0.0  0.0      0     0 ?        Z    Dec26   0:00 [corosync] <defunct>
root     27009  1.3  1.0 714292 10784 ?        Ssl  Dec18 223:38 /usr/sbin/corosync
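
The ps output above does not show which process is the parent of the zombies. 
A minimal sketch of how they could be listed together with their parent PIDs, 
assuming Python 3 and a procps-style ps that accepts `-eo pid,ppid,stat,comm` 
(the function name find_zombies is made up for illustration):

```python
import subprocess


def find_zombies(ps_output):
    """Return (pid, ppid, command) for each process whose state starts with Z.

    Expects the text printed by `ps -eo pid,ppid,stat,comm`, header included.
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, stat, comm = fields
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


if __name__ == "__main__":
    # Assumes a procps-style ps is on PATH.
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, ppid, comm in find_zombies(out):
        print(f"zombie pid={pid} parent={ppid} cmd={comm}")
```

If the parent PID reported for the zombies is the running /usr/sbin/corosync, 
that would suggest corosync forked the children and has not yet reaped them.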

Is it OK to have zombie corosync processes on the nodes, or does it suggest 
that something is going wrong?

Thanks in advance

--
Best regards,
Sergey Arlashin





_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
