[
https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bill Burcham updated GEODE-8809:
--------------------------------
Description:
We see this characteristic failure in a number of proprietary applications:
* member stops sending heartbeats
* the coordinator requests an availability check from the member
* the member receives the request after a delay
* the delay causes the server to be kicked out (it receives a
ForcedDisconnectException)
* operations fail
* the server reconnects
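The timing in the sequence above can be sketched as a single deadline comparison. This is only an illustration, not Geode's failure-detection code: the class `AvailabilityCheck`, the method `evaluate`, and the 5000 ms timeout are assumptions made up for the example.

```java
/** Hypothetical model of the availability check; not part of Geode. */
public final class AvailabilityCheck {

  /** Result of one availability check, as seen by the coordinator. */
  public enum Outcome { STILL_MEMBER, FORCED_OUT }

  private AvailabilityCheck() {}

  /**
   * Compares the delay between sending the check and receiving the reply
   * against the allowed timeout. All values are in milliseconds.
   */
  public static Outcome evaluate(long checkSentMillis, long replyReceivedMillis,
      long timeoutMillis) {
    long delayMillis = replyReceivedMillis - checkSentMillis;
    return delayMillis <= timeoutMillis ? Outcome.STILL_MEMBER : Outcome.FORCED_OUT;
  }

  public static void main(String[] args) {
    long timeoutMillis = 5_000; // assumed value, for illustration only
    // Reply arrives 1.2 s after the check was sent: the member stays.
    System.out.println(evaluate(0, 1_200, timeoutMillis));
    // Reply arrives 12 s after the check was sent: the member is kicked out.
    System.out.println(evaluate(0, 12_000, timeoutMillis));
  }
}
```

In this model a healthy member is still forced out whenever the request or its reply is delayed past the deadline, which is the pattern the list describes.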
Usually when the failure detector/health monitor kicks a member out of the
distributed system it is for one of these reasons:
1. Member really was malfunctioning or unreachable (i.e. something outside of
   health monitoring had a problem)
   a. Network problems
      i. Partition: 2-way, N-way
      ii. Slowdown or error rate increase
   b. CPU was over-taxed in the faulty member. We see gaps on the order of 10
      seconds or more in heartbeat generation on that member.
      i. Geode was running in a virtualized environment and the virtualization
         system didn’t give the Geode process sufficient CPU
      ii. JVM memory was over-utilized, so garbage collection pauses took too
          long
      iii. There was simply too much CPU demand and the product failed to
           reserve enough CPU capacity to keep the heartbeat going
This ticket captures situations where the failure detector causes a member to
be kicked out *but we cannot prove definitively that any of these is the root
cause*.
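When triaging such a case, one way to look for the heartbeat-generation gaps mentioned above is to scan the times at which a member sent its heartbeats and flag any pause longer than a threshold. The sketch below is an illustration only: the class `HeartbeatGaps`, its `findGaps` method, and the sample timestamps are hypothetical, and how the send times are obtained (logs, statistics) is left open.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical triage helper; not part of Geode. */
public final class HeartbeatGaps {

  /** A pause between two consecutive heartbeat sends, in milliseconds. */
  public static final class Gap {
    public final long startMillis;
    public final long endMillis;

    Gap(long startMillis, long endMillis) {
      this.startMillis = startMillis;
      this.endMillis = endMillis;
    }

    public long lengthMillis() {
      return endMillis - startMillis;
    }
  }

  private HeartbeatGaps() {}

  /**
   * Returns every pause between consecutive send times that is strictly
   * longer than thresholdMillis. Send times must be in ascending order.
   */
  public static List<Gap> findGaps(long[] sendTimesMillis, long thresholdMillis) {
    List<Gap> gaps = new ArrayList<>();
    for (int i = 1; i < sendTimesMillis.length; i++) {
      long deltaMillis = sendTimesMillis[i] - sendTimesMillis[i - 1];
      if (deltaMillis > thresholdMillis) {
        gaps.add(new Gap(sendTimesMillis[i - 1], sendTimesMillis[i]));
      }
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Made-up send times: regular heartbeats, then a 12 s pause.
    long[] sendTimesMillis = {0, 2_500, 5_000, 17_000, 19_500};
    for (Gap gap : findGaps(sendTimesMillis, 10_000)) {
      System.out.println(
          "gap of " + gap.lengthMillis() + " ms starting at " + gap.startMillis);
    }
    // prints "gap of 12000 ms starting at 5000"
  }
}
```

A gap found this way points toward reason 1.b (CPU starvation on the member). If the send times show no gap yet the member was still kicked out, the case is closer to what this ticket is meant to capture.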
was:
* member stops sending heartbeats
* the coordinator requests an availability check from the member
* the member receives the request after a delay
* the delay causes the server to be kicked out
* operations fail
* the server reconnects
Usually when the failure detector/health monitor kicks a member out of the
distributed system it is for one of these reasons:
1. Member really was malfunctioning or unreachable (i.e. something outside of
   health monitoring had a problem)
   a. Network problems
      i. Partition: 2-way, N-way
      ii. Slowdown or error rate increase
   b. CPU was over-taxed in the faulty member. We see gaps on the order of 10
      seconds or more in heartbeat generation on that member.
      i. Geode was running in a virtualized environment and the virtualization
         system didn’t give the Geode process sufficient CPU
      ii. JVM memory was over-utilized, so garbage collection pauses took too
          long
      iii. There was simply too much CPU demand and the product failed to
           reserve enough CPU capacity to keep the heartbeat going
This ticket captures situations where the failure detector causes a member to
be kicked out *but we cannot prove definitively that any of these is the root
cause*.
> Servers are missing heartbeats from a member
> --------------------------------------------
>
> Key: GEODE-8809
> URL: https://issues.apache.org/jira/browse/GEODE-8809
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Reporter: Nabarun Nag
> Assignee: Bill Burcham
> Priority: Major
> Labels: blocks-1.14.0