[ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Burcham updated GEODE-8809:
--------------------------------
Description:
* A member stops sending heartbeats.
* The coordinator requests an availability check from that member.
* The member receives the request after a delay.
* The delay causes the server to be kicked out.
* Operations fail.
* The server reconnects.
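The sequence above can be read as a two-stage timeout: a missed-heartbeat window followed by a final availability check. The following is a minimal sketch under stated assumptions (one shared timeout value, millisecond timestamps supplied by the caller). The class and method names are hypothetical and are not Geode's membership API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Geode's failure-detection code.
class HeartbeatMonitorSketch {
    // Assumed single timeout used for both heartbeat gaps and check replies.
    private final long memberTimeoutMillis;
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();

    HeartbeatMonitorSketch(long memberTimeoutMillis) {
        this.memberTimeoutMillis = memberTimeoutMillis;
    }

    // Record the time at which a heartbeat from the member was received.
    void recordHeartbeat(String member, long nowMillis) {
        lastHeartbeatMillis.put(member, nowMillis);
    }

    // A member is suspect if it has never sent a heartbeat or if the gap
    // since its last heartbeat exceeds the timeout.
    boolean isSuspect(String member, long nowMillis) {
        Long last = lastHeartbeatMillis.get(member);
        return last == null || nowMillis - last > memberTimeoutMillis;
    }

    // Final availability check: a suspect member is removed only if its reply
    // to the check also arrives later than the timeout allows.
    boolean shouldRemove(String member, long checkSentMillis, long replyReceivedMillis) {
        boolean replyTooLate = replyReceivedMillis - checkSentMillis > memberTimeoutMillis;
        return isSuspect(member, checkSentMillis) && replyTooLate;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(5000);
        monitor.recordHeartbeat("server1", 0);
        // The member receives the check after a delay, so its reply is late.
        System.out.println(monitor.shouldRemove("server1", 6000, 12000));
    }
}
```

In this sketch a healthy member that is merely slow to receive the check is indistinguishable from a dead one, which is the situation the steps above describe.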
Usually when the failure detector/health monitor kicks a member out of the
distributed system it is for one of these reasons:
# Member really was malfunctioning or unreachable (i.e. something outside of
health monitoring had a problem)
## Network problems
### Partition: 2-way, N-way
### Slowdown or error rate increase
## CPU was over-taxed in the faulty member. We see gaps on the order of 10s or
more in heartbeat generation on that member.
### Geode was running in a virtualized environment and the virtualization
system didn’t give the Geode process sufficient CPU
### JVM memory was over-utilized so garbage collection (pauses) took too long
### There was simply too much CPU demand and the product failed to reserve
enough CPU capacity to keep the heartbeat going
This ticket captures situations where the failure detector causes a member to
be kicked out *but we cannot prove definitively that any of these is the root
cause*.
was:
* The coordinator is requesting availability test from a member,
* member gets it after a delay
* the delay causes the server to be kicked out.
* operations fail.
* server reconnects.
We need to figure out why the delay occurs and handle the disconnect.
> Servers are missing heartbeats from a member
> --------------------------------------------
>
> Key: GEODE-8809
> URL: https://issues.apache.org/jira/browse/GEODE-8809
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Reporter: Nabarun Nag
> Assignee: Bill Burcham
> Priority: Major
> Labels: blocks-1.14.0
>