[
https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Murmann updated GEODE-8809:
-------------------------------------
Labels: (was: blocks-1.14.0)
> Member Stops Sending Heartbeats, CPU Saturation Cannot Be Proven
> ----------------------------------------------------------------
>
> Key: GEODE-8809
> URL: https://issues.apache.org/jira/browse/GEODE-8809
> Project: Geode
> Issue Type: Bug
> Components: messaging
> Reporter: Nabarun Nag
> Assignee: Bill Burcham
> Priority: Major
>
> We see this characteristic failure in a number of proprietary applications:
> * A member stops sending heartbeats.
> * The coordinator requests an availability check from that member.
> * The member receives the request after a delay.
> * The delay causes the server to be kicked out (it receives a
> ForcedDisconnectException).
> * Operations fail.
> * The server reconnects.
> Usually when the failure detector/health monitor kicks a member out of the
> distributed system it is for one of these reasons:
> 1. Member really was malfunctioning or unreachable (i.e. something outside of
> health monitoring had a problem)
> a. Network problems
> i. Partition: 2-way, N-way
> ii. Slowdown or error rate increase
> b. CPU was over-taxed in the faulty member. We see gaps on the order of 10
> seconds or more in heartbeat generation on that member.
> i. Geode was running in a virtualized environment and the virtualization
> system didn’t give the Geode process sufficient CPU
> ii. JVM memory was over-utilized so garbage collection (pauses) took too
> long
> iii. There was simply too much CPU demand and the product failed to
> reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to
> be kicked out *but we cannot definitively prove that any of these is the root
> cause*.
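
The heartbeat gaps mentioned under reason 1.b can be looked for mechanically once heartbeat send times have been extracted from a member's logs or statistics. The following is only a minimal sketch of that check: the class HeartbeatGapFinder, its Gap type, and the sample timestamps are hypothetical and are not part of Geode, and the 10-second threshold is simply the figure quoted in the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Apache Geode. */
class HeartbeatGapFinder {

  /** The interval between two consecutive heartbeats, in epoch milliseconds. */
  static final class Gap {
    final long startMillis;
    final long endMillis;

    Gap(long startMillis, long endMillis) {
      this.startMillis = startMillis;
      this.endMillis = endMillis;
    }

    long durationMillis() {
      return endMillis - startMillis;
    }
  }

  /**
   * Returns every interval between consecutive heartbeats that is at least
   * thresholdMillis long. The timestamps are assumed to be sorted ascending.
   */
  static List<Gap> findGaps(long[] sortedTimestampsMillis, long thresholdMillis) {
    List<Gap> gaps = new ArrayList<>();
    for (int i = 1; i < sortedTimestampsMillis.length; i++) {
      long previous = sortedTimestampsMillis[i - 1];
      long current = sortedTimestampsMillis[i];
      if (current - previous >= thresholdMillis) {
        gaps.add(new Gap(previous, current));
      }
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Made-up sample data: heartbeats every 5 s, with one 12 s stall.
    long[] heartbeats = {0L, 5_000L, 10_000L, 22_000L, 27_000L};
    for (Gap gap : findGaps(heartbeats, 10_000L)) {
      System.out.println(
          "gap of " + gap.durationMillis() + " ms starting at " + gap.startMillis);
    }
  }
}
```

A gap found this way only shows that heartbeat generation stalled. It does not by itself distinguish among the causes listed under 1.b, so it would still need to be correlated with GC logs, hypervisor metrics, or CPU statistics for the same time window.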
--
This message was sent by Atlassian Jira
(v8.3.4#803005)