[ 
https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Burcham updated GEODE-8809:
--------------------------------
    Description: 
We see this characteristic failure in a number of proprietary applications:
 * the member stops sending heartbeats
 * the coordinator requests an availability test from the member
 * the member receives the request after a delay
 * the delay causes the server to be kicked out (it receives a 
ForcedDisconnectException)
 * operations fail
 * the server reconnects

Usually when the failure detector/health monitor kicks a member out of the 
distributed system it is for one of these reasons:

1. Member really was malfunctioning or unreachable (i.e. something outside of 
health monitoring had a problem)

  a. Network problems

    i. Partition: 2-way, N-way

    ii. Slowdown or error rate increase

  b. CPU was over-taxed in faulty member. We see gaps on the order of 10s or 
more in heartbeat generation on that member.

    i. Geode was running in a virtualized environment and the virtualization 
system didn’t give the Geode process sufficient CPU

    ii. JVM memory was over-utilized so garbage collection (pauses) took too 
long

    iii. There was simply too much CPU demand and the product failed to reserve 
enough CPU capacity to keep the heartbeat going

This ticket captures situations where the failure detector causes a member to 
be kicked out *but we cannot prove definitively that any of these is the root 
cause*.
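
To check cause (b) against a log, one option is to scan the timestamps at which 
a member generated heartbeats and flag every interval longer than a threshold. 
The sketch below is illustrative only: the class name HeartbeatGaps, the method 
findGaps, and the sample timestamps are hypothetical and are not part of the 
Geode API, and the 10-second threshold is simply the gap size mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Geode class.
public class HeartbeatGaps {

    /**
     * Returns the timestamp (ms) of the last heartbeat before each gap
     * that is strictly longer than thresholdMs.
     * Assumes heartbeatMillis is sorted in ascending order.
     */
    static List<Long> findGaps(long[] heartbeatMillis, long thresholdMs) {
        List<Long> gapStarts = new ArrayList<>();
        for (int i = 1; i < heartbeatMillis.length; i++) {
            long interval = heartbeatMillis[i] - heartbeatMillis[i - 1];
            if (interval > thresholdMs) {
                gapStarts.add(heartbeatMillis[i - 1]);
            }
        }
        return gapStarts;
    }

    public static void main(String[] args) {
        // Made-up heartbeat times: one 12 s gap between 10 s and 22 s.
        long[] beats = {0, 5_000, 10_000, 22_000, 27_000};
        System.out.println(findGaps(beats, 10_000)); // prints [10000]
    }
}
```

A gap reported here only shows that heartbeat generation stalled; it does not 
say whether virtualization, garbage collection, or CPU demand was responsible.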

  was:
 * the member stops sending heartbeats
 * the coordinator requests an availability test from the member
 * the member receives the request after a delay
 * the delay causes the server to be kicked out
 * operations fail
 * the server reconnects

Usually when the failure detector/health monitor kicks a member out of the 
distributed system it is for one of these reasons:

1. Member really was malfunctioning or unreachable (i.e. something outside of 
health monitoring had a problem)

  a. Network problems

    i. Partition: 2-way, N-way

    ii. Slowdown or error rate increase

  b. CPU was over-taxed in faulty member. We see gaps on the order of 10s or 
more in heartbeat generation on that member.

    i. Geode was running in a virtualized environment and the virtualization 
system didn’t give the Geode process sufficient CPU

    ii. JVM memory was over-utilized so garbage collection (pauses) took too 
long

    iii. There was simply too much CPU demand and the product failed to reserve 
enough CPU capacity to keep the heartbeat going


This ticket captures situations where the failure detector causes a member to 
be kicked out *but we cannot prove definitively that any of these is the root 
cause*.


> Servers are missing heartbeats from a member
> --------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: blocks-1.14.0
>
> We see this characteristic failure in a number of proprietary applications:
>  * the member stops sending heartbeats
>  * the coordinator requests an availability test from the member
>  * the member receives the request after a delay
>  * the delay causes the server to be kicked out (it receives a 
> ForcedDisconnectException)
>  * operations fail
>  * the server reconnects
> Usually when the failure detector/health monitor kicks a member out of the 
> distributed system it is for one of these reasons:
> 1. Member really was malfunctioning or unreachable (i.e. something outside of 
> health monitoring had a problem)
>   a. Network problems
>     i. Partition: 2-way, N-way
>     ii. Slowdown or error rate increase
>   b. CPU was over-taxed in faulty member. We see gaps on the order of 10s or 
> more in heartbeat generation on that member.
>     i. Geode was running in a virtualized environment and the virtualization 
> system didn’t give the Geode process sufficient CPU
>     ii. JVM memory was over-utilized so garbage collection (pauses) took too 
> long
>     iii. There was simply too much CPU demand and the product failed to 
> reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to 
> be kicked out *but we cannot prove definitively that any of these is the root 
> cause*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
