-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38912/
-----------------------------------------------------------

Review request for geode, anilkumar gingade, Jason Huynh, Jianxia Chen, and 
Lynn Gallinat.


Repository: geode


Description
-------

Network failure handling was not properly shutting down TCPConduit, leaving 
threads hanging trying to send messages.  The shutdown code was calling 
Services.emergencyClose too soon, and the recursion back into 
GMSMembershipManager shutdown code caused some problems, too.

GMSHealthMonitor was continually switching between two members to watch even 
though it had already sent suspect messages about them and had received no 
response.  I added a collection of IDs that are in this state and modified 
setNextNeighbor to avoid reusing them.
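
For readers who have not looked at the diff, the neighbor-selection change can 
be pictured roughly as follows.  This is only a sketch under stated 
assumptions: NeighborSelector, markSuspected, clearSuspected and nextNeighbor 
are hypothetical names, member IDs are plain strings, and the view is treated 
as a simple ring.  The real setNextNeighbor in GMSHealthMonitor may differ.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and types are illustrative, not Geode's real API.
final class NeighborSelector {
    // IDs that were reported as suspect and have not responded since.
    private final Set<String> suspectedWithoutResponse = new HashSet<>();

    void markSuspected(String memberId) {
        suspectedWithoutResponse.add(memberId);
    }

    // A member that responds again becomes eligible to be watched.
    void clearSuspected(String memberId) {
        suspectedWithoutResponse.remove(memberId);
    }

    // Walk the view in ring order starting after localId and skip suspected
    // members. Returns null when no eligible neighbor exists.
    String nextNeighbor(List<String> view, String localId) {
        int start = view.indexOf(localId);
        if (start < 0) {
            return null;
        }
        for (int i = 1; i < view.size(); i++) {
            String candidate = view.get((start + i) % view.size());
            if (!suspectedWithoutResponse.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }
}
```

Skipping already-suspected IDs keeps the monitor from bouncing between the 
same unresponsive members.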

GMSHealthMonitor was sending removeMember messages to the locators and a random 
member, but for some reason this wasn't resolving a network partition fast 
enough.  I've disabled that behavior for now, sending the messages to all 
members.  This needs to be revisited because sending the message to all members 
is not scalable.

GMSHealthMonitor had some issues with initiating removals when it was in the 
process of shutting down.  I added some isStopping checks to fix this.

MembershipJUnitTest and StatRecorderJUnitTest were failing in gradle runs but 
not under Eclipse because my Eclipse launch configuration wasn't set to enable 
assertions.  After fixing that I found a number of problems with these tests 
and fixed them.
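
As an aside, a launch configuration that silently runs without -ea can be 
detected with a standard Java idiom.  The sketch below is not part of this 
change, and AssertionCheck is a hypothetical helper name.

```java
// Hypothetical helper; not part of the change under review.
final class AssertionCheck {
    // Returns true only when assertions are enabled for this class,
    // for example when the JVM was started with -ea.
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is a side effect that runs only if asserts are on.
        assert enabled = true;
        return enabled;
    }
}
```

A test fixture could call AssertionCheck.assertionsEnabled() once and fail 
fast if it returns false.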

Multicast tests are now implemented in GMSMembershipManager and 
JGroupsMessenger.  This leverages the ping/pong messaging added for the quorum 
checker.

GMSJoinLeave was too slow in sending out new views when there were process 
failures.  I added code to inform the reply processor if there are queued 
leave/remove requests so it wouldn't wait for these, and also added similar 
checks in the removeHealthyMembers method (which performs checks on members 
using the HealthMonitor).
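
The idea of not waiting on members that already have queued leave/remove 
requests can be sketched as below.  ViewReplyTracker and its methods are 
hypothetical names chosen for illustration, and member IDs are plain strings; 
the actual reply processor in GMSJoinLeave is more involved.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the reply processor used in GMSJoinLeave.
final class ViewReplyTracker {
    // Members we are still waiting to hear from.
    private final Set<String> pending = new HashSet<>();

    ViewReplyTracker(Set<String> recipients) {
        pending.addAll(recipients);
    }

    synchronized void replyReceived(String memberId) {
        pending.remove(memberId);
    }

    // Members with queued leave/remove requests are not expected to reply,
    // so stop waiting for them.
    synchronized void membersLeaving(Set<String> leavingOrRemoved) {
        pending.removeAll(leavingOrRemoved);
    }

    synchronized boolean isDone() {
        return pending.isEmpty();
    }

    // Returns a copy so callers cannot modify the tracker's state.
    synchronized Set<String> stillWaitingFor() {
        return new HashSet<>(pending);
    }
}
```

With this shape, a new view can go out as soon as every member that is 
actually expected to answer has done so.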

When there is a network partition GMSJoinLeave will now send a 
NetworkPartitionMessage to other members to prod them along in figuring out 
that they should shut down.

During a forced-disconnect there can be a lot of warning/fatal log messages.  
If there are alert listeners in the system this can create a lot of network 
traffic and extra work figuring out whether the receiver is even there or not.  
GMSMembershipManager now throws away outbound alerts when a forced-disconnect 
is in process.
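
A minimal sketch of the alert-dropping guard, assuming a simple flag that is 
set when the forced-disconnect begins.  AlertSender, beginForcedDisconnect and 
sendAlert are hypothetical names; GMSMembershipManager's real code is not 
shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; not GMSMembershipManager's actual implementation.
final class AlertSender {
    private final AtomicBoolean forcedDisconnectInProgress = new AtomicBoolean(false);
    // Stand-in for the network: records the alerts that would have been sent.
    private final List<String> sent = new ArrayList<>();

    void beginForcedDisconnect() {
        forcedDisconnectInProgress.set(true);
    }

    // Returns true if the alert was sent, false if it was thrown away.
    boolean sendAlert(String alert) {
        if (forcedDisconnectInProgress.get()) {
            return false; // drop outbound alerts while shutting down
        }
        sent.add(alert);
        return true;
    }

    List<String> sentAlerts() {
        return sent;
    }
}
```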

Some of the forced-disconnect shutdown processing has been moved out of the 
membership manager's DisconnectThread (introduced with the quorum checker) so 
that the shutdown cause, etc., is set as quickly as possible.

I noticed a lot of TXState log messages at debug level with a Throwable stack 
trace.  There was no comment saying why this was being done so I commented it 
out.

JGroups logging level is now set to FATAL by default.  The default log level 
was a problem during network partitions because each message send was causing a 
dire warning to be logged.

I observed a number of threads being left behind when a locator failed to start 
during auto-reconnect testing.  I added a unit test to LocatorJUnitTest for 
this and fixed the leaks.


Diffs
-----

  
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java c3929c007ea69b15759b5b8480a32e3294cd6d73 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalLocator.java 6ea54e2a124410fedb8156a3757b79ea3de52174 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/NetView.java 65fe913b8200e18249334d1e55acf7a67455c247 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/Services.java acd2bedfa9583a37446712d08ef04671f291378a 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/fd/GMSHealthMonitor.java f12628aeaa9a5874da8a09db846b4dc653978f99 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/interfaces/Messenger.java b154403ce12ff87576c0f7ca01732b1377f9712b 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeave.java 7b6b97df54148985ed6154823eefcf7d3ca82c23 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/NetworkPartitionMessage.java PRE-CREATION 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/SuspectMembersMessage.java 117f440325ceab7131c4f5e153f32105a55b7b09 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/JGroupsMessenger.java c1acb87cc184447dbd1879d2c4a569c7a8093dda 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/StatRecorder.java 1fef0daec35ab999829f58fc44da03851a852b7f 
  gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/mgr/GMSMembershipManager.java 64dd1cd5de028b296f0fd6bf33e02ffbf672cf6e 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/DSFIDFactory.java a743c8a9f2d227143f04081b11c4a42d9dcb61c2 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/DataSerializableFixedID.java 39fdeef81856d5ff128ed6ea050d4afbc3a612f7 
  gemfire-core/src/main/java/com/gemstone/gemfire/internal/cache/TXState.java 2672323cc89c8266df943de4dc444984c66ca3af 
  gemfire-core/src/main/resources/com/gemstone/gemfire/internal/logging/log4j/log4j2-default.xml 8b1331ffda0ff7a3a1878ac491f9e394821f8ec1 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorDUnitTest.java afb4687d8d75b6f36f2c6900352c4d51b13b28c0 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorJUnitTest.java 5a09b5589c63a8ac9e9b4883925ef3627e2066a9 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/MembershipJUnitTest.java f7683f9d0c4a1ca1bfd451fd9d0b7fcdc37c10ad 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java 0af47a7904a85bd5c3efa98f1a398a43486d425f 
  gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/StatRecorderJUnitTest.java fb502908b7c1bc7a32dfb367d1cdad56997305bb 
  gemfire-core/src/test/java/com/gemstone/gemfire/internal/cache/partitioned/Bug43684DUnitTest.java 9722311b4a13f90c94dc63d9eef3091c77d81ad8 

Diff: https://reviews.apache.org/r/38912/diff/


Testing
-------

precheckin, 3-host network partition testing


Thanks,

Bruce Schuchardt
