> On Oct. 1, 2015, 1:30 a.m., anilkumar gingade wrote:
> > gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java, line 1327
> > <https://reviews.apache.org/r/38912/diff/1/?file=1088150#file1088150line1327>
> >
> > we are passing the same value for both the arguments... Is this expected?

I will change this to (this.forcedDisconnect, preparingForReconnect, false)


- Bruce


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38912/#review101184
-----------------------------------------------------------


On Sept. 30, 2015, 11:59 p.m., Bruce Schuchardt wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38912/
> -----------------------------------------------------------
>
> (Updated Sept. 30, 2015, 11:59 p.m.)
>
>
> Review request for geode, anilkumar gingade, Jason Huynh, Jianxia Chen, and
> Lynn Gallinat.
>
>
> Repository: geode
>
>
> Description
> -------
>
> Network failure handling was not properly shutting down TCPConduit, leaving
> threads hanging trying to send messages. The shutdown code was calling
> Services.emergencyClose too soon, and the recursion back into
> GMSMembershipManager shutdown code caused some problems, too.
>
> GMSHealthMonitor was continually switching between two members to watch even
> though it had already sent suspect messages about them and had received no
> response. I added a collection of IDs that are in this state and modified
> setNextNeighbor to avoid reusing them.
>
> GMSHealthMonitor was sending removeMember messages to the locators and a
> random member, but for some reason this wasn't resolving a network partition
> fast enough. I've disabled that behavior for now, sending the messages to
> all members. This needs to be revisited because sending the message to all
> members is not scalable.
>
> GMSHealthMonitor had some issues with initiating removals when it was in the
> process of shutting down. I added some isStopping checks to fix this.
>
> MembershipJUnitTest and StatRecorderJUnitTest were failing in gradle runs but
> not under Eclipse because my Eclipse launch configuration wasn't set to
> enable assertions. After fixing that I found a number of problems with these
> tests and fixed them.
>
> Multicast tests are now implemented in GMSMembershipManager and
> JGroupsMessenger. This leverages the ping/pong messaging added for the
> quorum checker.
>
> GMSJoinLeave was too slow in sending out new views when there were process
> failures. I added code to inform the reply processor if there are queued
> leave/remove requests so it wouldn't wait for these, and also added similar
> checks in the removeHealthyMembers method (which performs checks on members
> using the HealthMonitor).
>
> When there is a network partition GMSJoinLeave will now send a
> NetworkPartitionMessage to other members to prod them along in figuring out
> that they should shut down.
>
> During a forced-disconnect there can be a lot of warning/fatal log messages.
> If there are alert listeners in the system this can create a lot of network
> traffic and extra work figuring out whether the receiver is even there or
> not. GMSMembershipManager now throws away outbound alerts when a
> forced-disconnect is in process.
>
> Some of the forced-disconnect shutdown processing has been moved out of the
> membership manager's DisconnectThread that was introduced with the quorum
> checker in order to set the shutdown cause, etc, as quickly as possible.
>
> I noticed a lot of TXState log messages at debug level with a Throwable stack
> trace. There was no comment saying why this was being done so I commented it
> out.
>
> JGroups logging level is now set to FATAL by default. The default log level
> was a problem during network partitions because each message send was causing
> a dire warning to be logged.
>
> I observed a number of threads being left behind when a locator failed to
> start during auto-reconnect testing.
> I added a unit test to LocatorJUnitTest for this and fixed the leaks.
>
>
> Diffs
> -----
>
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalDistributedSystem.java c3929c007ea69b15759b5b8480a32e3294cd6d73
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/InternalLocator.java 6ea54e2a124410fedb8156a3757b79ea3de52174
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/NetView.java 65fe913b8200e18249334d1e55acf7a67455c247
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/Services.java acd2bedfa9583a37446712d08ef04671f291378a
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/fd/GMSHealthMonitor.java f12628aeaa9a5874da8a09db846b4dc653978f99
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/interfaces/Messenger.java b154403ce12ff87576c0f7ca01732b1377f9712b
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeave.java 7b6b97df54148985ed6154823eefcf7d3ca82c23
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/NetworkPartitionMessage.java PRE-CREATION
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messages/SuspectMembersMessage.java 117f440325ceab7131c4f5e153f32105a55b7b09
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/JGroupsMessenger.java c1acb87cc184447dbd1879d2c4a569c7a8093dda
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/messenger/StatRecorder.java 1fef0daec35ab999829f58fc44da03851a852b7f
>   gemfire-core/src/main/java/com/gemstone/gemfire/distributed/internal/membership/gms/mgr/GMSMembershipManager.java 64dd1cd5de028b296f0fd6bf33e02ffbf672cf6e
>   gemfire-core/src/main/java/com/gemstone/gemfire/internal/DSFIDFactory.java a743c8a9f2d227143f04081b11c4a42d9dcb61c2
>   gemfire-core/src/main/java/com/gemstone/gemfire/internal/DataSerializableFixedID.java 39fdeef81856d5ff128ed6ea050d4afbc3a612f7
>   gemfire-core/src/main/java/com/gemstone/gemfire/internal/cache/TXState.java 2672323cc89c8266df943de4dc444984c66ca3af
>   gemfire-core/src/main/resources/com/gemstone/gemfire/internal/logging/log4j/log4j2-default.xml 8b1331ffda0ff7a3a1878ac491f9e394821f8ec1
>   gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorDUnitTest.java afb4687d8d75b6f36f2c6900352c4d51b13b28c0
>   gemfire-core/src/test/java/com/gemstone/gemfire/distributed/LocatorJUnitTest.java 5a09b5589c63a8ac9e9b4883925ef3627e2066a9
>   gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/MembershipJUnitTest.java f7683f9d0c4a1ca1bfd451fd9d0b7fcdc37c10ad
>   gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/GMSJoinLeaveJUnitTest.java 0af47a7904a85bd5c3efa98f1a398a43486d425f
>   gemfire-core/src/test/java/com/gemstone/gemfire/distributed/internal/membership/gms/membership/StatRecorderJUnitTest.java fb502908b7c1bc7a32dfb367d1cdad56997305bb
>   gemfire-core/src/test/java/com/gemstone/gemfire/internal/cache/partitioned/Bug43684DUnitTest.java 9722311b4a13f90c94dc63d9eef3091c77d81ad8
>
> Diff: https://reviews.apache.org/r/38912/diff/
>
>
> Testing
> -------
>
> precheckin, 3-host network partition testing
>
>
> Thanks,
>
> Bruce Schuchardt
>
>
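
For readers following the GMSHealthMonitor item in the quoted description (keeping a collection of already-suspected member IDs and having setNextNeighbor avoid them), the idea might look roughly like the sketch below. This is not the code from the diff: the class name NeighborPicker, the method nextNeighbor, and the use of plain strings as member IDs are all hypothetical, chosen only to illustrate skipping suspected members when walking the view.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not the GMSHealthMonitor implementation.
public class NeighborPicker {

    /**
     * Walks the view in order, starting just after self, and returns the first
     * member that is not in the suspected set. Returns null when self is not in
     * the view or when every other member is already suspected.
     */
    static String nextNeighbor(List<String> view, String self, Set<String> suspected) {
        int start = view.indexOf(self);
        if (start < 0) {
            return null;
        }
        // i runs from 1 to size-1, so self is never chosen as its own neighbor.
        for (int i = 1; i < view.size(); i++) {
            String candidate = view.get((start + i) % view.size());
            if (!suspected.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("a", "b", "c", "d");
        // Assume suspect messages were already sent about b and c with no response.
        Set<String> suspected = new HashSet<>(Arrays.asList("b", "c"));
        System.out.println(nextNeighbor(view, "a", suspected)); // prints d
    }
}
```

Without the suspected set, a monitor that only moves to "the next member" could bounce between b and c indefinitely, which is the symptom the description reports.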
