I constantly see TimedOutException, followed by UnavailableException, in my logs, so I added some extra debugging to Gossiper.notifyFailureDetector():
    void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
    {
        IFailureDetector fd = FailureDetector.instance;
        EndpointState localEndpointState = endpointStateMap.get(endpoint);
        logger.debug("notify failure detector");
        /*
         * If the local endpoint state exists then report to the FD only
         * if the versions workout.
         */
        if ( localEndpointState != null )
        {
            logger.debug("notify failure detector, endpoint");
            int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
            int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
            if ( remoteGeneration > localGeneration )
            {
                localEndpointState.updateTimestamp();
                logger.debug("notify failure detector --- report 1");
                fd.report(endpoint);
                return;
            }

Then I found that this method stopped being called for a period of 3 minutes, so of course the failure detector considers the other side to be dead. But since these 2 boxes are in the same EC2 region and the same security group, there is no reason for a network issue to last that long.

So I ran a background job that just does

    echo | nc $the_other_box 7000

in a loop, and this always works fine: it never fails to contact port 7000.

So somehow the messages were not delivered or received. How could I debug this? (Extra logging attached.)

Thanks,
Yang
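
One way to put exact numbers on the "stopped being called for 3 minutes" observation would be to record the interval between consecutive reports per endpoint, rather than reading it off the debug log by eye. The following is only a sketch: ReportGapTracker, record() and isSuspicious() are hypothetical names and are not part of Cassandra, and the threshold value is an arbitrary assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of Cassandra): remembers when each endpoint
 * was last reported and returns the gap between consecutive reports.
 */
public class ReportGapTracker {
    // Last report time per endpoint, in milliseconds.
    private final ConcurrentMap<String, Long> lastSeenMillis =
            new ConcurrentHashMap<String, Long>();
    // Gaps longer than this are considered worth logging (assumed value).
    private final long thresholdMillis;

    public ReportGapTracker(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Records a report for the endpoint at nowMillis.
     * Returns the gap since the previous report, or -1 for the first report.
     */
    public long record(String endpoint, long nowMillis) {
        Long previous = lastSeenMillis.put(endpoint, nowMillis);
        return previous == null ? -1L : nowMillis - previous;
    }

    /** True if the gap exceeds the configured threshold. */
    public boolean isSuspicious(long gapMillis) {
        return gapMillis > thresholdMillis;
    }
}
```

Assuming it were called next to the existing debug statements with endpoint.toString() and System.currentTimeMillis(), any gap flagged by isSuspicious() could be logged at a higher level, which would show whether the gaps line up with the TimedOutException entries.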