I constantly see TimedOutException followed by
UnavailableException in my logs,
so I added some extra debug logging to Gossiper.notifyFailureDetector():



    void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
    {
        IFailureDetector fd = FailureDetector.instance;
        EndpointState localEndpointState = endpointStateMap.get(endpoint);
        logger.debug("notify failure detector");
        /*
         * If the local endpoint state exists then report to the FD only
         * if the versions workout.
         */
        if ( localEndpointState != null )
        {
            logger.debug("notify failure detector, endpoint");
            int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
            int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
            if ( remoteGeneration > localGeneration )
            {
                localEndpointState.updateTimestamp();
                logger.debug("notify failure detector --- report 1");
                fd.report(endpoint);
                return;
            }




Then I found that this method stopped being called for a period of 3
minutes, so of course the failure detector considers the other side to be
dead.
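
One way to make such gaps stand out in the logs is to record when each endpoint was last reported and print the elapsed time on every call. The following is only a minimal sketch: HeartbeatGapTracker is a hypothetical helper, not part of Cassandra, and the 10-second threshold is an arbitrary assumption. Endpoints are keyed by a plain string to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Cassandra class): remembers when each endpoint
// was last reported and flags unusually long silences.
class HeartbeatGapTracker
{
    private final Map<String, Long> lastSeenMillis = new HashMap<String, Long>();
    private final long warnThresholdMillis;

    HeartbeatGapTracker(long warnThresholdMillis)
    {
        this.warnThresholdMillis = warnThresholdMillis;
    }

    // Records a report at nowMillis and returns the gap since the previous
    // report for the same endpoint, or -1 if this is the first report.
    synchronized long record(String endpoint, long nowMillis)
    {
        Long previous = lastSeenMillis.put(endpoint, nowMillis);
        return previous == null ? -1L : nowMillis - previous;
    }

    // True if the gap is longer than the configured threshold.
    boolean isSuspicious(long gapMillis)
    {
        return gapMillis > warnThresholdMillis;
    }

    public static void main(String[] args)
    {
        // Assumed threshold of 10 seconds; the address is a placeholder.
        HeartbeatGapTracker tracker = new HeartbeatGapTracker(10000L);
        tracker.record("10.0.0.2", 0L);
        long gap = tracker.record("10.0.0.2", 180000L); // 3 minutes later
        System.out.println("gap=" + gap + "ms suspicious=" + tracker.isSuspicious(gap));
    }
}
```

In the debugging code, record() would be called with endpoint.toString() and System.currentTimeMillis() at the top of notifyFailureDetector(), and the returned gap written to the debug log.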

But since these 2 boxes are in the same EC2 region and the same security
group, there is no reason for a network issue to last that long. So I
ran a background job that just runs the following in a loop:

    echo | nc $the_other_box 7000

This always works fine and never fails to contact port 7000.
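
The same connectivity check can be sketched in Java so that each attempt can be timestamped and lined up against the gossip debug log. This is an illustration only: PortProbe is a hypothetical class name and the timeout is an assumed parameter. Like the nc loop, it only shows that a TCP connection to the port can be opened, not that gossip messages are actually exchanged.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical probe: tries to open a TCP connection and reports how long it took.
class PortProbe
{
    // Returns the connect time in milliseconds, or -1 if the connection
    // fails or does not complete within timeoutMillis.
    static long probe(String host, int port, int timeoutMillis)
    {
        long start = System.nanoTime();
        Socket socket = new Socket();
        try
        {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1000000L;
        }
        catch (IOException e)
        {
            return -1L;
        }
        finally
        {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }
}
```

Calling PortProbe.probe(host, 7000, 2000) once per second and logging the result next to System.currentTimeMillis() would show whether connect failures or slow connects coincide with the 3-minute silences.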


So somehow the messages were not delivered or received. How can I debug this?
(Extra logging attached.)

Thanks
Yang

Attachment: ss
Description: Binary data
