What makes you think the problem is on the receiving node, rather than the sending node?
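One way to tell the two cases apart is to record the interval between gossip events on both nodes and compare the timestamps around the silent period. The following is only a rough sketch: GossipGapTracker, its methods, and the threshold are hypothetical names made up for illustration and are not part of Cassandra.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Cassandra: remembers when each endpoint
// was last seen and reports the silence that preceded the current event.
final class GossipGapTracker {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long warnThresholdMillis;

    GossipGapTracker(long warnThresholdMillis) {
        this.warnThresholdMillis = warnThresholdMillis;
    }

    // Returns the gap in milliseconds since the previous event for this
    // endpoint, or -1 if this is the first event recorded for it.
    long record(String endpoint, long nowMillis) {
        Long previous = lastSeenMillis.put(endpoint, nowMillis);
        return previous == null ? -1L : nowMillis - previous;
    }

    // True when the gap is longer than the configured threshold.
    boolean isSuspicious(long gapMillis) {
        return gapMillis > warnThresholdMillis;
    }
}
```

Calling record(endpoint.toString(), System.currentTimeMillis()) at the top of notifyFailureDetector() on the receiving node, and from wherever the sending node emits its gossip messages, would show which side actually went quiet. A long gap only on the receiver points at delivery or receipt; a matching gap on the sender means the messages were never sent.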
On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy...@gmail.com> wrote:
> I constantly see TimedOutException, then followed by
> UnavailableException in my logs,
> so I added some extra debugging to Gossiper.notifyFailureDetector()
>
>
> void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
> {
>     IFailureDetector fd = FailureDetector.instance;
>     EndpointState localEndpointState = endpointStateMap.get(endpoint);
>     logger.debug("notify failure detector");
>     /*
>      * If the local endpoint state exists then report to the FD only
>      * if the versions workout.
>      */
>     if ( localEndpointState != null )
>     {
>         logger.debug("notify failure detector, endpoint");
>         int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
>         int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
>         if ( remoteGeneration > localGeneration )
>         {
>             localEndpointState.updateTimestamp();
>             logger.debug("notify failure detector --- report 1");
>             fd.report(endpoint);
>             return;
>         }
>
>
> then I found that this method stopped being called for a period of 3
> minutes, so of course the detector considers the other side to be
> dead.
>
> but since these 2 boxes are in the same EC2 region, same security
> group, there is no reason there is a network issue that long. so I
> ran a background job that just does
>
>     echo | nc $the_other_box 7000
>
> in a loop, and this always works fine, without failing to contact the 7000 port.
>
>
> so somehow the messages were not delivered or received, how could I debug
> this?
> (extra logging attached)
>
> Thanks
> Yang
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com