Thanks Jonathan,

I really don't know. I ran further tests overnight to capture jstack output on the receiving side, and I'm going through those stacks now. If I can't find anything suspicious, I'll add the same debugging to the sending side too.

Another useful piece of info: when I ran a single-node setup, I also saw a lot of TimedOutException, similar to what I see with the 2-node setup. I think I didn't see UnavailableException there, probably because with a single node, the node always believes itself to be available.

I don't think GC is the culprit here, since I don't see any lengthy GC logging around the time the delay happens. There is no compaction or flushing either.

On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> What makes you think the problem is on the receiving node, rather than
> the sending node?
>
> On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy...@gmail.com> wrote:
>> I constantly see TimedOutException, then followed by
>> UnavailableException in my logs,
>> so I added some extra debugging to Gossiper.notifyFailureDetector()
>>
>>
>>
>>     void notifyFailureDetector(InetAddress endpoint, EndpointState
>> remoteEndpointState)
>>     {
>>         IFailureDetector fd = FailureDetector.instance;
>>         EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>         logger.debug("notify failure detector");
>>         /*
>>          * If the local endpoint state exists then report to the FD only
>>          * if the versions workout.
>>          */
>>         if ( localEndpointState != null )
>>         {
>>             logger.debug("notify failure detector, endpoint");
>>             int localGeneration =
>> localEndpointState.getHeartBeatState().getGeneration();
>>             int remoteGeneration =
>> remoteEndpointState.getHeartBeatState().getGeneration();
>>             if ( remoteGeneration > localGeneration )
>>             {
>>                 localEndpointState.updateTimestamp();
>>                 logger.debug("notify failure detector --- report 1");
>>                 fd.report(endpoint);
>>                 return;
>>             }
>>
>>
>>
>>
>> then I found that this method stopped being called for a period of 3
>> minutes, so of course the detector considers the other side to be
>> dead.
>>
>> but since these 2 boxes are in the same EC2 region, same security
>> group, there is no reason there is a network issue that long. so I
>> ran a background job that just does
>>
>> echo | nc $the_other_box 7000   in a loop
>>
>> and this always works fine, without failing to contact the 7000 port.
>>
>>
>> so somehow the messages were not delivered or received, how could I debug
>> this?
>> (extra logging attached)
>>
>> Thanks
>> Yang
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
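
P.S. For reference, a minimal sketch of how the gap between successive notifyFailureDetector() calls could be measured directly, rather than inferred from log timestamps. HeartbeatGapWatcher and its methods are hypothetical names for illustration and are not part of Cassandra. The sketch assumes one watcher per endpoint and that the caller supplies the current time in milliseconds.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper (not a Cassandra class): tracks the interval between
// successive heartbeat reports for a single endpoint.
final class HeartbeatGapWatcher
{
    // Sentinel meaning "no report has been recorded yet".
    private static final long NEVER = Long.MIN_VALUE;

    private final long warnThresholdMillis;
    private final AtomicLong lastReportMillis = new AtomicLong(NEVER);

    HeartbeatGapWatcher(long warnThresholdMillis)
    {
        this.warnThresholdMillis = warnThresholdMillis;
    }

    // Records a report at nowMillis and returns the gap since the previous
    // report in milliseconds, or -1 if this is the first report.
    long record(long nowMillis)
    {
        long previous = lastReportMillis.getAndSet(nowMillis);
        return previous == NEVER ? -1L : nowMillis - previous;
    }

    // True when the gap is longer than the configured warning threshold.
    boolean isSuspicious(long gapMillis)
    {
        return gapMillis > warnThresholdMillis;
    }
}
```

One possible use would be to call record(System.currentTimeMillis()) at the top of notifyFailureDetector() and log the returned gap whenever isSuspicious() is true, so a 3-minute silence shows up as a single log line with the exact duration.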