Thanks Jonathan,

I really don't know. I ran further tests overnight to capture jstack
output on the receiving side, and I'm going through those stacks now.
If I can't find anything suspicious, I'll add the same debugging to
the sending side too.
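
For the sending side, the debugging could be as simple as timing the gap
between successive calls. Below is a minimal sketch, assuming a hypothetical
helper class (GapLogger is my own illustration, not part of Cassandra) that
a method such as notifyFailureDetector() would call on entry:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Cassandra: tracks gaps between successive calls. */
class GapLogger
{
    private final long thresholdMillis;
    private final AtomicLong lastCallMillis = new AtomicLong(-1L);
    private final AtomicInteger longGaps = new AtomicInteger(0);

    GapLogger(long thresholdMillis)
    {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * Records a call made at nowMillis.
     * Returns the gap in ms since the previous call, or -1 on the first call.
     */
    long mark(long nowMillis)
    {
        long previous = lastCallMillis.getAndSet(nowMillis);
        if (previous < 0)
            return -1L;
        long gap = nowMillis - previous;
        if (gap > thresholdMillis)
        {
            // A gap above the threshold is counted and reported on stderr.
            longGaps.incrementAndGet();
            System.err.println("gap of " + gap + " ms since previous call");
        }
        return gap;
    }

    /** Number of gaps seen so far that exceeded the threshold. */
    int longGapCount()
    {
        return longGaps.get();
    }
}
```

Calling mark(System.currentTimeMillis()) at the top of the method would flag
a long silence directly, instead of having to spot it by eye in the debug log.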

Another useful piece of info: when I ran a single-node setup, I also
saw a lot of TimedOutExceptions, similar to what I see with the
2-node setup. I didn't see the UnavailableException there, probably
because with just a single node, the node always believes itself
to be available.

I don't think GC is the culprit here, since I don't see any lengthy
GC logging around the time the delay happens. There is no
compaction or flushing at that time either.
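
As a cross-check on the GC logs, the JVM's own counters can be sampled
around a suspect interval. This is only a rough sketch using the standard
java.lang.management API; the class name GcPauseCheck and its methods are
hypothetical, and it reports accumulated collection time, not individual
pause lengths:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reads accumulated GC time from the JVM's MXBeans. */
class GcPauseCheck
{
    /** Sum of accumulated collection time (ms) over all collectors. */
    static long totalGcMillis()
    {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
        {
            long t = gc.getCollectionTime();
            // getCollectionTime() returns -1 when the value is undefined; skip those.
            if (t > 0)
                total += t;
        }
        return total;
    }

    /** GC time (ms) accumulated while the given work was running. */
    static long gcMillisDuring(Runnable work)
    {
        long before = totalGcMillis();
        work.run();
        return totalGcMillis() - before;
    }
}
```

If the difference across a 3-minute silence is only a few milliseconds, that
would back up the reading of the GC logs.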



On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> What makes you think the problem is on the receiving node, rather than
> the sending node?
>
> On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy...@gmail.com> wrote:
>> I constantly see TimedOutException, followed by
>> UnavailableException, in my logs,
>> so I added some extra debugging to Gossiper.notifyFailureDetector():
>>
>>
>>
>>    void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
>>    {
>>        IFailureDetector fd = FailureDetector.instance;
>>        EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>        logger.debug("notify failure detector");
>>        /*
>>         * If the local endpoint state exists then report to the FD only
>>         * if the versions workout.
>>         */
>>        if ( localEndpointState != null )
>>        {
>>            logger.debug("notify failure detector, endpoint");
>>            int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
>>            int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
>>            if ( remoteGeneration > localGeneration )
>>            {
>>                localEndpointState.updateTimestamp();
>>                logger.debug("notify failure detector --- report 1");
>>                fd.report(endpoint);
>>                return;
>>            }
>>
>>
>>
>>
>> Then I found that this method stopped being called for a period of 3
>> minutes, so of course the failure detector considers the other side
>> to be dead.
>>
>> But since these 2 boxes are in the same EC2 region and the same security
>> group, there is no reason for a network issue to last that long. So I
>> ran a background job that just does
>>
>> echo | nc $the_other_box 7000   in a loop
>>
>> and this always works fine; it never fails to reach port 7000.
>>
>>
>> So somehow the messages were not delivered or received. How could I
>> debug this?
>> (extra logging attached)
>>
>> Thanks
>> Yang
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
