Hmm, never mind: possibly the first 24-second delay was caused by GC.
The GC logging was not printed in system.log, but I found one line on
stdout that possibly corresponds to it.


I also found that I left out the parallel-remark param; let me add
that and retry.
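
For reference, the params I'm planning to add look roughly like this
(cassandra-env.sh style; the exact gc.log path is just an example, and
I'm assuming the CMS collector here):

    # enable parallel remark, and send GC details to a file so long
    # pauses are easy to spot instead of hiding on stdout
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
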
Thanks

Yang

On Mon, Sep 26, 2011 at 12:25 AM, Yang <teddyyyy...@gmail.com> wrote:
> ok, I got more detailed logging that seems to give some clue about
> this mystery ...
>
> I enabled TRACE-level logging for Gossiper, and also added a debug
> line to MessageDeliveryTask so that every arriving msg shows up in
> the log.
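>
> (The extra debug is just one line near the top of
> MessageDeliveryTask.run(), roughly:
>
>     logger.debug("delivering msg");
>
> so each "delivering msg" line below marks one incoming message being
> handled.)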
>
> I have 2 nodes, and all the requests come from the coordinator node.
> Basically I see 2 possible issues.
>
> 1)
> Near the top of bad_coord.txt, we see that the coordinator was
> receiving messages correctly until
>
> DEBUG [RequestResponseStage:22] 2011-09-26 05:43:35,521
> MessageDeliveryTask.java (line 60) delivering msg
>
>
> then the line right after it is
> DEBUG [RequestResponseStage:24] 2011-09-26 05:43:59,516
> MessageDeliveryTask.java (line 60) delivering msg
>
> So there is a 24-second gap (05:43:35,521 to 05:43:59,516) during
> which no messages were delivered on the coordinator. That is enough
> to trigger a timeout and an unavailable error, and indeed we see
> TimedOutException and UnavailableException later in the log.
>
>
> During this period, on the replica box, bad_replica.txt shows correct
> message delivery up to
> DEBUG [ReadStage:107] 2011-09-26 05:43:35,214 MessageDeliveryTask.java
> (line 60) delivering msg
>
> then on the next line, it starts doing something different:
> TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 125)
> My heartbeat is now 5721
> TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 357)
> Gossip Digests are : /10.71.13.5:1317009996:5765
> ec2-50-19-172-99.compute-1.amazonaws.com/10.71.21.46:1317010027:5721
>
>
> I have not read the code in detail to understand what this means, but
> time-wise it seems to correspond with the "silent period" on the
> coordinator. It keeps doing this until
>  INFO [GossipTasks:1] 2011-09-26 05:43:53,638 Gossiper.java (line 721)
> InetAddress /10.71.13.5 is now dead.
>
> and
> DEBUG [MutationStage:55] 2011-09-26 05:43:59,208
> MessageDeliveryTask.java (line 60) delivering msg
>  INFO [GossipStage:1] 2011-09-26 05:43:59,211 Gossiper.java (line 707)
> InetAddress /10.71.13.5 is now UP
>
>
> ===== So, here the problem is that messages stopped showing up on the
> coordinator from 05:43:35 to 05:43:59, even though the replica was
> trying to send them. Not sure whether this is due to the network.
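>
> (To find these silent windows without eyeballing the whole file, a
> quick throwaway scan like the one below works -- nothing
> Cassandra-specific, it just parses the log4j-style timestamps and
> prints any gap over 5 seconds between consecutive "delivering msg"
> lines:
>
>     import java.io.BufferedReader;
>     import java.io.FileReader;
>     import java.text.SimpleDateFormat;
>     import java.util.Date;
>     import java.util.regex.Matcher;
>     import java.util.regex.Pattern;
>
>     public class FindGaps
>     {
>         public static void main(String[] args) throws Exception
>         {
>             // timestamps look like "2011-09-26 05:43:35,521"
>             Pattern ts = Pattern.compile(
>                 "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");
>             SimpleDateFormat fmt =
>                 new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
>             BufferedReader in = new BufferedReader(new FileReader(args[0]));
>             Date prev = null;
>             String line;
>             while ((line = in.readLine()) != null)
>             {
>                 if (!line.contains("delivering msg"))
>                     continue;
>                 Matcher m = ts.matcher(line);
>                 if (!m.find())
>                     continue;
>                 Date cur = fmt.parse(m.group());
>                 if (prev != null && cur.getTime() - prev.getTime() > 5000)
>                     System.out.println("gap of "
>                         + (cur.getTime() - prev.getTime()) / 1000
>                         + "s ending at: " + line);
>                 prev = cur;
>             }
>             in.close();
>         }
>     }
>
> run as "java FindGaps bad_coord.txt".)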
>
>
> 2) The other problem actually causes a longer delay on the
> application side. At the end of bad_coord.txt we see that the
> TimedOutException and UnavailableException were actually printed out
> much later, after 06:06:33:
> TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 802)
> /10.71.21.46local generation 1317010027, remote generation 1317010027
> TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 847)
> Updating heartbeat state version to 7119 from 7118 for /10.71.21.46
> ...
> ERROR [New I/O server worker #1-15] 2011-09-26 05:43:59,500
> MY_APPLICATION_CODE.java (line 234) :
> Caused by: TimedOutException()
>
> ** The main issue here is that the timestamp of the logging line
> itself says 05:43:59, which is correctly when the TimedOutException
> was generated **, but the line was not actually printed out until
> 06:06:33, 23 minutes later. This causes a 23-minute pause visible to
> the clients.
>
> This 23-minute delay is much more damaging than the 24-second
> message-drop period in 1). I wonder what could have caused it. Could
> it be lock contention on grabbing the log4j appender? But it
> shouldn't be, since I only added the extra logging AFTER I had
> already seen the problem appear.
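>
> (Just to illustrate the distinction between when a timestamp is taken
> and when the line actually reaches the log, here is a toy example --
> plain Java, not Cassandra or log4j code: the message carries the time
> it was created, but if whatever does the writing is blocked, the line
> only shows up much later, still wearing the old timestamp:
>
>     import java.text.SimpleDateFormat;
>     import java.util.Date;
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>
>     public class DelayedLogging
>     {
>         static final SimpleDateFormat FMT =
>             new SimpleDateFormat("HH:mm:ss,SSS");
>         static final BlockingQueue<String> QUEUE =
>             new LinkedBlockingQueue<String>();
>
>         // the timestamp is captured here, when the event is created ...
>         static void log(String msg)
>         {
>             QUEUE.add(FMT.format(new Date()) + " " + msg);
>         }
>
>         public static void main(String[] args) throws Exception
>         {
>             Thread writer = new Thread(new Runnable()
>             {
>                 public void run()
>                 {
>                     try
>                     {
>                         String line = QUEUE.take();
>                         Thread.sleep(10000);      // ... but the writer is stuck
>                         System.out.println(line); // printed late, old timestamp
>                     }
>                     catch (InterruptedException e) { }
>                 }
>             });
>             writer.start();
>             log("simulated TimedOutException logged here");
>             writer.join();
>         }
>     }
>
> Whether the real cause in my case is the appender, the stream behind
> it, or something else entirely, I don't know yet -- this only shows
> that the timestamp alone doesn't tell us when the line was written.)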
>
>
> Thanks for your patience in reading this far :)
>
>
> Yang
>
>
>
>
>
> On Sun, Sep 25, 2011 at 10:32 AM, Yang <teddyyyy...@gmail.com> wrote:
>> thanks Jonathan,
>>
>>
>> I really don't know; I just ran further tests overnight to capture
>> jstack output on the receiving side, and I'm going through those
>> stacks now. If I can't find anything suspicious, I'll add the same
>> debugging to the sending side too.
>>
>> Another useful piece of info: with a single-node setup I also saw a
>> lot of TimedOutExceptions, similar to what I see with the 2-node
>> setup. I don't think I saw UnavailableException there, probably
>> because a single node always believes itself to be available.
>>
>> I don't think GC is the culprit here, since I don't see any lengthy
>> GC logging around the time the delay happens. No compaction or
>> flushing either.
>>
>>
>>
>> On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> What makes you think the problem is on the receiving node, rather than
>>> the sending node?
>>>
>>> On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>> I constantly see TimedOutException, followed by
>>>> UnavailableException, in my logs, so I added some extra debugging
>>>> to Gossiper.notifyFailureDetector():
>>>>
>>>>
>>>>
>>>>    void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
>>>>    {
>>>>        IFailureDetector fd = FailureDetector.instance;
>>>>        EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>>>        logger.debug("notify failure detector");
>>>>        /*
>>>>         * If the local endpoint state exists then report to the FD only
>>>>         * if the versions workout.
>>>>         */
>>>>        if ( localEndpointState != null )
>>>>        {
>>>>            logger.debug("notify failure detector, endpoint");
>>>>            int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
>>>>            int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
>>>>            if ( remoteGeneration > localGeneration )
>>>>            {
>>>>                localEndpointState.updateTimestamp();
>>>>                logger.debug("notify failure detector --- report 1");
>>>>                fd.report(endpoint);
>>>>                return;
>>>>            }
>>>>
>>>>
>>>>
>>>>
>>>> Then I found that this method stopped being called for a period of
>>>> 3 minutes, so of course the failure detector considers the other
>>>> side to be dead.
>>>>
>>>> But since these 2 boxes are in the same EC2 region and the same
>>>> security group, there is no reason for a network problem that
>>>> long. So I ran a background job that just does
>>>>
>>>> echo | nc $the_other_box 7000   in a loop
>>>>
>>>> and this always works fine, never failing to contact port 7000.
>>>>
>>>>
>>>> So somehow the messages were not delivered or received. How could
>>>> I debug this?
>>>> (extra logging attached)
>>>>
>>>> Thanks
>>>> Yang
>>>>
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>>
>
