hmmmm. never mind, possibly the first 24-second delay was caused by GC: the GC logging was not printed in system.log, but I found one line on stdout that possibly corresponds to it.
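In case anyone else wants to line this up, here is roughly what I am going to add. This is only a sketch against the stock conf/cassandra-env.sh, it assumes the CMS collector, the gc.log path is just a placeholder, and I am assuming the "parallel remark" param I mention below is -XX:+CMSParallelRemarkEnabled:

# sketch: send raw GC output to its own file (instead of stdout), with timestamps,
# and record how long the application was actually stopped at each pause
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"    # placeholder path
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"

With -Xloggc in place, a 24-second stop-the-world pause (if that is what this was) should show up with a timestamp I can line up against the silent period in system.log.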
I also found I left out the enable-parallel-remark param; let me add that and retry.

Thanks
Yang

On Mon, Sep 26, 2011 at 12:25 AM, Yang <teddyyyy...@gmail.com> wrote:
> ok, I got more detailed logging that seems to give some clue about this mystery...
>
> I enabled TRACE level logging for Gossiper, and also added a debug line to
> MessageDeliveryTask so that every arriving msg shows up in the logging.
>
> I have 2 nodes, and all the requests come from the coordinator node.
> basically I see 2 possible issues.
>
> 1) near the front of bad_coord.txt, we see that the coordinator was
> receiving messages correctly until
>
> DEBUG [RequestResponseStage:22] 2011-09-26 05:43:35,521
> MessageDeliveryTask.java (line 60) delivering msg
>
> then the line right after it is
>
> DEBUG [RequestResponseStage:24] 2011-09-26 05:43:59,516
> MessageDeliveryTask.java (line 60) delivering msg
>
> so there is a 24-second gap in which no messages were delivered on the
> coordinator. that is enough to trigger a timeout and mark the peer
> unavailable, and indeed we see TimedOutException and UnavailableException
> later in the log.
>
> during this period, on the replica box, bad_replica.txt shows correct
> message delivery up to
>
> DEBUG [ReadStage:107] 2011-09-26 05:43:35,214 MessageDeliveryTask.java
> (line 60) delivering msg
>
> then on the next line it starts doing something different:
>
> TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 125)
> My heartbeat is now 5721
> TRACE [GossipTasks:1] 2011-09-26 05:43:35,570 Gossiper.java (line 357)
> Gossip Digests are : /10.71.13.5:1317009996:5765
> ec2-50-19-172-99.compute-1.amazonaws.com/10.71.21.46:1317010027:5721
>
> I did not read the code in detail to understand what this means, but it
> seems to correspond in time with the "silent period" on the coordinator.
> it keeps doing this until
>
> INFO [GossipTasks:1] 2011-09-26 05:43:53,638 Gossiper.java (line 721)
> InetAddress /10.71.13.5 is now dead.
>
> and then
>
> DEBUG [MutationStage:55] 2011-09-26 05:43:59,208
> MessageDeliveryTask.java (line 60) delivering msg
> INFO [GossipStage:1] 2011-09-26 05:43:59,211 Gossiper.java (line 707)
> InetAddress /10.71.13.5 is now UP
>
> ===== so the problem here is that messages stopped showing up on the
> coordinator from 05:43:35 to 05:43:59, even though the replica was trying
> to send them. not sure if this is due to the network.
>
> 2) the other problem actually causes a longer delay on the application side.
> at the end of bad_coord.txt we see that the TimedOutException and
> UnavailableException were actually printed out much later, after 06:06:33:
>
> TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 802)
> /10.71.21.46local generation 1317010027, remote generation 1317010027
> TRACE [GossipStage:1] 2011-09-26 06:06:33,107 Gossiper.java (line 847)
> Updating heartbeat state version to 7119 from 7118 for /10.71.21.46
> ...
> ERROR [New I/O server worker #1-15] 2011-09-26 05:43:59,500
> MY_APPLICATION_CODE.java (line 234) :
> Caused by: TimedOutException()
>
> ** the main issue here is that the timestamp of the logging line itself
> says 05:43:59, which is correct (that is when the TimedOutException was
> generated) **, but the line was not actually printed out until 06:06:33,
> 23 minutes later. this causes a 23-minute pause visible to the clients.
>
> this 23-minute delay is much more damaging than the 24-second message-drop
> period in 1). I wonder what could have caused this 23-minute delay. could
> it be lock contention on grabbing the log4j appender??
> but it shouldn't be, since I only added the extra logging AFTER I saw the
> problem appear the previous time.
>
> Thanks for your patience in reading this far :)
>
> Yang
>
> On Sun, Sep 25, 2011 at 10:32 AM, Yang <teddyyyy...@gmail.com> wrote:
>> thanks Jonathan,
>>
>> I really don't know. I just ran further tests to catch jstack output on
>> the receiving side overnight, and I'm going through those stacks now. if I
>> can't find anything suspicious, I'll add this debugging to the sending
>> side too.
>>
>> another useful piece of info: when I did a single-node setup, I also saw
>> a lot of TimedOutExceptions, similar to what I see with the 2-node setup.
>> I think I didn't see the UnavailableException, probably because with a
>> single node, the node always believes itself to be available.
>>
>> I think GC is not the culprit here, since I don't see any lengthy GC
>> logging around the time the delay happens. no compaction/flushing either.
>>
>> On Sun, Sep 25, 2011 at 6:33 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> What makes you think the problem is on the receiving node, rather than
>>> the sending node?
>>>
>>> On Sun, Sep 25, 2011 at 1:19 AM, Yang <teddyyyy...@gmail.com> wrote:
>>>> I constantly see TimedOutException, followed by UnavailableException,
>>>> in my logs, so I added some extra debugging to
>>>> Gossiper.notifyFailureDetector() (excerpt):
>>>>
>>>> void notifyFailureDetector(InetAddress endpoint, EndpointState remoteEndpointState)
>>>> {
>>>>     IFailureDetector fd = FailureDetector.instance;
>>>>     EndpointState localEndpointState = endpointStateMap.get(endpoint);
>>>>     logger.debug("notify failure detector");
>>>>     /*
>>>>      * If the local endpoint state exists then report to the FD only
>>>>      * if the versions workout.
>>>>      */
>>>>     if ( localEndpointState != null )
>>>>     {
>>>>         logger.debug("notify failure detector, endpoint");
>>>>         int localGeneration = localEndpointState.getHeartBeatState().getGeneration();
>>>>         int remoteGeneration = remoteEndpointState.getHeartBeatState().getGeneration();
>>>>         if ( remoteGeneration > localGeneration )
>>>>         {
>>>>             localEndpointState.updateTimestamp();
>>>>             logger.debug("notify failure detector --- report 1");
>>>>             fd.report(endpoint);
>>>>             return;
>>>>         }
>>>>
>>>> then I found that this method stopped being called for a period of 3
>>>> minutes, so of course the detector considers the other side to be dead.
>>>>
>>>> but since these 2 boxes are in the same EC2 region and the same security
>>>> group, there is no reason for a network issue to last that long. so I
>>>> ran a background job that just does
>>>>
>>>> echo | nc $the_other_box 7000
>>>>
>>>> in a loop, and this always works fine, without ever failing to contact
>>>> port 7000.
>>>>
>>>> so somehow the messages were not delivered or received; how could I
>>>> debug this? (extra logging attached)
>>>>
>>>> Thanks
>>>> Yang
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
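P.S. in case it helps anyone reproducing this, the background sampling I'm running alongside the nc loop above is roughly the following. It is only a rough sketch: the pgrep pattern, the /tmp file paths, and the 5-second interval are just what I picked, so adjust as needed.

while true; do
    ts=$(date '+%F %T')
    # thread dump of the Cassandra process, to see what it is doing during a "silent period"
    echo "== $ts" >> /tmp/jstack.log
    jstack $(pgrep -f CassandraDaemon) >> /tmp/jstack.log 2>&1
    # check that the storage port on the other box is still reachable from here
    if echo | nc -w 2 $the_other_box 7000 > /dev/null; then
        echo "$ts port 7000 reachable" >> /tmp/nc.log
    else
        echo "$ts port 7000 NOT reachable" >> /tmp/nc.log
    fi
    sleep 5
done

If the 24-second gap lines up with a long pause in the GC log while the nc checks keep succeeding, that would point at the JVM rather than the network.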