Can you try turning up the logging to TRACE, if you know how turn it up for the 
org.apache.cassandra.gms.FailureDetector and org.apache.cassandra.gms.Gossiper 
loggers

You'll see a lot of messages. Some of the interesting ones will be...

"FailureDetector.java (line 128) reporting /127.0.0.2"  - logged when the end 
point reports 
"FailureDetector.java (line 149) PHI for /127.0.0.2 : 0.9127981004059462" - 
logged when an end point is checked before possibly marking dead

There will also be messages from the gossiper about receiving and sending.

Aaron
 
On 29 Apr 2011, at 14:50, Sheng Chen wrote:

> Thank you for your patch. 
> I believe the latter version I used (the latest 0.7 branch) includes the 
> patch, but the problem remains.
> 
> Is there anything else that may block this heartbeat, like gc?
> Here are some logs during heartbeat failure.
> 
>  INFO [GossipTasks:1] 2011-04-29 07:25:09,716 Gossiper.java (line 228) 
> InetAddress /192.168.125.51 is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:25:20,128 Gossiper.java (line 609) 
> InetAddress /192.168.125.51 is now UP
> ----------
>  INFO [FlushWriter:1] 2011-04-29 07:25:00,135 Memtable.java (line 172) 
> Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-4041-Data.db 
> (76170930 bytes)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:25:00,140 CommitLog.java (line 440) 
> Discarding obsolete commit 
> log:CommitLogSegment(/commitlog/CommitLog-1304033061025.log)
>  INFO [ScheduledTasks:1] 2011-04-29 07:25:20,399 GCInspector.java (line 128) 
> GC for ParNew: 511 ms, -71757976 reclaimed leaving 6614730672 used; max is 
> 8466202624
>  INFO [ScheduledTasks:1] 2011-04-29 07:25:34,632 GCInspector.java (line 128) 
> GC for ParNew: 238 ms, 612197280 reclaimed leaving 4429643304 used; max is 
> 8466202624
> ----------
> 
>  INFO [GossipTasks:1] 2011-04-29 07:17:48,833 Gossiper.java (line 228) 
> InetAddress /192.168.125.51 is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:17:48,854 Gossiper.java (line 609) 
> InetAddress /192.168.125.51 is now UP
> ----------
>  INFO [GossipTasks:1] 2011-04-29 07:17:33,503 Gossiper.java (line 228) 
> InetAddress /192.168.125.49 is now dead.
>  INFO [MutationStage:3] 2011-04-29 07:17:47,751 ColumnFamilyStore.java (line 
> 1064) Enqueuing flush of Memtable-Standard1@461979209(61555223 bytes, 1218975 
> operations)
>  INFO [FlushWriter:1] 2011-04-29 07:17:47,751 Memtable.java (line 157) 
> Writing Memtable-Standard1@461979209(61555223 bytes, 1218975 operations)
>  INFO [GossipStage:1] 2011-04-29 07:17:49,123 Gossiper.java (line 609) 
> InetAddress /192.168.125.49 is now UP
> ----------
> 
>  INFO [GossipTasks:1] 2011-04-29 07:09:46,072 Gossiper.java (line 228) 
> InetAddress /192.168.125.51 is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:09:59,587 Gossiper.java (line 609) 
> InetAddress /192.168.125.51 is now UP
> ----------
>  INFO [FlushWriter:1] 2011-04-29 07:09:36,917 Memtable.java (line 172) 
> Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-4005-Data.db 
> (76199863 bytes)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:36,921 CommitLog.java (line 440) 
> Discarding obsolete commit 
> log:CommitLogSegment(/commitlog/CommitLog-1304032122510.log)
>  INFO [COMMIT-LOG-WRITER] 2011-04-29 07:09:38,089 CommitLogSegment.java (line 
> 50) Creating new commitlog segment /commitlog/CommitLog-1304032178089.log
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,858 GCInspector.java (line 128) 
> GC for ParNew: 511 ms, -61586864 reclaimed leaving 6496071272 used; max is 
> 8466202624
>  WARN [ScheduledTasks:1] 2011-04-29 07:09:59,858 MessagingService.java (line 
> 504) Dropped 1 REQUEST_RESPONSE messages in the last 5000ms
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 51) 
> Pool Name                    Active   Pending
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) 
> ReadStage                         0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,859 StatusLogger.java (line 66) 
> RequestResponseStage              0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) 
> ReadRepairStage                   0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) 
> MutationStage                     0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) 
> GossipStage                       0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,860 StatusLogger.java (line 66) 
> AntiEntropyStage                  0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) 
> MigrationStage                    0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) 
> StreamStage                       0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) 
> MemtablePostFlusher               0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,861 StatusLogger.java (line 66) 
> FILEUTILS-DELETE-POOL             0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) 
> FlushWriter                       0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) 
> MiscStage                         0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,862 StatusLogger.java (line 66) 
> FlushSorter                       0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) 
> InternalResponseStage             0         0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 66) 
> HintedHandoff                     1         2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 70) 
> CompactionManager               n/a      4067
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 82) 
> MessagingService                n/a       0,0
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 86) 
> ColumnFamily                Memtable ops,data  Row cache size/cap  Key cache 
> size/cap
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) 
> Keyspace1.Super1                          0,0                 0/0            
> 0/200000
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) 
> Keyspace1.Standard1           410835,20744126                 0/0        
> 36172/200000
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) 
> system.IndexInfo                          0,0                 0/0             
>     0/1
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89) 
> system.LocationInfo                      1,20                 0/0             
>     2/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) 
> system.Migrations                         0,0                 0/0             
>     0/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) 
> system.HintsColumnFamily       206378,8667876                 0/0             
>    0/26
>  INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89) 
> system.Schema                             0,0                 0/0             
>     0/2
>  INFO [ScheduledTasks:1] 2011-04-29 07:10:06,872 GCInspector.java (line 128) 
> GC for ParNew: 219 ms, 589477312 reclaimed leaving 4204479728 used; max is 
> 8466202624
> ---------- 
>  
>  INFO [GossipTasks:1] 2011-04-29 07:03:43,262 Gossiper.java (line 228) 
> InetAddress /192.168.125.51 is now dead.
>  INFO [GossipStage:1] 2011-04-29 07:03:43,270 Gossiper.java (line 609) 
> InetAddress /192.168.125.51 is now UP
>  ----------
>  INFO [FlushWriter:1] 2011-04-29 07:03:41,061 Memtable.java (line 157) 
> Writing Memtable-Standard1@595540423(61557029 bytes, 1218975 operations)
>  INFO [GossipStage:1] 2011-04-29 07:03:43,538 Gossiper.java (line 609) 
> InetAddress /192.168.125.49 is now UP
>  INFO [FlushWriter:1] 2011-04-29 07:03:46,550 Memtable.java (line 172) 
> Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-3991-Data.db 
> (76184729 bytes)
>  
> 
> 
> 2011/4/28 Jonathan Ellis <jbel...@gmail.com>
> This means a node was too busy with something else to send out its
> heartbeat. Sometimes this is STW GC. Other times it is a bug (one was
> fixed for 0.7.6 in
> https://issues.apache.org/jira/browse/CASSANDRA-2554).
> 
> On Thu, Apr 28, 2011 at 3:57 AM, Sheng Chen <chensheng2...@gmail.com> wrote:
> > Thank you for your advice. Rf>=2 is a good work around.
> > I was using 0.7.4 and have updated to the latest 0.7 branch, which includes
> > 2554 patch.
> > But it doesn't help. I still get lots of UnavailableException after the
> > following logs,
> >  INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228)
> > InetAddress /192.168.125.49 is now dead.
> >  INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609)
> > InetAddress /192.168.125.49 is now UP
> >  INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java
> > (line 304) Started hinted handoff for endpoint /192.168.125.49
> >  INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java
> > (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49
> > It seems that the gossip failure detection is too sensitive. Is there any
> > configuration?
> >
> >
> >
> >
> >
> > 2011/4/27 Sylvain Lebresne <sylv...@datastax.com>
> >>
> >> On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen <chensheng2...@gmail.com>
> >> wrote:
> >> > I succeeded to insert 1 billion records into a single node cassandra,
> >> >>> bin/stress -d cas01 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
> >> > Inserts finished in about 14 hours at a speed of 20k/sec.
> >> > But when I added another node, tests always failed with
> >> > UnavailableException
> >> > in an hour.
> >> >>> bin/stress -d cas01,cas02 -o insert -n 1000000000 -c 5 -S 34 -C5 -t 20
> >> > Writes speed is also 20k/sec because of the bottleneck in the client, so
> >> > the
> >> > pressure on each server node should be 50% of the single node test.
> >> > Why couldn't they handle?
> >> > By default, rf=1, consistency=ONE
> >> > Some information that may be helpful,
> >> > 1. no warn/error in log file, the cluster is still alive after those
> >> > exception
> >> > 2. the last logs on both nodes happen to be a compaction complete info
> >> > 3. gossip log shows one node is dead and then up again in 3 seconds
> >>
> >> That's your problem. Once marked down (and since rf=1), when an update for
> >> cas02 reach cas01 and cas01 has marked cas02 down, it will throw the
> >> UnavailableException.
> >>
> >> Now, it shouldn't have been marked down and I suspect this is due to
> >> https://issues.apache.org/jira/browse/CASSANDRA-2554
> >> (even though you didn't tell which version you're using, I suppose
> >> this is a 0.7.*).
> >>
> >> If you apply this patch or use the svn current 0.7 branch, that should
> >> hopefully
> >> not happen again.
> >>
> >> Note that if you had rf >= 2, the node would still have been marked down
> >> wrongly
> >> for 3 seconds, but that would have been transparent to the stress test.
> >>
> >> > 4. I set hinted_handoff_enabled: false, but still see lots of handoff
> >> > logs
> >>
> >> What are those saying ?
> >>
> >> --
> >> Sylvain
> >
> >
> 
> 
> 
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
> 

Reply via email to