Follow-up:
I was mistaken in saying there weren't any writes to the cluster. There's a
process doing a couple of mutations per second.

I just restarted node #3 and found this message on node #1:
 INFO [HintedHandoff:1] 2011-10-05 12:25:08,173 HintedHandOffManager.java (line 314) Endpoint /xx.xx.xx.180 died before hint delivery, aborting

Could HH have gotten the nodes on the receiving end stuck? Is there any way
to throttle it?
If it can't be throttled and you confirm HH is a suspect, I may simply
disable it: I'm running a repair 3 times a week (once per node), so I guess
my cluster won't drift too far out of sync.
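For reference, these are the settings I would be touching, grepping the
cassandra.yaml shipped by the Debian package (the path is the packaged
default; whether 0.8.6 even exposes a throttle knob is my assumption, which
is partly why I'm asking):

  grep -n hinted_handoff /etc/cassandra/cassandra.yaml
  # hinted_handoff_enabled: true   <- setting this to false should disable HH entirely
  # plus a hint throttle delay setting, if this version has one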

Thanks

2011/10/5 Philippe <watche...@gmail.com>

> Thanks for the quick responses.
>
> @Yi
> Using Hector 0.8.0-1
> Hardware is:
>
>    - AMD Opteron 4174, 6x 2.30+ GHz
>    - 32 GB DDR3
>    - 1 Gbps lossless
>
>
> @aaron
> I'm running 0.8.6 on all nodes, straight from the Debian packages.
> I get hinted handoffs from time to time because of flapping; I've been planning
> to increase the phi threshold as per another thread but haven't yet.
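> (For reference, I believe the setting in question is phi_convict_threshold in
> cassandra.yaml; checking with a quick grep, from memory the stock 0.8 file has
> it commented out at its default:
>   grep phi_convict_threshold /etc/cassandra/cassandra.yaml
>   # phi_convict_threshold: 8
> but don't quote me on the exact default.)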
> Here are the HH per node:
> HintedHandoff                     0         0            437         0                 0
> HintedHandoff                     0         0              2         0                 0
> HintedHandoff                     0         0           1798         0                 0
> Not seeing any iowait on my Munin CPU graph, at least not more than over the
> past couple of weeks. There is a little more on the 2nd node because it's
> also holding a MySQL database that gets hit hard.
> The Munin iostat graph shows an average of 10-20k blocks/s read/write.
>
> Nothing special was happening besides the weekly repair on node two, which
> started yesterday at 4am. That one failed with:
> ERROR [AntiEntropySessions:5] 2011-10-04 04:03:56,676 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-0d4125f0-aab7-403b-a083-7c19ef6579b1, endpoint /xx.xx.xx.180 died
> Then the next planned repairs failed before starting:
> INFO [AntiEntropySessions:8] 2011-10-04 04:03:57,137 AntiEntropyService.java (line 658) Could not proceed on repair because a neighbor (/xx.xx.xx.180) is dead: manual-repair-5dc4c221-3a15-4031-9aa8-0931e41816cd failed
>
> Looking at the logs on that node shows no Exception. I was about to say
> "nothing special was happening at that time", except that at 4am the GC
> started working hard: it got the heap down to 9GB, then usage shot straight
> back up to almost 16GB, so I guess ParNew couldn't keep up and
> ConcurrentMarkSweep had to step in and basically hung the server? It took
> another 2 minutes until I got the "Heap is 0.75 full" message, and I get a
> lot of StatusLogger messages before that.
> So it looks like computing the Merkle tree was very expensive this time...
> I wonder why? Anything I can do to handle this?
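> (For completeness, the weekly repair is nothing exotic, just a cron'd
> nodetool call along these lines, one node per day; the exact entry is
> reconstructed from memory:
>   0 4 * * <day>  nodetool -h localhost repair
> i.e. a plain repair of everything on the node rather than per column family.)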
>
> INFO [ScheduledTasks:1] 2011-10-04 04:03:12,874 GCInspector.java (line 122) GC for ParNew: 427 ms for 2 collections, 16227154488 used; max is 16838033408
>  INFO [GossipTasks:1] 2011-10-04 04:04:24,092 Gossiper.java (line 697) InetAddress /xx.xx.xx.97 is now dead.
>  INFO [ScheduledTasks:1] 2011-10-04 04:04:24,093 GCInspector.java (line 122) GC for ParNew: 26376 ms for 2 collections, 8832125416 used; max is 16838033408
> (no GC logs until)
>  INFO [ScheduledTasks:1] 2011-10-04 04:04:24,251 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 16250 ms for 3 collections, 9076209720 used; max is 16838033408
>  WARN [ScheduledTasks:1] 2011-10-04 04:06:52,777 GCInspector.java (line 143) Heap is 0.752707974197173 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
>
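> In case it helps, these are the memory-pressure settings from my cassandra.yaml.
> I haven't touched them, so they should be the stock 0.8 defaults, but
> double-check the exact values against your own copy:
>   grep -nE 'flush_largest_memtables_at|reduce_cache' /etc/cassandra/cassandra.yaml
>   # flush_largest_memtables_at: 0.75
>   # reduce_cache_sizes_at: 0.85
>   # reduce_cache_capacity_to: 0.6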
>
>
> 2011/10/5 aaron morton <aa...@thelastpickle.com>
>
>> Lots of hinted handoff can give you mutations…
>>
>> HintedHandoff                     0         0           1798         0                 0
>>
>>
>> 1798 is somewhat high. Those are the HH tasks on this node though; can you
>> see HH running on other nodes in the cluster? What has been happening on
>> this node?
>>
>> HH is throttled to avoid this sort of thing; what version are you on?
>>
>> Also looks like the disk IO could not keep up with the flushing…
>>
>> FlushWriter                       0         0           5714         0               499
>>
>> You need to provide some more info on what was happening to the nodes
>> beforehand, and check the logs on all machines for errors etc.
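>> Something quick along these lines on each node is usually enough to start
>> with (adjust the path to your install; if I remember correctly the Debian
>> packages log to /var/log/cassandra by default):
>>   grep -E 'ERROR|WARN' /var/log/cassandra/system.log | tail -n 50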
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 5/10/2011, at 9:52 PM, Yi Yang wrote:
>>
>> Well, what client are you using? And can you give a hint about your node
>> hardware?
>>
>> Sent from my BlackBerry® wireless device
>> ------------------------------
>> From: Philippe <watche...@gmail.com>
>> Date: Wed, 5 Oct 2011 10:33:21 +0200
>> To: user <user@cassandra.apache.org>
>> Reply-To: user@cassandra.apache.org
>> Subject: Why is mutation stage increasing ??
>>
>> Hello,
>> I have my 3-node, RF=3 cluster acting strangely. Can someone shed some light
>> on what is going on?
>> It was stuck for a couple of hours (all clients timing out). nodetool
>> tpstats showed a huge and increasing MutationStage pending count (in the
>> hundreds of thousands).
>> I restarted one node and it took a while to replay GBs of commitlog. I've
>> shut down all clients that write to the cluster and it's just weird.
>>
>> All nodes are still showing a huge MutationStage backlog, including the
>> restarted one, and it's either increasing or stable. The active count is
>> stuck at 32.
>> Compactionstats shows no compaction on 2 nodes and dozens of Scrub
>> compactions (all at 100%) on the 3rd one; this is a scrub I did last week
>> when I encountered assertion errors.
>> Netstats shows no streams being exchanged on any node, but each one is
>> expecting a few Responses.
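>> (All the numbers above come from the stock tools run locally on each node,
>> roughly:
>>   nodetool -h localhost tpstats
>>   nodetool -h localhost compactionstats
>>   nodetool -h localhost netstats
>> with nothing custom in between.)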
>>
>> Any ideas?
>> Thanks
>>
>> For example (increased to 567062 while I was writing this email)
>> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
>> ReadStage                         0         0    18372664517         0                 0
>> RequestResponseStage              0         0    10731370183         0                 0
>> MutationStage                    32    565879      295492216         0                 0
>> ReadRepairStage                   0         0          23654         0                 0
>> ReplicateOnWriteStage             0         0        7733659         0                 0
>> GossipStage                       0         0        3502922         0                 0
>> AntiEntropyStage                  0         0           1631         0                 0
>> MigrationStage                    0         0              0         0                 0
>> MemtablePostFlusher               0         0           5716         0                 0
>> StreamStage                       0         0             10         0                 0
>> FlushWriter                       0         0           5714         0               499
>> FILEUTILS-DELETE-POOL             0         0            773         0                 0
>> MiscStage                         0         0           1266         0                 0
>> FlushSorter                       0         0              0         0                 0
>> AntiEntropySessions               0         0             18         0                 0
>> InternalResponseStage             0         0              0         0                 0
>> HintedHandoff                     0         0           1798         0                 0
>>
>>
>> Mode: Normal
>> Not sending any streams.
>> Not receiving any streams.
>> Pool Name                    Active   Pending      Completed
>> Commands                        n/a         0     1223769753
>> Responses                       n/a         4     1627481305
>>
>>
>>
>
