Rahul,

This problem occurs only now and then. Currently everything is ok, so
there are no hints. But whenever it happens, the hints pile up quickly.
This results in heap problems on the node ("Heap is 0.813462 full..."
appears many times), which in turn causes the 'hints' column family to be
flushed to relieve memory pressure. (According to the log messages, its
size varies between 50 and 60MB.) But since the HintedHandoffManager is
reading from the hints CF, it will probably pull it back into a memtable
again -- that's at least my understanding of how it works.

So I guess that flushing the hints CF while the HintedHandoffManager is
working on it only makes things worse, and it could be the reason that the
process never ends.

What I typically see when this happens is that the hints keep piling up,
and eventually the node comes to a grinding halt (OOM). Then I have to
rebuild the node entirely (only removing the hints doesn't work).
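
To catch this state early, the log can be scanned for handoffs that were
started but never reported as finished, along with the heap warnings. Below
is a rough sketch, not a definitive tool. The patterns are taken from the
log messages quoted in this thread, the function name scan_log is my own,
and the exact wording of the messages may differ in other Cassandra
versions.

```python
import re

# Patterns based on the log messages quoted in this thread (assumption:
# the wording is the same in the Cassandra version being monitored).
STARTED = re.compile(r"Started hinted handoff for host: (\S+) with IP: /?(\S+)")
FINISHED = re.compile(r"Finished hinted handoff of (\d+) rows to endpoint /?(\S+)")
HEAP = re.compile(r"Heap is (\d+(?:\.\d+)?) full")


def scan_log(lines):
    """Return (pending, heap_warnings) for an iterable of log lines.

    pending maps endpoint IP -> host ID for every handoff that was started
    and has no later "Finished" line for the same endpoint.
    heap_warnings lists the reported heap fractions in order of appearance.
    """
    pending = {}
    heap_warnings = []
    for line in lines:
        m = STARTED.search(line)
        if m:
            pending[m.group(2)] = m.group(1)
            continue
        m = FINISHED.search(line)
        if m:
            # A finished handoff clears the pending entry for that endpoint.
            pending.pop(m.group(2), None)
            continue
        m = HEAP.search(line)
        if m:
            heap_warnings.append(float(m.group(1)))
    return pending, heap_warnings
```

Passing it an open log file (for example open("system.log"), where the
path is whatever your installation uses) gives the endpoints whose handoff
is still outstanding and how often the heap warning was logged.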

The reason for hints to start accumulating in the first place might be a
spike in CF writes that must be replicated to a node in another data
center. The available bandwidth to that data center might not be able to
handle the data quickly enough, resulting in stored hints. The
HintedHandoff task that is started is targeting that remote node.
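
The bandwidth reasoning above can be put into numbers with a very
simplified model: everything written above the link capacity ends up as
hints, and the backlog can only drain from spare capacity. This is a sketch
under that assumption only. The function names and any figures you feed it
are hypothetical, and real hint delivery is affected by more than raw
bandwidth.

```python
def hint_backlog_mb(write_rate_mb_s, link_capacity_mb_s, seconds):
    """Data (MB) that could not be shipped during a write spike.

    Simplified model: anything above the link capacity is stored as hints.
    """
    excess = max(0.0, write_rate_mb_s - link_capacity_mb_s)
    return excess * seconds


def drain_time_s(backlog_mb, write_rate_mb_s, link_capacity_mb_s):
    """Seconds needed to replay the backlog at the given ongoing write rate.

    Returns None when there is no spare capacity, i.e. the backlog
    never drains in this model.
    """
    spare = link_capacity_mb_s - write_rate_mb_s
    if spare <= 0:
        return None
    return backlog_mb / spare
```

In this model a handoff that never finishes corresponds to drain_time_s
returning None: the regular writes already use up the whole link.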


Thanks,
Tom


On Tue, Dec 3, 2013 at 2:22 PM, Rahul Menon <ra...@apigee.com> wrote:

> Tom,
>
> Do you know why these hints are piling up? What is the size of the hints
> cf?
>
> Thanks
> Rahul
>
>
> On Tue, Dec 3, 2013 at 6:41 PM, Tom van den Berge <t...@drillster.com> wrote:
>
>> Hi Rahul,
>>
>> Thanks for your reply.
>>
>> I have never seen a message like "Timed out replaying hints to...", which
>> is a good thing then, I suppose ;)
>>
>> Normally, I do see the "Finished hinted handoff..." log message. However,
>> every now and then this message is not logged, not even after several
>> hours. This is the problem I'm trying to solve.
>>
>> The log messages you describe are quite coarse-grained; they only tell
>> you that a task has started or finished, but not how this task is
>> progressing. And that's exactly what I would like to know if I see that a
>> task has started, but has not finished after a reasonable amount of time.
>>
>> So I guess the only way to learn the progress is to look inside the
>> 'hints' column family then. I'll give that a try.
>>
>>
>> Thanks,
>> Tom
>>
>>
>> On Tue, Dec 3, 2013 at 1:43 PM, Rahul Menon <ra...@apigee.com> wrote:
>>
>>> Tom,
>>>
>>> You should check the size of the hints column family to determine how
>>> many are present. The hints are a super column family and its keys are
>>> destination tokens. You could look at it if you would like.
>>>
>>> Hint sends and timeouts are logged; you should be seeing something like
>>>
>>> Timed out replaying hints to {}; aborting ({} delivered)
>>>
>>> OR
>>>
>>> Finished hinted handoff of {} rows to endpoint {}
>>>
>>>
>>>
>>> Thanks
>>> Rahul
>>>
>>>
>>> On Tue, Dec 3, 2013 at 2:36 PM, Tom van den Berge <t...@drillster.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there a way to monitor the progress of a hinted handoff task?
>>>>
>>>> I found the following two mbeans providing some info:
>>>>
>>>> org.apache.cassandra.internal:type=HintedHandoff, which tells me that
>>>> there is 1 active task, and
>>>> org.apache.cassandra.db:type=HintedHandoffManager#countPendingHints(),
>>>> which quite often gives a timeout when executed.
>>>>
>>>> Ideally, I would like to see how many hints have been sent (e.g. over
>>>> the last minute or so), and how many hints are still to be sent (although I
>>>> assume that's what countPendingHints normally does?)
>>>>
>>>> I'm experiencing hinted handoff tasks that are started, but never
>>>> finish, so I would like to know what the task is doing.
>>>>
>>>> My log shows this:
>>>>
>>>> INFO [HintedHandoff:1] 2013-12-02
>>>> 13:49:05,325 HintedHandOffManager.java (line 297) Started hinted handoff
>>>> for host: 6f80b942-5b6d-4233-9827-3727591abf55 with IP: /10.55.156.66
>>>> (nothing more for [HintedHandoff:1])
>>>>
>>>> The node is up and running, the network connection is ok, no gossip
>>>> messages appear in the logs.
>>>>
>>>> Any idea is welcome.
>>>> (Cassandra 1.2.3)
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Drillster BV
>>>> Middenburcht 136
>>>> 3452MT Vleuten
>>>> Netherlands
>>>>
>>>> +31 30 755 5330
>>>>
>>>> Open your free account at www.drillster.com
>>>>
>>>
>>>
>>
>>
>>
>
>

