Tom, you should look at phi_convict_threshold and try increasing its value if you have too much chatter on your network.
Also, rebuilding the entire node because of an OOM does not make sense. Could you please post the C* version you are using and the heap size you have configured?

Thanks
Rahul

On Tue, Dec 3, 2013 at 7:41 PM, Tom van den Berge <t...@drillster.com> wrote:

> Rahul,
>
> This problem occurs every now and then, and currently everything is ok, so
> there are no hints. But whenever it happens, the hints are quickly piling
> up. This results in heap problems on the node ("Heap is 0.813462 full..."
> appears many times). This in turn results in the flushing of the 'hints'
> column family, to relieve memory pressure. According to the log message,
> the size varies between 50 and 60MB. But since the HintedHandoffManager is
> reading from the hints CF, it will probably pull it back into a memtable
> again -- that's at least my understanding of how it works.
>
> So I guess that flushing the hints CF while the HintedHandoffManager is
> working on it only makes things worse, and it could be the reason that the
> process never ends.
>
> What I typically see when this happens is that the hints keep piling up,
> and eventually the node comes to a grinding halt (OOM). Then I have to
> rebuild the node entirely (only removing the hints doesn't work).
>
> The reason for hints to start accumulating in the first place might be a
> spike in CF writes that must be replicated to a node in another data
> center. The available bandwidth to that data center might not be able to
> handle the data quickly enough, resulting in stored hints. The
> HintedHandoff task that is started is targeting that remote node.
>
> Thanks,
> Tom
>
> On Tue, Dec 3, 2013 at 2:22 PM, Rahul Menon <ra...@apigee.com> wrote:
>
>> Tom,
>>
>> Do you know why these hints are piling up? What is the size of the hints
>> CF?
>>
>> Thanks
>> Rahul
>>
>> On Tue, Dec 3, 2013 at 6:41 PM, Tom van den Berge <t...@drillster.com> wrote:
>>
>>> Hi Rahul,
>>>
>>> Thanks for your reply.
>>>
>>> I have never seen a message like "Timed out replaying hints to...", which
>>> is a good thing then, I suppose ;)
>>>
>>> Normally, I do see the "Finished hinted handoff..." log message.
>>> However, every now and then this message is not logged, not even after
>>> several hours. This is the problem I'm trying to solve.
>>>
>>> The log messages you describe are quite coarse-grained; they only tell
>>> you that a task has started or finished, but not how this task is
>>> progressing. And that's exactly what I would like to know if I see that a
>>> task has started, but has not finished after a reasonable amount of time.
>>>
>>> So I guess the only way to learn the progress is to look inside the
>>> 'hints' column family then. I'll give that a try.
>>>
>>> Thanks,
>>> Tom
>>>
>>> On Tue, Dec 3, 2013 at 1:43 PM, Rahul Menon <ra...@apigee.com> wrote:
>>>
>>>> Tom,
>>>>
>>>> You should check the size of the hints column family to determine how
>>>> many are present. The hints are a super column family and its keys are
>>>> destination tokens. You could look at it if you would like.
>>>>
>>>> Hint sends and timeouts are logged; you should be seeing something
>>>> like
>>>>
>>>> Timed out replaying hints to {}; aborting ({} delivered
>>>>
>>>> OR
>>>>
>>>> Finished hinted handoff of {} rows to endpoint {}
>>>>
>>>> Thanks
>>>> Rahul
>>>>
>>>> On Tue, Dec 3, 2013 at 2:36 PM, Tom van den Berge <t...@drillster.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is there a way to monitor the progress of a hinted handoff task?
>>>>>
>>>>> I found the following two mbeans providing some info:
>>>>>
>>>>> org.apache.cassandra.internal:type=HintedHandoff, which tells me that
>>>>> there is 1 active task, and
>>>>> org.apache.cassandra.db:type=HintedHandoffManager#countPendingHints(),
>>>>> which quite often gives a timeout when executed.
>>>>>
>>>>> Ideally, I would like to see how many hints have been sent (e.g. over
>>>>> the last minute or so), and how many hints are still to be sent (although
>>>>> I assume that's what countPendingHints normally does?)
>>>>>
>>>>> I'm experiencing hinted handoff tasks that are started, but never
>>>>> finish, so I would like to know what the task is doing.
>>>>>
>>>>> My log shows this:
>>>>>
>>>>> INFO [HintedHandoff:1] 2013-12-02 13:49:05,325 HintedHandOffManager.java (line 297) Started hinted handoff for host: 6f80b942-5b6d-4233-9827-3727591abf55 with IP: /10.55.156.66
>>>>> (nothing more for [HintedHandoff:1])
>>>>>
>>>>> The node is up and running, the network connection is ok, no gossip
>>>>> messages appear in the logs.
>>>>>
>>>>> Any idea is welcome.
>>>>> (Cassandra 1.2.3)
>>>>>
>>>>> --
>>>>>
>>>>> Drillster BV
>>>>> Middenburcht 136
>>>>> 3452MT Vleuten
>>>>> Netherlands
>>>>>
>>>>> +31 30 755 5330
>>>>>
>>>>> Open your free account at www.drillster.com
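
A possible way to spot the "started but never finished" handoff tasks discussed in this thread is to scan the Cassandra log for "Started hinted handoff" lines that have no later "Finished hinted handoff" or "Timed out replaying hints" line for the same endpoint. The sketch below is only an illustration: the regular expressions are modeled on the log messages quoted above and may not match other Cassandra versions, and the function name `unfinished_handoffs` and the command-line usage are made up for this example.

```python
import re
import sys

# Patterns modeled on the log messages quoted in this thread. The exact
# wording may differ between Cassandra versions, so treat both as assumptions.
STARTED = re.compile(
    r"Started hinted handoff for host: (?P<host>\S+) with IP: /?(?P<ip>\S+)"
)
ENDED = re.compile(
    r"(?:Finished hinted handoff of \d+ rows to endpoint"
    r"|Timed out replaying hints to) /?(?P<ip>[^;\s]+)"
)


def unfinished_handoffs(lines):
    """Map endpoint IP -> most recent 'Started' line with no later end message."""
    open_tasks = {}
    for line in lines:
        started = STARTED.search(line)
        if started:
            # Remember the latest start for this endpoint.
            open_tasks[started.group("ip")] = line.rstrip("\n")
            continue
        ended = ENDED.search(line)
        if ended:
            # A finish or timeout closes whatever was open for this endpoint.
            open_tasks.pop(ended.group("ip"), None)
    return open_tasks


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python unfinished_handoffs.py /path/to/system.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for ip, line in unfinished_handoffs(log_file).items():
            print(f"{ip}: {line}")
```

This only reports which endpoints have an open handoff according to the log. It does not show how far a running task has progressed, which is the gap Tom describes.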