I think some figures from "nodetool tpstats" and "nodetool compactionstats"
may help to see things more clearly.
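
For example, on each node (assuming nodetool is on the PATH and the node is
reachable on its default JMX port):

  # thread pool stats: watch the Pending and "All time blocked" columns
  nodetool tpstats

  # compaction backlog: pending tasks and remaining bytes
  nodetool compactionstats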

And Pavel, when you said batch, did you mean a LOGGED batch or an UNLOGGED
batch?
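
(A LOGGED batch goes through the batchlog for atomicity; UNLOGGED skips it.
A minimal cqlsh sketch of the difference, with made-up keyspace and table
names:)

  cqlsh <<'EOF'
  -- LOGGED is the default for BEGIN BATCH: atomic, but extra write overhead
  BEGIN BATCH
    INSERT INTO ks.entity (id, payload) VALUES (1, 0x00);
    INSERT INTO ks.entity_lookup (name, id) VALUES ('a', 1);
  APPLY BATCH;

  -- UNLOGGED skips the batchlog: cheaper, no atomicity across partitions
  BEGIN UNLOGGED BATCH
    INSERT INTO ks.entity (id, payload) VALUES (2, 0x01);
    INSERT INTO ks.entity_lookup (name, id) VALUES ('b', 2);
  APPLY BATCH;
  EOF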





On Fri, Jun 20, 2014 at 8:02 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> If you have 32 GB RAM, the heap is probably 8 GB.
> 200 writes of 100 KB per second would be 20 MB/s in the worst case,
> supposing all writes for a replica go to a single node.
> I really don't see any reason why that should be filling up the heap.
> Anyone else?
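>
> (A quick way to double-check the heap size: in 2.0, cassandra-env.sh
> defaults to max(min(1/2 RAM, 1 GB), min(1/4 RAM, 8 GB)), which gives 8 GB
> on a 32 GB box. Paths below assume a tarball install:)
>
>   grep -i 'MAX_HEAP_SIZE=' conf/cassandra-env.sh
>   # or inspect the running JVM's flags:
>   ps aux | grep [c]assandra | grep -o 'Xmx[^ ]*'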
>
> But did you check the logs for the GCInspector?
> In my case, nodes are going down because of the heap; in your case, maybe
> it's something else.
> Do you see increasing GCInspector times in the logs?
>
> []s
>
>
>
> 2014-06-20 14:51 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>
>> Hi Marcelo,
>>
>> No pending write tasks. I am writing a lot: about 100-200 writes, each
>> up to 100 KB, every 15 seconds.
>> It is running on a decent cluster of 5 identical nodes: quad-core i7s
>> with 32 GB RAM and 480 GB SSDs.
>>
>> Regards,
>>   Pavel
>>
>>
>> On Fri, Jun 20, 2014 at 12:31 PM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> Pavel,
>>>
>>> In my case, the heap was filling up faster than it was draining. I am
>>> still looking for the cause, since draining should be really fast with
>>> SSDs.
>>>
>>> However, in your case you could check (AFAIK) nodetool tpstats and see
>>> if there are too many pending write tasks, for instance. Maybe you really
>>> are writing more than the nodes are able to flush to disk.
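>>>
>>> (Something like this, assuming the default 2.0 pool names; a Pending
>>> count that keeps growing for MutationStage or FlushWriter would point
>>> that way:)
>>>
>>>   nodetool tpstats | grep -E 'Pool Name|MutationStage|FlushWriter'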
>>>
>>> How many writes per second are you achieving?
>>>
>>> Also, I would look for GCInspector in the log:
>>>
>>> cat system.log* | grep GCInspector | wc -l    # how many GC pauses were logged
>>> tail -1000 system.log | grep GCInspector      # the most recent ones, with pause times
>>>
>>> Do you see it running a lot? Is it taking longer and longer each time
>>> it runs?
>>>
>>> I am no Cassandra expert, but I would try these things first and post
>>> the results here. Maybe other people on the list have more ideas.
>>>
>>> Best regards,
>>> Marcelo.
>>>
>>>
>>> 2014-06-20 8:50 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>>>
>>>> The cluster is new, so no updates were done. The version is 2.0.8.
>>>> It happened when I did many writes (no reads). The writes are done in
>>>> small batches of 2 inserts (writing to 2 column families). The values
>>>> are big blobs (up to 100 KB).
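>>>>
>>>> (Roughly this shape in cqlsh terms, with made-up names; note that
>>>> BEGIN BATCH is LOGGED unless UNLOGGED is specified:)
>>>>
>>>>   cqlsh <<'EOF'
>>>>   -- one small batch per event: 2 inserts into 2 column families
>>>>   BEGIN BATCH
>>>>     INSERT INTO ks.cf1 (id, payload) VALUES (1, 0xcafe);  -- blob, up to ~100 KB
>>>>     INSERT INTO ks.cf2 (id, payload) VALUES (1, 0xcafe);
>>>>   APPLY BATCH;
>>>>   EOF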
>>>>
>>>> Any clues?
>>>>
>>>> Pavel
>>>>
>>>>
>>>> On Thu, Jun 19, 2014 at 8:07 PM, Marcelo Elias Del Valle <
>>>> marc...@s1mbi0se.com.br> wrote:
>>>>
>>>>> Pavel,
>>>>>
>>>>> Out of curiosity, did it start to happen after some update? Which
>>>>> version of Cassandra are you using?
>>>>>
>>>>> []s
>>>>>
>>>>>
>>>>> 2014-06-19 16:10 GMT-03:00 Pavel Kogan <pavel.ko...@cortica.com>:
>>>>>
>>>>>> What a coincidence! It happened today in my cluster of 7 nodes as well.
>>>>>>
>>>>>> Regards,
>>>>>>   Pavel
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 18, 2014 at 11:13 AM, Marcelo Elias Del Valle <
>>>>>> marc...@s1mbi0se.com.br> wrote:
>>>>>>
>>>>>>> I have a 10-node cluster with Cassandra 2.0.8.
>>>>>>>
>>>>>>> I am getting these warnings in the log when I run my code. All my code
>>>>>>> does is read data from a CF and, in some cases, write new data.
>>>>>>>
>>>>>>>  WARN [Native-Transport-Requests:553] 2014-06-18 11:04:51,391 BatchStatement.java (line 228) Batch of prepared statements for [identification1.entity, identification1.entity_lookup] is of size 6165, exceeding specified threshold of 5120 by 1045.
>>>>>>>  WARN [Native-Transport-Requests:583] 2014-06-18 11:05:01,152 BatchStatement.java (line 228) Batch of prepared statements for [identification1.entity, identification1.entity_lookup] is of size 21266, exceeding specified threshold of 5120 by 16146.
>>>>>>>  WARN [Native-Transport-Requests:581] 2014-06-18 11:05:20,229 BatchStatement.java (line 228) Batch of prepared statements for [identification1.entity, identification1.entity_lookup] is of size 22978, exceeding specified threshold of 5120 by 17858.
>>>>>>>  INFO [MemoryMeter:1] 2014-06-18 11:05:32,682 Memtable.java (line 481) CFS(Keyspace='OpsCenter', ColumnFamily='rollups300') liveRatio is 14.249755859375 (just-counted was 9.85302734375). calculation took 3ms for 1024 cells
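>>>>>>>
>>>>>>> (If I read the batch warnings right, 5120 bytes is the default
>>>>>>> batch_size_warn_threshold_in_kb of 5 KB that 2.0.8 seems to have
>>>>>>> introduced in cassandra.yaml:)
>>>>>>>
>>>>>>>   grep batch_size_warn conf/cassandra.yaml
>>>>>>>   # batch_size_warn_threshold_in_kb: 5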
>>>>>>>
>>>>>>> After some time, one node of the cluster goes down. It comes back
>>>>>>> after a few seconds, and then another node goes down. It keeps
>>>>>>> happening: there is always a node down in the cluster, and when it
>>>>>>> comes back another one falls.
>>>>>>>
>>>>>>> The only exception I see in the log is "connection reset by peer",
>>>>>>> which seems to be related to the gossip protocol, when a node goes
>>>>>>> down.
>>>>>>>
>>>>>>> Any hints on what I could do to investigate this problem further?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Marcelo Valle.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
