responses below. thanks!

On Fri, Jul 6, 2012 at 3:09 PM, aaron morton <aa...@thelastpickle.com> wrote:
> It looks like this happens when there is a promotion failure.
>
> Java Heap is full.
> Memory is fragmented.
> Use C for web scale.

unfortunately i became too dumb to use C around 2004. camping accident.

> Also is it normal to see the "Heap is xx full. You may need to reduce
> memtable and/or cache sizes" message quite often? I haven't turned on row
> caches or changed any default memtable size settings so I am wondering why
> the old gen fills up.
>
> It's odd to get that out of the box with an 8GB heap on a 1.1.X install.
>
> What sort of work load ? Is it under heavy inserts ?

opscenter shows between 60-120 writes/sec and between 80-150 reads/sec
total for both machines. i am not sure if that is considered heavy or not.
the machines don't seem particularly busy. load seems pretty even across
both.

> Do you have a lot of CF's ? A lot of secondary indexes ?

i have 15 column families with maybe 4 that are larger and active. there
are a couple of secondary indexes. opscenter uses 8 CFs and system 7. total
data is ~100GB.

> After the messages is it able to reduce heap usage ?

seems like it, they occur every few minutes for a while and then stop.

> Does it seem to correlate to compactions ?

no.

> Is the node able to get back to a healthy state ?

yes. after the gc finishes it rejoins the cluster.

> If this is testing are you able to pull back to a workload where the
> issues do not appear ?

i am guessing so. i am running a data-heavy background processing job. when
i reduced the thread count from 20 to 15 the problem has happened only once
in the past 2 days vs 2-3 times a day. we are just starting to use
cassandra so i am more worried about when more critical web traffic hits.

> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 7/07/2012, at 4:33 AM, feedly team wrote:
>
> I reduced the load and the problem hasn't been happening as much.
> After enabling gc logging, I see messages mentioning promotion failed when
> the pauses happen. It looks like this happens when there is a promotion
> failure. From reading on the web it looks like I could try reducing the
> CMSInitiatingOccupancyFraction value and/or decreasing the young gen size
> to try to avoid this scenario.
>
> Also is it normal to see the "Heap is xx full. You may need to reduce
> memtable and/or cache sizes" message quite often? I haven't turned on row
> caches or changed any default memtable size settings so I am wondering why
> the old gen fills up.
>
> On Wed, Jul 4, 2012 at 6:28 AM, aaron morton <aa...@thelastpickle.com> wrote:
>
>> What accounts for the much larger virtual number? some kind of off-heap
>> memory?
>>
>> http://wiki.apache.org/cassandra/FAQ#mmap
>>
>> I'm a little puzzled as to why I would get such long pauses without
>> swapping.
>>
>> The two are not related. On startup the JVM memory is locked so it will
>> not swap; from then on memory management is pretty much up to the JVM.
>>
>> Getting a lot of ParNew activity does not mean the JVM is low on memory,
>> it means there is a lot of activity in the new heap.
>>
>> If you have a lot of insert activity (typically in a load test) you can
>> generate a lot of GC activity. Try reducing the load to a point where it
>> does not hit GC and then increase it to find the cause. Also, if you can
>> connect JConsole to the JVM you may get a better view of the heap usage.
>>
>> Hope that helps.
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 3/07/2012, at 3:41 PM, feedly team wrote:
>>
>> Couple more details. I confirmed that swap space is not being used (free
>> -m shows 0 swap) and cassandra.log has a message like "JNA mlockall
>> successful". top shows the process having 9g in resident memory but 21.6g
>> in virtual... What accounts for the much larger virtual number? some kind
>> of off-heap memory?
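The resident-vs-virtual gap asked about above is what the mmap FAQ entry describes: memory-mapped SSTable files count toward VIRT in top but are not Java heap. A rough way to eyeball it on Linux is to compare VmSize and VmRSS in /proc. A minimal sketch; it inspects the current shell's own pid ($$) only so the snippet runs anywhere, and the `pgrep` pattern is an assumption about the Cassandra process name:

```shell
# Compare virtual vs resident size for a process on Linux. For a
# Cassandra node using mmap'd disk access, the gap is largely
# memory-mapped SSTable files (plus shared libraries), not Java heap.
# $$ is this shell's pid; in practice substitute the Cassandra pid,
# e.g. from: pgrep -f CassandraDaemon
vm=$(awk '/^VmSize:/ {print $2}' /proc/$$/status)    # virtual size, KB
rss=$(awk '/^VmRSS:/ {print $2}' /proc/$$/status)    # resident size, KB
echo "virtual: ${vm} KB, resident: ${rss} KB, gap: $((vm - rss)) KB"
```

`pmap -x <pid>` breaks the same total down per mapping, which makes the mapped data files easy to spot by path.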
>> I'm a little puzzled as to why I would get such long pauses without
>> swapping. I uncommented all the gc logging options in cassandra-env.sh to
>> try to see what is going on when the node freezes.
>>
>> Thanks
>> Kireet
>>
>> On Mon, Jul 2, 2012 at 9:51 PM, feedly team <feedly...@gmail.com> wrote:
>>
>>> Yeah I noticed the leap second problem and ran the suggested fix, but I
>>> have been facing these problems since before Saturday and still see the
>>> occasional failures after running the fix.
>>>
>>> Thanks.
>>>
>>> On Mon, Jul 2, 2012 at 11:17 AM, Marcus Both <mb...@terra.com.br> wrote:
>>>
>>>> Yeah! Look at that.
>>>> http://arstechnica.com/business/2012/07/one-day-later-the-leap-second-v-the-internet-scorecard/
>>>> I had the same problem. The solution was rebooting.
>>>>
>>>> On Mon, 2 Jul 2012 11:08:57 -0400
>>>> feedly team <feedly...@gmail.com> wrote:
>>>>
>>>> > Hello,
>>>> > I recently set up a 2 node cassandra cluster on dedicated hardware. In
>>>> > the logs there have been a lot of "InetAddress xxx is now dead" or UP
>>>> > messages. Comparing the log messages between the 2 nodes, they seem to
>>>> > coincide with extremely long ParNew collections. I have seen some of
>>>> > up to 50 seconds. The installation is pretty vanilla, I didn't change
>>>> > any settings and the machines don't seem particularly busy - cassandra
>>>> > is the only thing running on the machine with an 8GB heap. The machine
>>>> > has 64GB of RAM and CPU/IO usage looks pretty light. I do see a lot of
>>>> > 'Heap is xxx full. You may need to reduce memtable and/or cache sizes'
>>>> > messages. Would this help with the long ParNew collections? That
>>>> > message seems to be triggered on a full collection.
>>>>
>>>> --
>>>> Marcus Both
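For anyone hitting the same promotion-failure pauses: the gc-logging and CMS knobs discussed up-thread all live in cassandra-env.sh as extra JVM_OPTS lines. A sketch of what that could look like, using standard HotSpot (Sun/Oracle JDK 6) flag names; the values here are illustrative starting points to experiment with, not recommendations:

```shell
# Illustrative additions to cassandra-env.sh (HotSpot flag names;
# values are examples, not recommendations -- measure before and after).

# GC logging: shows "promotion failed" events and pause times.
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

# Start CMS earlier so the old gen has free (and less fragmented) space
# when ParNew needs to promote objects into it.
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

# A smaller young gen promotes less per ParNew cycle, at the cost of
# more frequent minor collections.
JVM_OPTS="$JVM_OPTS -Xmn400M"
```

Lowering CMSInitiatingOccupancyFraction trades more concurrent-mark cycles for headroom at promotion time, which is the usual first step when the gc log shows "promotion failed" entries.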