On Tue, Oct 2, 2012 at 8:51 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
> Thanks John and Kelly. It's nice to know we're not the only ones. :-)
>
> As I said, we'll be upgrading to 1.2 in the coming weeks so it's good to
> know that the memory issues might go away after that. It's not a
> showstopper for us, more of a curiosity and a concern that it might
> develop into something worse.
>
> I'll persist with etop and see if I can get it to run, and will report
> back.
>
> We're still using key filters in our MapReduce functions but we plan to move
> to 2i at the same time as upgrading to 1.2.
>
> The word "monitor" doesn't appear in any of our logs for the last 5 days.
> Just lots of:
>
> 2012-10-02 00:10:47.869 [error] <0.31890.1344> gen_fsm <0.31890.1344> in
> state wait_pipeline_shutdown terminated with reason: {sink_died,normal}
> 2012-10-02 00:10:47.909 [error] <0.31890.1344> CRASH REPORT Process
> <0.31890.1344> with 0 neighbours crashed with reason: {sink_died,normal}
> 2012-10-02 00:10:47.981 [error] <0.166.0> Supervisor riak_pipe_builder_sup
> had child undefined started with {riak_pipe_builder,start_link,undefined} at
> <0.31890.1344> exit with reason {sink_died,normal} in context
> child_terminated
>
> Thanks!
>

We had this same error before the upgrade. It's much less noisy now, but
it's the same thing: sink_died.
>
> On 02/10/12 15:55, Kelly McLaughlin wrote:
>>
>> John and Shane,
>>
>> I have been looking into some memory issues lately and I would be very
>> interested in more information about your particular problems. If either
>> of you are able to get some output from etop using the -sort memory
>> option when you are having elevated memory usage, it would be very
>> helpful to see. I know that you sometimes get the connection_lost message
>> when trying to use etop, but I have found that if you keep trying it may
>> succeed after a few attempts.
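>>
>> If riak-admin top keeps refusing to run, a standalone etop pointed at the
>> node may get through. A rough sketch (untested here; it assumes the
>> default node name and cookie from vm.args, so adjust both to match your
>> setup):
>>
>>     $ erl -hidden -name etop@127.0.0.1 -setcookie riak
>>     1> etop:start([{node, 'riak@127.0.0.1'}, {output, text},
>>                    {sort, memory}, {lines, 10}]).
>>
>> That should print the top processes sorted by memory every few seconds,
>> which is the output I'm after.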
>>
>> Are either of you using MapReduce? I see that John is using 2I. Shane, do
>> you also use 2I? Finally, do you notice a lot of messages to the console
>> or console log that have either the phrase 'monitor large_heap' or
>> 'monitor long_gc'?
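>>
>> (For context: those messages come from the Erlang VM's system_monitor
>> facility, which riak_sysmon hooks into. Roughly what gets armed under the
>> hood, with made-up illustrative thresholds:
>>
>>     1> erlang:system_monitor(self(), [{long_gc, 100},
>>                                       {large_heap, 1000000}]).
>>     2> flush().
>>     Shell got {monitor,<0.123.0>,long_gc,[...]}
>>
>> A flood of them usually means some process is churning through very large
>> heaps or long garbage collections.)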
>>
>> Kelly
>>
>> On Oct 2, 2012, at 6:11 AM, "John E. Vincent"
>> <lusis.org+riak-us...@gmail.com> wrote:
>>
>>> I would highly suggest you upgrade to 1.2 when possible. We were, up
>>> until recently, running on 1.4 and seeing the same problems you
>>> describe. Take a look at this graph:
>>>
>>> http://i.imgur.com/0RtsU.png
>>>
>>> That's just one of our nodes but all of them exhibited the same
>>> behavior. The falloffs are where we had to bounce riak.
>>>
>>> This is what one of our nodes looks like now and has looked like since
>>> the upgrade:
>>>
>>> http://i.imgur.com/pm7Nk.png
>>>
>>> The change was SO dramatic that I seriously thought /stats was broken.
>>> I've verified outside of Riak and inside. The memory usage change was
>>> very positive, though evidently there's still a memory leak.
>>>
>>> We're heavy 2i users. No multi backend.
>>>
>>> On Tue, Oct 2, 2012 at 4:08 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
>>>>
>>>> G'day!
>>>>
>>>> Just recently we've noticed memory usage in our Riak cluster constantly
>>>> increasing.
>>>>
>>>> The memory usage reported by the Riak stats "memory_total" parameter has
>>>> been less than 100MB for nearly a year but has recently increased to
>>>> over 1GB.
>>>>
>>>> If we restart the cluster, memory usage usually returns to what we would
>>>> call "normal", but after a week or so of stability the memory usage
>>>> starts gradually growing again. Sometimes after a growth spurt over a
>>>> few days the memory usage will plateau and be stable again for a week or
>>>> two, and then put on another growth spurt. The memory usage starts
>>>> increasing at the same moment on all 4 nodes.
>>>>
>>>> This graph [http://imagebin.org/230614] shows what I mean. The green
>>>> line shows the memory usage as reported by "memory_total" (left-hand
>>>> y-axis scale). The red line shows the memory used by Riak's beam.smp
>>>> process (right-hand y-axis scale).
>>>>
>>>> Also notice that the gradient of the recent growth seems to be
>>>> increasing compared to the memory increases we had in August.
>>>>
>>>> We might have just assumed that the memory usage was normal Riak
>>>> behaviour. Perhaps we have just tipped over some sort of internal buffer
>>>> or cache, causing some more memory to be allocated. However, whenever we
>>>> notice the memory usage increasing, it always coincides with the
>>>> "riak-admin top" command failing to run.
>>>>
>>>> We try to run "riak-admin top" to diagnose what is using the memory, but
>>>> it returns: "Output server crashed: connection_lost". If we restart the
>>>> cluster, the top command works fine (but, of course, there's nothing
>>>> interesting to see after a restart!).
>>>>
>>>> So our theory at the moment is that some sort of instability or race
>>>> condition is causing Riak to start consuming more and more memory. A
>>>> side effect of this instability is that the internal processes needed
>>>> for running the top command are not working correctly. The actual
>>>> functionality of Riak doesn't seem to be affected; our application is
>>>> running fine. We see a slight increase in "FSM Put" times and CPU usage
>>>> during the memory growth phases, but all other parameters we're
>>>> monitoring on the system seem unaffected.
>>>>
>>>> There's nothing abnormal in the logs. We get a lot of
>>>> "riak_pipe_builder_sup {sink_died,normal}" messages, but apparently they
>>>> can be ignored. The cluster is under constant load, so we would expect
>>>> to see either a gradual memory increase or a steady state, but not both.
>>>> Erlang process count, open file handles, etc. are stable.
>>>>
>>>> So I was wondering if anyone has seen similar behaviour before?
>>>> Is there anything else we can do to diagnose the problem? (One idea is
>>>> sketched after these questions.)
>>>> I'm accessing the stats URL once per minute, could that have any side
>>>> effects?
>>>> We'll be upgrading to Riak 1.2 and new hardware in the next few weeks so
>>>> should we just ignore it and hope it goes away?
>>>> Any other ideas?
>>>> Or is this just normal?
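>>>>
>>>> One more thing we could try ourselves: breaking memory_total down by
>>>> category from an attached shell. A sketch (node name is ours; as far as
>>>> I can tell, memory_total in /stats corresponds to erlang:memory(total)):
>>>>
>>>>     $ riak attach
>>>>     (riak@node1)1> erlang:memory().
>>>>     [{total,...},{processes,...},{system,...},{binary,...},{ets,...},...]
>>>>
>>>> If the growth shows up under binary or ets rather than processes, that
>>>> would at least narrow down where to look.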
>>>>
>>>> Riak config:
>>>> 4 VMware nodes
>>>> ring_creation_size, 256
>>>> n_val, 3
>>>> eleveldb backend:
>>>>   max_open_files, 20
>>>>   cache_size, 15728640
>>>> "riak_kv_version":"1.1.1",
>>>> "riak_core_version":"1.1.1",
>>>> "stdlib_version":"1.17.4",
>>>> "kernel_version":"2.14.4"
>>>> Erlang R14B03 (erts-5.8.4)
>>>>
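>>>> For completeness, those eleveldb settings sit in app.config roughly like
>>>> this (a trimmed sketch; data_root is a guess at the packaged default,
>>>> and as far as I understand both values apply per vnode):
>>>>
>>>>     {eleveldb, [
>>>>         {data_root, "/var/lib/riak/leveldb"},
>>>>         {max_open_files, 20},
>>>>         {cache_size, 15728640}  %% ~15MB
>>>>     ]},
>>>>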
>>>> Thanks!
>>>>
>>>> Shane.
>>>>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
