On Tue, Oct 2, 2012 at 8:51 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
> Thanks John and Kelly. It's nice to know we're not the only ones. :-)
>
> As I said, we'll be upgrading to 1.2 in the coming weeks so it's good to
> know that the memory issues might go away after that. It's not a
> showstopper for us, more of a curiosity and a concern that it might
> develop into something worse.
>
> I'll persist with etop and see if I can get it to run and will report
> back.
>
> We're still using key filters in our MapReduce functions but we plan to
> move to 2i at the same time as upgrading to 1.2.
>
> The word "monitor" doesn't appear in any of our logs for the last 5 days.
> Just lots of:
>
> 2012-10-02 00:10:47.869 [error] <0.31890.1344> gen_fsm <0.31890.1344> in
> state wait_pipeline_shutdown terminated with reason: {sink_died,normal}
> 2012-10-02 00:10:47.909 [error] <0.31890.1344> CRASH REPORT Process
> <0.31890.1344> with 0 neighbours crashed with reason: {sink_died,normal}
> 2012-10-02 00:10:47.981 [error] <0.166.0> Supervisor riak_pipe_builder_sup
> had child undefined started with {riak_pipe_builder,start_link,undefined}
> at <0.31890.1344> exit with reason {sink_died,normal} in context
> child_terminated
>
> Thanks!
>
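For anyone landing on this thread later: the {sink_died,normal} reports above come from riak_pipe, the subsystem that executes MapReduce, so they line up with the key-filter jobs mentioned. As a rough sketch of the planned key-filter-to-2i move, here is what a 2i lookup looks like from the riak-erlang-client once the cluster is on 1.2 (2i over protocol buffers was, if memory serves, a 1.2 addition). The host, port, bucket, and index names here are invented, and the exact return shape of get_index varies between client versions:

    %% Sketch only: assumes riakc (riak-erlang-client) is on the code path
    %% and a PB listener on localhost:8087. Names are for illustration.
    {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),

    %% Instead of a whole-bucket MapReduce with a key filter, ask the
    %% "status_bin" index for the matching keys directly:
    {ok, Keys} = riakc_pb_socket:get_index(Pid, <<"users">>,
                                           <<"status_bin">>, <<"active">>).

The win over key filters is that the index lookup touches only the matching entries, rather than folding over every key in the bucket the way a key filter does.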
We had this same error before the upgrade. It's much less noisy now, but it's the same thing - sink_died.

> On 02/10/12 15:55, Kelly McLaughlin wrote:
>>
>> John and Shane,
>>
>> I have been looking into some memory issues lately and I would be very
>> interested in more information about your particular problems. If either
>> of you is able to get some output from etop using the -sort memory
>> option when you are having elevated memory usage, it would be very
>> helpful to see. I know that sometimes you get the connection_lost
>> message when trying to use etop, but I have found that if you keep
>> trying it may succeed after a few attempts.
>>
>> Are either of you using MapReduce? I see that John is using 2i. Shane,
>> do you also use 2i? Finally, do you notice a lot of messages to the
>> console or console log that have either the phrase 'monitor large_heap'
>> or 'monitor long_gc'?
>>
>> Kelly
>>
>> On Oct 2, 2012, at 6:11 AM, "John E. Vincent"
>> <lusis.org+riak-us...@gmail.com> wrote:
>>
>>> I would highly suggest you upgrade to 1.2 when possible. We were, up
>>> until recently, running on 1.4 and seeing the same problems you
>>> describe. Take a look at this graph:
>>>
>>> http://i.imgur.com/0RtsU.png
>>>
>>> That's just one of our nodes, but all of them exhibited the same
>>> behavior. The falloffs are where we had to bounce Riak.
>>>
>>> This is what one of our nodes looks like now and has looked like since
>>> the upgrade:
>>>
>>> http://i.imgur.com/pm7Nk.png
>>>
>>> The change was SO dramatic that I seriously thought /stats was broken.
>>> I've verified outside of Riak and inside. The memory usage change was
>>> very positive. Evidently there's still a memory leak, even now.
>>>
>>> We're heavy 2i users. No multi backend.
>>>
>>> On Tue, Oct 2, 2012 at 4:08 AM, Shane McEwan <sh...@mcewan.id.au> wrote:
>>>>
>>>> G'day!
>>>>
>>>> Just recently we've noticed memory usage in our Riak cluster
>>>> constantly increasing.
>>>>
>>>> The memory usage reported by the Riak stats "memory_total" parameter
>>>> has been less than 100MB for nearly a year but has recently increased
>>>> to over 1GB.
>>>>
>>>> If we restart the cluster, memory usage usually returns to what we
>>>> would call "normal", but after a week or so of stability the memory
>>>> usage starts gradually growing again. Sometimes after a growth spurt
>>>> over a few days the memory usage will plateau and be stable again for
>>>> a week or two, and then put on another growth spurt. The memory usage
>>>> starts increasing at the same moment on all 4 nodes.
>>>>
>>>> This graph [http://imagebin.org/230614] shows what I mean. The green
>>>> line shows the memory usage as reported by "memory_total" (left-hand
>>>> y-axis scale). The red line shows the memory used by Riak's beam.smp
>>>> process (right-hand y-axis scale).
>>>>
>>>> Also notice that the gradient of the recent growth seems to be
>>>> increasing compared to the memory increases we had in August.
>>>>
>>>> We might have just assumed that the memory usage was normal Riak
>>>> behaviour. Perhaps we have just tipped over some sort of internal
>>>> buffer or cache and that causes some more memory to be allocated.
>>>> However, whenever we notice the memory usage increasing it always
>>>> coincides with the "riak-admin top" command failing to run.
>>>>
>>>> We try to run "riak-admin top" to diagnose what is using the memory,
>>>> but it returns: "Output server crashed: connection_lost".
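A workaround worth noting at this point in the thread: "riak-admin top" is essentially a wrapper around OTP's etop, so when it dies with connection_lost you can try driving etop yourself from a hidden node and keep retrying, as Kelly suggests. A minimal sketch; the node name and cookie are assumptions, so check vm.args for the real values:

    %% Start a hidden node on the same box (name/cookie are assumptions):
    %%   erl -hidden -name etop@127.0.0.1 -setcookie riak
    %% then point etop at the Riak node, sorted by memory:
    etop:start([{node, 'riak@127.0.0.1'},
                {sort, memory},
                {output, text},
                {interval, 10},
                {lines, 10}]).

If it connects, the per-process memory column should show whether a few huge processes or a large number of small ones are holding the heap.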
>>>> If we restart the cluster, the top command works fine (but, of
>>>> course, there's nothing interesting to see after a restart!).
>>>>
>>>> So our theory at the moment is that some sort of instability or race
>>>> condition is causing Riak to consume more and more memory. A side
>>>> effect of this instability is that the internal processes needed for
>>>> running the top command are not working correctly. The actual
>>>> functionality of Riak doesn't seem to be affected. Our application is
>>>> running fine. We see a slight increase in "FSM Put" times and CPU
>>>> usage during the memory growth phases, but all other parameters we're
>>>> monitoring on the system seem unaffected.
>>>>
>>>> There's nothing abnormal in the logs. We get a lot of
>>>> "riak_pipe_builder_sup {sink_died,normal}" messages, but apparently
>>>> they can be ignored. The cluster is under constant load, so we would
>>>> expect to see either a gradual memory increase or a steady state, but
>>>> not both. Erlang process count, open file handles, etc. are stable.
>>>>
>>>> So I was wondering if anyone has seen similar behaviour before?
>>>> Is there anything else we can do to diagnose the problem?
>>>> I'm accessing the stats URL once per minute; could that have any side
>>>> effects?
>>>> We'll be upgrading to Riak 1.2 and new hardware in the next few weeks,
>>>> so should we just ignore it and hope it goes away?
>>>> Any other ideas?
>>>> Or is this just normal?
>>>>
>>>> Riak config:
>>>>   4 VMware nodes
>>>>   ring_creation_size, 256
>>>>   n_val, 3
>>>>   eleveldb backend:
>>>>     max_open_files, 20
>>>>     cache_size, 15728640
>>>>   "riak_kv_version":"1.1.1"
>>>>   "riak_core_version":"1.1.1"
>>>>   "stdlib_version":"1.17.4"
>>>>   "kernel_version":"2.14.4"
>>>>   Erlang R14B03 (erts-5.8.4)
>>>>
>>>> Thanks!
>>>>
>>>> Shane.
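A closing note on the config above, since the numbers invite a quick sanity check: with ring_creation_size 256 spread over 4 nodes, each node hosts 64 vnodes, and in the eleveldb backend of that era cache_size applied per vnode (per open leveldb instance). That puts the block-cache ceiling alone near the 1GB figure reported, although that memory lives on the C side of the NIF and so shows up in beam.smp's RSS (the red line) rather than necessarily in memory_total. A back-of-envelope sketch, assuming the per-vnode interpretation holds for 1.1.1:

    %% Rough ceiling for the leveldb block cache on one node.
    %% Assumption: cache_size applies per vnode (per open leveldb instance).
    VnodesPerNode = 256 div 4,                  % ring size / nodes = 64
    CacheBytes    = 15728640,                   % cache_size = 15 MiB
    Ceiling       = VnodesPerNode * CacheBytes. % 1006632960 bytes, 960 MiB

Separately, for the VM-side growth that memory_total tracks, running erlang:memory() from "riak attach" breaks the total down into processes, binary, ets, and so on, which narrows down where a leak lives.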