Thanks much, Matthew. Yes, the server is low on memory since it is only used for development right now - I'm using an AWS micro instance, so 1 GB RAM and 1 vCPU.
Thanks for the tip - let me try moving the dataset to a larger instance and see how that works. Beyond reducing the memory footprint in dev, my concern was more about reacting to a possible production scenario where the db stops responding due to memory overload. Understood now that moving to a larger instance should be possible. Thanks again.

On Tue, Jul 12, 2016 at 12:26 PM, Matthew Von-Maszewski <matth...@basho.com> wrote:

> It would be helpful if you described the physical characteristics of the
> servers: memory size, logical cpu count, etc.
>
> Google created leveldb to be highly reliable in the face of crashes. If
> it is not restarting, that suggests to me that you have a low memory
> condition that is not able to load leveldb's MANIFEST file. That is easily
> fixed by moving the dataset to a machine with larger memory.
>
> There is also a special flag to reduce Riak's leveldb memory foot print
> during development work. The setting reduces the leveldb performance, but
> lets you run with less memory.
>
> In riak.conf, set:
>
> leveldb.limited_developer_mem = true
>
> Matthew
>
> > On Jul 12, 2016, at 11:56 AM, Vikram Lalit <vikramla...@gmail.com> wrote:
> >
> > Hi - I've been testing a Riak cluster (of 3 nodes) with an ejabberd
> > messaging cluster in front of it that writes data to the Riak nodes.
> > Whilst load testing the platform (by creating 0.5 million ejabberd users
> > via Tsung), I found that the Riak nodes suddenly crashed. My question is
> > how do we recover from such a situation if it were to occur in
> > production?
> >
> > To provide further context / details, the leveldb log files storing the
> > data suddenly became too huge, thus making the AWS Riak instances not
> > able to load them in memory anymore. So we get a core dump if 'riak
> > start' is fired on those instances. I had an n_val = 2, and all 3 nodes
> > went down almost simultaneously, so in such a scenario, we cannot even
> > rely on a 2nd copy of the data. One way to of course prevent it in the
> > first place would be to use auto-scaling, but I'm wondering is there a
> > ex post facto / post the event recovery that can be performed in such a
> > scenario? Is it possible to simply copy the leveldb data to a larger
> > memory instance, or to curtail the data further to allow loading in the
> > same instance?
> >
> > Appreciate if you can provide inputs - a tad concerned as to how we
> > could recover from such a situation if it were to happen in production
> > (apart from leveraging auto-scaling as a preventive measure).
> >
> > Thanks!
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com