Re: Servers keep dying. How to understand why?

Christian Dahlqvist Tue, 14 May 2013 02:53:04 -0700

Hi Julien,

The node appears to have crashed because it was unable to allocate memory. How 
are you accessing your data? Are you running any key listing or large MapReduce 
jobs that could use up a lot of memory?


To make sure you are resolving siblings efficiently, I would recommend 
monitoring the statistics in Riak 
(http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/). 
Specifically, look at the node_get_fsm_objsize_* and node_get_fsm_siblings_* 
statistics to identify objects that are very large or have lots of 
siblings.
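
As a rough illustration of that kind of check, here is a minimal Python sketch. It assumes the node exposes its statistics as JSON over HTTP at /stats on the default port 8098; the function names and the two thresholds are hypothetical and only there for illustration, so pick limits that make sense for your data.

```python
import json
from urllib.request import urlopen

# Statistic prefixes named in the message above.
PREFIXES = ("node_get_fsm_objsize_", "node_get_fsm_siblings_")


def select_get_fsm_stats(stats):
    """Keep only the object-size and sibling-count statistics."""
    return {k: v for k, v in stats.items() if k.startswith(PREFIXES)}


def find_warnings(stats, max_objsize_bytes=1_000_000, max_siblings=10):
    """Return (name, value, limit) for each statistic above its limit.

    Both limits are illustrative assumptions, not recommended values.
    """
    warnings = []
    for name, value in sorted(select_get_fsm_stats(stats).items()):
        if not isinstance(value, (int, float)):
            continue  # skip anything that is not a number
        limit = max_objsize_bytes if "objsize" in name else max_siblings
        if value > limit:
            warnings.append((name, value, limit))
    return warnings


def fetch_stats(base_url="http://127.0.0.1:8098"):
    """Fetch the statistics of one node (host and port are assumptions)."""
    with urlopen(base_url + "/stats", timeout=10) as resp:
        return json.load(resp)
```

Calling find_warnings(fetch_stats()) against each node at regular intervals should show whether object sizes or sibling counts start climbing before a node runs out of memory.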

Best regards,

Christian



On 13 May 2013, at 16:44, Julien Genestoux <julien.genest...@gmail.com> wrote:

> Christian, All,
> 
> Bad news: my laptop is completely dead. Good news: I have a new one, and it's 
> now fully operational (backups FTW!).
> 
> The log files have finally been uploaded: 
> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
> 
> I have attached our config to this mail.
> 
> The machine is a virtual Xen instance at Linode with 4GB of memory. I know 
> it's probably not the very best setup, but 1) we're on a budget and 2) we 
> assumed that would fit our needs quite well.
> 
> To give a bit more detail: initially we did not use allow_mult and 
> things worked fine for a couple of days. As soon as we enabled 
> allow_mult, we were not able to run the cluster for more than 5 hours without 
> seeing failing nodes, which is why I'm convinced we must be doing something 
> wrong. The question is: what? 
> 
> Thanks
> 
> 
> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <christ...@basho.com> 
> wrote:
> Hi Julien,
> 
> I was not able to access the logs based on the link you provided.
> 
> Could you please attach a copy of your app.config file so we can get a better 
> understanding of the configuration of your cluster? Also, what is the 
> specification of the machines in the cluster?
> 
> How much data do you have in the cluster and how are you querying it?
> 
> Best regards,
> 
> Christian
> 
> 
> 
> On 12 May 2013, at 19:11, Julien Genestoux <julien.genest...@gmail.com> wrote:
> 
>> Hi,
>> 
>> We are running a cluster of 5 servers, or at least trying to, because nodes 
>> seem to be dying 'randomly'
>> without us knowing any reason why. We don't have a great Erlang guy aboard, 
>> and the error logs are not
>> that verbose.
>> So I've just packed the whole log directory into a .tgz, and I was hoping 
>> somebody could give us a clue.
>> It's there: https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz 
>> (might not be fully uploaded to dropbox yet!)
>> 
>> I've looked at the list archive, and some people said their servers were dying 
>> because some objects had grown too big for the node to allocate memory for 
>> them. Maybe that's what we're seeing?
>> 
>> As one of our buckets is set with allow_mult, I am tempted to think that 
>> some object's size may be exploding.
>> However, we do actually try to resolve conflicts in our code. Any idea how 
>> to confirm and then debug that we 
>> have an issue there?
>> 
>> 
>> Thanks a lot for your precious help...
>> 
>> Julien
>> 
>> 
>> 
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> <app.config>

