Hi Julien,

I believe from an earlier email that you are using bitcask as a backend. Bitcask works with immutable, append-only files, so data that is deleted or overwritten stays in the files and takes up disk space until the file is closed and can be merged. The maximum file size is 2GB by default, but this and the other parameters that determine how and when closed files are merged can be tuned. Please see http://docs.basho.com/riak/latest/tutorials/choosing-a-backend/Bitcask/ for further details.
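To illustrate, a bitcask section in app.config tuned along these lines would roll data files sooner and let merges kick in earlier (the parameter names below are the standard bitcask settings described on the page above, but treat the specific values as a sketch to adapt to your workload rather than a recommendation):

    {bitcask, [
        %% keep your existing data_root setting
        {data_root, "/var/lib/riak/bitcask"},
        %% roll over to a new data file at 512MB instead of the 2GB default,
        %% so closed files become eligible for merging sooner
        {max_file_size, 536870912},
        %% trigger a merge when 40% of a file's keys are dead (default 60),
        %% or when dead bytes in a single file reach 256MB (default 512MB)
        {frag_merge_trigger, 40},
        {dead_bytes_merge_trigger, 268435456},
        %% merges cost I/O, so they can be restricted to a quiet window,
        %% e.g. between 01:00 and 05:00
        {merge_window, {1, 5}}
    ]}

Given how many deletes you are doing, the dead-bytes trigger is probably the one worth experimenting with first.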
If you wish to reduce the amount of disk space used, you may want to set a smaller max file size in order to allow merging to occur more frequently.

Best regards,

Christian

On 17 May 2013, at 13:06, Julien Genestoux <julien.genest...@gmail.com> wrote:

> Christian, All,
>
> Our servers still have not died... but we see another strange behavior: our
> data store needs a lot more space than what we expect.
>
> Based on the status command, the average size of our objects
> (node_get_fsm_objsize_mean) is about 1500 bytes.
> We have 2 buckets, but both of them have an n value of 3.
>
> When we count the values in each of the buckets (using the following
> mapreduce):
> curl -XPOST http://192.168.134.42:8098/mapred -H 'Content-Type: application/json' -d '{"inputs":"BUCKET","query":[{"reduce":{"language":"erlang","module":"riak_kv_mapreduce","function":"reduce_count_inputs","arg":{"do_prereduce":true}}}],"timeout": 100000}'
>
> We get 194556 for one and 1572661 for the other (these numbers are
> consistent with what we expected), so if our math is right, we need a
> total on disk of:
> 3 * (194556 + 1572661) * 1500 bytes = 7.4 GB
>
> Now, though, when I inspect the storage actually occupied on our hard drives,
> we see something weird (this is the du output):
> riak1. 2802888 /var/lib/riak
> riak2. 4159976 /var/lib/riak
> riak5. 4603312 /var/lib/riak
> riak3. 4915180 /var/lib/riak
> riak4. 37466784 /var/lib/riak
>
> As you can see, not all nodes have the same "size". What's even weirder is
> that up until a couple of hours ago, they were all growing "together" and
> close to what the riak4 node shows. Could this be due to the "delete"
> policy? It turns out that we delete a lot of items (is there a way to get
> the list of commands sent to a node/cluster?)
>
> Thanks!
>
>
> On Wed, May 15, 2013 at 11:29 PM, Julien Genestoux
> <julien.genest...@gmail.com> wrote:
> Christian, all,
>
> Not sure what kind of magic happened, but no server has died in the last
> 2 days... and counting.
> We have not changed a single line of code, which is quite odd...
> I'm still monitoring everything and hope (sic!) for a failure soon so we
> can fix the problem!
>
> Thanks
>
>
> --
> Got a blog? Make following it simple: https://www.subtome.com/
>
> Julien Genestoux,
> http://twitter.com/julien51
>
> +1 (415) 830 6574
> +33 (0)9 70 44 76 29
>
>
> On Tue, May 14, 2013 at 12:31 PM, Julien Genestoux
> <julien.genest...@gmail.com> wrote:
> Thanks Christian.
> We do indeed use mapreduce, but it's a fairly simple function: we retrieve
> a first object whose value is an array of at most 10 ids and then we fetch
> all the values for these 10 ids.
> However, this mapreduce job is quite rare (maybe 10 times a day at most at
> this point...), so I don't think that's our issue.
> I'll try to run the cluster without any calls to it to see if that's
> better, but I'd be very surprised. Also, we were already doing this even
> before we allowed for multiple values, and the cluster was stable back then.
> We do not do key listing or anything like that.
>
> I'll try looking at the statistics too.
>
> Thanks,
>
>
> On Tue, May 14, 2013 at 11:50 AM, Christian Dahlqvist <christ...@basho.com>
> wrote:
> Hi Julien,
>
> The node appears to have crashed due to an inability to allocate memory.
> How are you accessing your data? Are you running any key listing or large
> MapReduce jobs that could use up a lot of memory?
>
> In order to ensure that you are efficiently resolving siblings, I would
> recommend you monitor the statistics in Riak
> (http://docs.basho.com/riak/latest/cookbooks/Statistics-and-Monitoring/).
> Specifically, look at the node_get_fsm_objsize_* and node_get_fsm_siblings_*
> statistics in order to identify objects that are very large or have lots of
> siblings.
>
> Best regards,
>
> Christian
>
>
> On 13 May 2013, at 16:44, Julien Genestoux <julien.genest...@gmail.com> wrote:
>
>> Christian, All,
>>
>> Bad news: my laptop is completely dead. Good news: I have a new one, and
>> it's now fully operational (backups FTW!).
>>
>> The log files have finally been uploaded:
>> https://www.dropbox.com/s/j7l3lniu0wogu29/riak-died.tar.gz
>>
>> I have attached our config to this mail.
>>
>> The machine is a virtual Xen instance at Linode with 4GB of memory. I know
>> it's probably not the very best setup, but 1) we're on a budget and 2) we
>> assumed it would fit our needs quite well.
>>
>> Just to put things in more detail: initially we did not use allow_mult and
>> things worked out fine for a couple of days. As soon as we enabled
>> allow_mult, we were not able to run the cluster for more than 5 hours
>> without seeing failing nodes, which is why I'm convinced we must be doing
>> something wrong. The question is: what?
>>
>> Thanks
>>
>>
>> On Sun, May 12, 2013 at 8:07 PM, Christian Dahlqvist <christ...@basho.com>
>> wrote:
>> Hi Julien,
>>
>> I was not able to access the logs via the link you provided.
>>
>> Could you please attach a copy of your app.config file so we can get a
>> better understanding of the configuration of your cluster? Also, what is
>> the specification of the machines in the cluster?
>>
>> How much data do you have in the cluster and how are you querying it?
>>
>> Best regards,
>>
>> Christian
>>
>>
>> On 12 May 2013, at 19:11, Julien Genestoux <julien.genest...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> We are running a cluster of 5 servers, or at least trying to, because
>>> nodes seem to be dying 'randomly' without us knowing any reason why. We
>>> don't have a great Erlang guy aboard, and the error logs are not that
>>> verbose.
>>> So I've just .tgz'd the whole log directory and I was hoping somebody
>>> could give us a clue.
>>> It's here: https://www.dropbox.com/s/z9ezv0qlxgfhcyq/riak-died.tar.gz
>>> (might not be fully uploaded to dropbox yet!)
>>>
>>> I've looked at the archive, and some people said their server was dying
>>> because some object's size was just too big to fit in the available
>>> memory. Maybe that's what we're seeing?
>>>
>>> As one of our buckets is set with allow_mult, I am tempted to think that
>>> some object's size may be exploding.
>>> However, we do actually try to resolve conflicts in our code. Any idea
>>> how to confirm and then debug that we have an issue there?
>>>
>>>
>>> Thanks a lot for your precious help...
>>>
>>> Julien
>>>
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>
>> <app.config>