Re: Understanding read_repairs

Russell Brown Fri, 01 Mar 2013 09:47:48 -0800

On 1 Mar 2013, at 17:39, Belai Beshah <belai.bes...@nwgeo.com> wrote:


> Nothing fancy really; the set method throws an exception 
> "com.basho.riak.client.RiakRetryFailedException: java.io.EOFException". I tried 
> to find anything in the error or console logs that could explain it, but 
> found nothing.

Some questions:

Are you using the PB client? Do you have anything in your riak logs that points 
at a pb socket crash? What version of the RJC are you using, please?

Cheers

Russell

> 
> ________________________________________
> From: Kresten Krab Thorup [k...@trifork.com]
> Sent: Friday, March 01, 2013 5:40 AM
> To: Belai Beshah
> Cc: Jared Morrow; riak-users@lists.basho.com; Russell Brown
> Subject: Re: Understanding read_repairs
> 
> Interesting. What does the failure look like?
> 
> Kresten
> 
> On Feb 27, 2013, at 11:25 PM, Belai Beshah <belai.bes...@nwgeo.com> wrote:
> 
> I see my post was not clear: the 0.1% is a get/set failure rate, not a 
> slowdown. We would have been OK with a slow response, but a failed response 
> from AAE is not something we can tolerate. Since the Java client does 3 
> retries by default, I didn't see any point in adding more retries to see if 
> that would help.
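
As a rough illustration of the retry behavior described above, here is a minimal sketch of a bounded retry loop. It is not the Riak Java client's actual code: RetrySketch and withRetries are hypothetical names, and treating "3 retries" as 3 total attempts is an assumption. It only shows why a persistent low-level error such as an EOFException surfaces wrapped in a "retry failed" style exception once the attempts run out.

```java
import java.io.EOFException;
import java.util.concurrent.Callable;

class RetrySketch {
    // Calls `op` up to `maxAttempts` times and returns the first successful result.
    // If every attempt fails, the last failure is wrapped and rethrown.
    static <T> T withRetries(Callable<T> op, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated operation: fails twice with EOFException, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new EOFException("simulated socket close");
            }
            return "stored";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

If the underlying cause persists across all attempts (for example a socket that keeps closing), raising the attempt count only delays the same wrapped failure, which matches the reasoning in the message above.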
> 
> ________________________________
> From: Jared Morrow [ja...@basho.com]
> Sent: Wednesday, February 27, 2013 2:21 PM
> To: Belai Beshah
> Cc: Russell Brown; riak-users@lists.basho.com
> Subject: Re: Understanding read_repairs
> 
> Belai,
> 
> Active Anti-Entropy is doing work building trees and checking data, so it 
> will slow down gets/puts slightly.  If you can't accept the slight 
> performance hit, disabling it is the right choice.  In our testing, if you 
> use eLevelDB, 1.3.0 with AAE enabled is faster than 1.2.1 without AAE in most 
> cases due to the other speedups added to eLevelDB in 1.3.0.  Since Bitcask 
> runs at about the limit of what a filesystem can handle, AAE definitely shows 
> a slight performance hit since it is accessing the filesystem as well.
> 
> Glad to hear the patch solved your other issues.
> 
> -Jared
> 
> 
> 
> On Wed, Feb 27, 2013 at 1:13 PM, Belai Beshah <belai.bes...@nwgeo.com> wrote:
> The patch worked well on 1.3, no more continuous read repairs. However, we 
> started seeing failures in about 0.1% of set/get operations, which did not 
> happen in the 1.2 release. Since this happens even without the patch on a 
> clean 1.3 install, we narrowed it down to Active Anti-Entropy, since it looks 
> like it is always actively fixing data. Maybe it is our pattern of writing 
> and reading back immediately, or the fact that we have only a single 4TB 
> disk behind each node and they can't keep up. With Active Anti-Entropy 
> turned off, all our tests passed and performance returned to 1.2 levels 
> without any read repairs. For now we are happy to continue our tests with 
> Active Anti-Entropy turned off, but it would be great if we could get some 
> pointers from the experts that could explain the behavior we saw. Thank you 
> guys for the help.
> 
> ________________________________
> From: Jared Morrow [ja...@basho.com]
> Sent: Friday, February 22, 2013 11:56 AM
> To: Belai Beshah
> Cc: Russell Brown; riak-users@lists.basho.com
> Subject: Re: Understanding read_repairs
> 
> 
> Belai,
> 
> One other option is to use our "basho-patches" functionality. We use it to 
> run new code on current installations where sending a new .beam file is 
> easier than remaking the packages or compiling from source. On your Ubuntu 
> system using our packages, the folder should be in 
> /usr/lib/riak/lib/basho-patches.
> 
> To do this you just need the one file changed in the PR pointed to by Russell.
> 
> Here are the steps to make that happen:
> 
> *   Install Erlang R15B01: 
> http://docs.basho.com/riak/latest/tutorials/installation/Installing-Erlang/
> *   Get riak_kv: git clone git://github.com/basho/riak_kv.git
> *   Compile riak_kv with just 'make'
> *   Copy the resulting .beam file from the ebin folder to the machines that 
> need the new file: scp ebin/riak_kv_vnode.beam 
> user@myriaknode:/usr/lib/riak/lib/basho-patches
> *   Stop and restart each node, one at a time
> *   If you want to convince yourself you are using the new code, you can do a 
> 'riak attach' to attach to the node and run code:which('riak_kv_vnode'). 
> (Don't forget the '.' at the end)
> 
> For example on my dev install here is the command before the file is in 
> basho-patches:
> 
> (dev2@127.0.0.1)1> code:which('riak_kv_vnode').
> ".../lib/riak_kv-1.3.0/ebin/riak_kv_vnode.beam"
> 
> Here is the command after I put the .beam in the basho-patches directory:
> 
> (dev2@127.0.0.1)1> code:which('riak_kv_vnode').
> ".../lib/basho-patches/riak_kv_vnode.beam"
> 
> Notice the path of the code changed from .../riak_kv-1.3.0/... to 
> .../basho-patches/...
> 
> That might seem like a lot of work, but it is a really handy and useful 
> trick/skill that you might use quite a bit down the road.
> 
> Hope that helps,
> Jared
> 
> 
> On Fri, Feb 22, 2013 at 10:25 AM, Belai Beshah <belai.bes...@nwgeo.com> wrote:
> Thanks Russell, that looks like exactly the problem we saw. I have never built 
> Riak from source before but I will give it a try this weekend.
> 
> ________________________________________
> From: Russell Brown [russell.br...@me.com]
> Sent: Friday, February 22, 2013 1:24 AM
> To: Belai Beshah
> Cc: riak-users@lists.basho.com
> Subject: Re: Understanding read_repairs
> 
> Hi,
> Thanks for trying Riak.
> 
> On 21 Feb 2013, at 23:48, Belai Beshah <belai.bes...@nwgeo.com> wrote:
> 
>> Hi All,
>> 
>> We are evaluating Riak to see if it can be used to cache large blobs of 
>> data. Here is our test cluster setup:
>> 
>>      • six dedicated Ubuntu 12.04 LTS nodes, each with an 8-core 2.6 GHz CPU, 
>> 32 GB RAM, 3.6T disk
>>      • {pb_backlog, 64},
>>      • {ring_creation_size, 256},
>>      • {default_bucket_props, [{n_val, 2}, 
>> {allow_mult,false},{last_write_wins,true}]},
>>      • using bitcask as the backend
>> 
>> Everything else is default except the above. There is an HAProxy load 
>> balancer in front of the nodes that the clients talk to, configured 
>> according to the Basho wiki. Due to the nature of the application we are 
>> integrating, we do about 1200 writes/s of approximately 40-50KB each and 
>> read them back almost immediately. We noticed a lot of read repairs, and 
>> since that was one of the things that could indicate a performance problem 
>> we got worried. So we wrote a simple Java client application that simulates 
>> our use case. The test program is dead simple:
>>      • generate keys using random UUID and value using Apache commons 
>> RandomStringUtils
>>      • create a thread pool of 5 and store key/value using “bucket.store()”
>>      • read the values back using “bucket.fetch()” multiple times
>> I could provide the spike code if needed. What we noticed is that we get a 
>> lot of read repairs all over the place. We even made it use a single thread 
>> to read/write, played with the write/read quorum, and even put a delay of 5 
>> minutes after the writes before the reads start, to give the cluster time 
>> to become eventually consistent. Nothing helps; we always see a lot of read 
>> repairs, sometimes as many as the number of inserts.
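
For readers who want to reproduce the shape of the test described above, here is a minimal sketch under stated assumptions. The original spike code was not posted, so this is not it. BlobStore and InMemoryStore are hypothetical stand-ins for a Riak bucket (they are not the Riak Java client API), and a small helper replaces Apache Commons RandomStringUtils so the example has no dependencies. Against a real cluster, the store and fetch methods would delegate to the client's bucket operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a bucket; this is NOT the Riak Java client API.
interface BlobStore {
    void store(String key, String value) throws Exception;
    String fetch(String key) throws Exception;
}

// In-memory implementation so the sketch runs without a cluster.
class InMemoryStore implements BlobStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    @Override public void store(String key, String value) { data.put(key, value); }
    @Override public String fetch(String key) { return data.get(key); }
}

class WriteThenReadSpike {
    private static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // Random alphanumeric payload (replaces Apache Commons RandomStringUtils here).
    static String randomValue(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // One write followed by `readsPerKey` reads; returns how many operations failed.
    static int writeThenRead(BlobStore store, int valueSize, int readsPerKey) {
        String key = UUID.randomUUID().toString();
        String value = randomValue(new Random(), valueSize);
        try {
            store.store(key, value);
        } catch (Exception e) {
            return 1; // the write failed, so the reads are skipped
        }
        int failures = 0;
        for (int i = 0; i < readsPerKey; i++) {
            try {
                if (!value.equals(store.fetch(key))) failures++;
            } catch (Exception e) {
                failures++;
            }
        }
        return failures;
    }

    // Runs `count` write-then-read tasks on a fixed pool and sums the failures.
    static int run(BlobStore store, int count, int valueSize, int threads, int readsPerKey) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Callable<Integer> task = () -> writeThenRead(store, valueSize, readsPerKey);
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                results.add(pool.submit(task));
            }
            int failures = 0;
            for (Future<Integer> result : results) {
                failures += result.get();
            }
            return failures;
        } catch (Exception e) {
            throw new RuntimeException("test run aborted", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 1000 blobs of about 45 KB, 5 worker threads, 3 reads per key.
        int failures = run(new InMemoryStore(), 1000, 45 * 1024, 5, 3);
        System.out.println("failed operations: " + failures);
    }
}
```

The sketch only counts failed or mismatched operations on the client side. Read repairs themselves are not visible to the client, so they would still have to be observed from the node statistics.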
> 
> 
> It sounds like you are experiencing this bug 
> https://github.com/basho/riak_kv/pull/334
> 
> It is fixed in master, but it doesn't look like it made it into 1.3.0. If 
> you're ok with building from source, I tried it and a patch from 
> 8895d2877576af2441bee755028df1a6cf2174c7 goes cleanly onto 1.3.0.
> 
> Cheers
> 
> Russell
> 
> 
>> The good thing is that in all of these tests we have not seen any read 
>> failures. Performance is also not bad: a few maximums here and there that 
>> we don't like, but 90% looks good. Even when we killed a node, the reads 
>> were still successful.
>> 
>> We are wondering what the expected ratio of read repairs is, what a 
>> reasonable time is for the cluster to no longer resort to read_repair to 
>> fulfill a read request, and whether there is something we are missing in 
>> our setup.
>> 
>> Thanks
>> Belai
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


