Re: Rebalancing issue -- failure to hand-off partitions

2013-04-03 Thread Evan Vigil-McClanahan
You may wish to make the memory limit a little smaller; that means ~20GB per node, which might put undue memory pressure on leveldb. I'd also recommend setting {max_open_files, 400} in the eleveldb section (and maybe tuning the write buffer back down), as that's important for high quantile latencie

Re: Rebalancing issue -- failure to hand-off partitions

2013-04-03 Thread Evan Vigil-McClanahan
How much data do you have in each partition? Are you running leveldb or bitcask? If the former, what does your eleveldb config look like? On Wed, Apr 3, 2013 at 6:26 AM, Giri Iyengar wrote: > Evan, > > I tried re-introducing the TTL. It reverts back to the issue of vnodes not > successfully tra

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-29 Thread Giri Iyengar
Evan, As recommended by you, I disabled the TTL on the memory backends and did a rolling restart of the cluster. Now, things are rebalancing quite nicely. Do you think I can turn the TTL back on once the rebalancing completes? I'd like to ensure that the vnodes in memory don't keep growing forever

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-29 Thread Evan Vigil-McClanahan
That's an interesting result. Once it's fully rebalanced, I'd turn it back on and see if the fallback handoffs still fail. If they do, I'd recommend using memory limits, rather than TTL to limit growth (also, remember that memory limits are *per vnode*, rather than per node). They're slower, but
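Because the memory limit applies per vnode, the per-node footprint depends on how many vnodes each node owns. A rough sketch of that arithmetic follows; the ring size of 64 and the 1024 MB limit are assumed values for illustration only (the thread does not state them), and `memory_per_node_mb` is a hypothetical helper, not part of Riak.

```python
def memory_per_node_mb(max_memory_mb, ring_size, node_count):
    """Rough upper bound on memory-backend usage per physical node.

    max_memory_mb is the per-vnode limit; each node owns roughly
    ring_size / node_count vnodes (rounded up here to be pessimistic).
    """
    vnodes_per_node = -(-ring_size // node_count)  # ceiling division
    return max_memory_mb * vnodes_per_node


# Assumed values: 6 nodes as in the thread, ring size and limit are guesses.
print(memory_per_node_mb(max_memory_mb=1024, ring_size=64, node_count=6))
```

With a multi-backend setup that defines more than one memory backend, the same estimate would have to be repeated for each backend and summed.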

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Giri Iyengar
Evan, This has been happening for a while (about 3.5 weeks now), even prior to our upgrade to 1.3. -giri On Thu, Mar 28, 2013 at 6:36 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote: > No. AAE is unrelated to the handoff subsystem. I am not familiar > enough with the lowest level

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Evan Vigil-McClanahan
No. AAE is unrelated to the handoff subsystem. I am not familiar enough with its lowest-level workings to know if it'd reproduce the TTL stuff on nodes that don't have it. I am not totally sure about your timeline here. When did you start seeing these errors, before or after your

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Giri Iyengar
Evan, All nodes have been restarted (more than once, in fact) after the config changes. Using riak-admin aae-status, I noticed that the anti-entropy repair is still proceeding across the cluster. It has been less than 24 hours since I upgraded to 1.3 and maybe I have to wait till the first complet

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Evan Vigil-McClanahan
Giri, if all of the nodes are using identical app.config files (including the joining node) and have been restarted since those files changed, it may be some other, related issue. On Thu, Mar 28, 2013 at 2:46 PM, Giri Iyengar wrote: > Evan, > > I reconfirmed that all the servers are using identi

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Giri Iyengar
Evan, I reconfirmed that all the servers are using identical app.configs. They all use the multi-backend schema. Are you saying that some of the vnodes are in the memory backend on one physical node and in the eleveldb backend on another physical node? If so, how can I fix the offending vnodes? Thanks, -gir

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Evan Vigil-McClanahan
It would if some of the nodes weren't migrated to the new multi-backend schema; if a memory node was trying to hand off to an eleveldb-backed node, you'd see this. On Thu, Mar 28, 2013 at 2:05 PM, Giri Iyengar wrote: > Evan, > > I verified that all of the memory backends have the same ttl settings

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Giri Iyengar
Evan, I verified that all of the memory backends have the same ttl settings and have done rolling restarts but it doesn't seem to make a difference. One thing to note though -- I remember this problem starting roughly around the time I migrated a bucket from being backed by leveldb to being backed

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Evan Vigil-McClanahan
Giri, I've seen similar issues in the past when someone was adjusting their ttl setting on the memory backend. Because one memory backend has it and the other does not, it fails on handoff. The solution then was to make sure that all memory backend settings are the same and then do a rolling res
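One way to check that every node carries the same memory backend settings is to collect them per node and diff them. The sketch below assumes the settings have already been extracted from each node's app.config into plain dictionaries; the node names, the values, and the `find_mismatched_settings` helper are all hypothetical.

```python
def find_mismatched_settings(settings_by_node):
    """Return {key: {node: value}} for every key whose value differs across nodes.

    A key that is missing on a node is reported with the value None.
    """
    all_keys = set()
    for settings in settings_by_node.values():
        all_keys.update(settings)

    mismatches = {}
    for key in sorted(all_keys):
        values = {node: s.get(key) for node, s in settings_by_node.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical example: one node has a ttl set, the other does not.
nodes = {
    "riak1": {"ttl": 86400, "max_memory": 4096},
    "riak2": {"max_memory": 4096},
}
print(find_mismatched_settings(nodes))
```

An empty result means the collected settings agree; any key that appears in the result points at a node to fix before the rolling restart.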

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Giri Iyengar
Godefroy: Thanks. Your email exchange on the mailing list was what prompted me to consider switching to Riak 1.3. I do see repair messages in the console logs and so some healing is happening. However, there are a bunch of hinted handoffs and ownership handoffs that are simply not proceeding becau

Re: Rebalancing issue -- failure to hand-off partitions

2013-03-28 Thread Godefroy de Compreignac
I have exactly the same problem with my cluster. If anyone knows what those errors mean... :-) Godefroy 2013/3/28 Giri Iyengar > Hello, > > We are running a 6-node Riak 1.3.0 cluster in production. We recently > upgraded to 1.3. Prior to this, we were running Riak 1.2 on the same 6-node > clus