Evan,

As you recommended, I disabled the TTL on the memory backends and did a
rolling restart of the cluster. Things are now rebalancing quite nicely.
Do you think I can turn the TTL back on once the rebalancing completes?
I'd like to ensure that the vnodes in memory don't keep growing forever.
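
For reference, this is roughly the stanza I toggled in app.config, with
the ttl line commented out (the values and backend names are ours, as an
example; yours will differ). If I'm reading the docs right, max_memory
should also cap growth even with the ttl off:

    {multi_backend, [
        {<<"eleveldb_mult">>, riak_kv_eleveldb_backend, [
            {data_root, "/var/lib/riak/leveldb"}
        ]},
        {<<"memory_mult">>, riak_kv_memory_backend, [
            %% {ttl, 86400},      %% seconds; currently disabled
            {max_memory, 4096}    %% megabytes per vnode
        ]}
    ]},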

-giri

On Thu, Mar 28, 2013 at 6:50 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
> Evan,
>
> This has been happening for a while now (about 3.5 weeks), even prior
> to our upgrade to 1.3.
>
> -giri
>
> On Thu, Mar 28, 2013 at 6:36 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:
>> No. AAE is unrelated to the handoff subsystem. I am not familiar
>> enough with the lowest levels of its workings to know whether it
>> would reproduce the TTL'd data on nodes that don't have the TTL set.
>>
>> I am not totally sure about your timeline here.
>>
>> When did you start seeing these errors: before or after your upgrade
>> to 1.3? When did you start your cluster transition? Which cluster
>> transitions have you initiated?
>>
>> If these errors started with 1.3, an interesting experiment would be
>> to disable AAE and do a rolling restart of the cluster, which should
>> leave you with empty memory backends that won't be repopulated by
>> AAE with anything suspicious. That said: if you've had cluster
>> balance problems for a while, it's possible that these messages
>> (even this whole issue) are just masking some other problem.
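>>
>> (AAE is toggled in the riak_kv section of app.config -- from memory
>> the relevant line is the one below, but treat it as a sketch and
>> check the docs for your version:
>>
>>     {anti_entropy, {off, []}},
>>
>> followed by the rolling restart as usual.)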
>>
>> On Thu, Mar 28, 2013 at 3:24 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
>> > Evan,
>> >
>> > All nodes have been restarted (more than once, in fact) since the
>> > config changes. Using riak-admin aae-status, I noticed that the
>> > anti-entropy repair is still proceeding across the cluster.
>> > It has been less than 24 hours since I upgraded to 1.3; maybe I
>> > have to wait until the first complete build of the index trees
>> > before the cluster starts rebalancing itself.
>> > Could that be the case?
>> >
>> > -giri
>> >
>> > On Thu, Mar 28, 2013 at 5:49 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:
>> >> Giri,
>> >>
>> >> If all of the nodes are using identical app.config files
>> >> (including the joining node) and have been restarted since those
>> >> files changed, it may be some other, related issue.
>> >>
>> >> On Thu, Mar 28, 2013 at 2:46 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
>> >> > Evan,
>> >> >
>> >> > I reconfirmed that all the servers are using identical
>> >> > app.configs. They all use the multi-backend schema. Are you
>> >> > saying that some of the vnodes are in the memory backend on one
>> >> > physical node and in the eleveldb backend on another?
>> >> > If so, how can I fix the offending vnodes?
>> >> >
>> >> > Thanks,
>> >> >
>> >> > -giri
>> >> >
>> >> > On Thu, Mar 28, 2013 at 5:18 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:
>> >> >> It would, if some of the nodes weren't migrated to the new
>> >> >> multi-backend schema; if a memory-backed vnode were trying to
>> >> >> hand off to an eleveldb-backed node, you'd see this.
>> >> >>
>> >> >> On Thu, Mar 28, 2013 at 2:05 PM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
>> >> >> > Evan,
>> >> >> >
>> >> >> > I verified that all of the memory backends have the same ttl
>> >> >> > settings and have done rolling restarts, but it doesn't seem
>> >> >> > to make a difference. One thing to note, though -- I remember
>> >> >> > this problem starting roughly around the time I migrated a
>> >> >> > bucket from being backed by leveldb to being backed by
>> >> >> > memory. I did this by setting the bucket properties via curl
>> >> >> > and let Riak do the migration of the objects in that bucket.
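>> >> >> >
>> >> >> > The call was along these lines (bucket name omitted here, and
>> >> >> > "memory_mult" stands in for whatever the memory backend is
>> >> >> > named in our multi_backend config):
>> >> >> >
>> >> >> >     curl -X PUT http://127.0.0.1:8098/riak/<bucket> \
>> >> >> >       -H "Content-Type: application/json" \
>> >> >> >       -d '{"props": {"backend": "memory_mult"}}'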
>> >> >> >
>> >> >> > Would that cause such issues?
>> >> >> >
>> >> >> > Thanks for your help.
>> >> >> >
>> >> >> > -giri
>> >> >> >
>> >> >> > On Thu, Mar 28, 2013 at 4:55 PM, Evan Vigil-McClanahan <emcclana...@basho.com> wrote:
>> >> >> >> Giri, I've seen similar issues in the past when someone was
>> >> >> >> adjusting the ttl setting on the memory backend. Because one
>> >> >> >> memory backend has it and the other does not, it fails on
>> >> >> >> handoff. If that's what's going on here, it would square
>> >> >> >> with the function_clause in your logs: with a ttl set, the
>> >> >> >> memory backend stores each value wrapped as
>> >> >> >> {{ts, Timestamp}, Value}, and that wrapper is exactly what
>> >> >> >> riak_core_pb:encode is being handed. The solution then was
>> >> >> >> to make sure that all memory backend settings were the same
>> >> >> >> and then do a rolling restart of the cluster (ignoring a lot
>> >> >> >> of errors along the way). I am not sure that this is
>> >> >> >> applicable to your case, but it's something to look at.
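>> >> >> >>
>> >> >> >> (By rolling restart I just mean, node by node, something
>> >> >> >> like:
>> >> >> >>
>> >> >> >>     riak stop && riak start
>> >> >> >>     riak-admin wait-for-service riak_kv riak@<node>
>> >> >> >>
>> >> >> >> waiting for each node to come fully back before moving on to
>> >> >> >> the next.)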
>> >> >> >>
>> >> >> >> On Thu, Mar 28, 2013 at 10:22 AM, Giri Iyengar <giri.iyen...@sociocast.com> wrote:
>> >> >> >> > Godefroy:
>> >> >> >> >
>> >> >> >> > Thanks. Your email exchange on the mailing list was what
>> >> >> >> > prompted me to consider switching to Riak 1.3. I do see
>> >> >> >> > repair messages in the console logs, so some healing is
>> >> >> >> > happening. However, there are a bunch of hinted handoffs
>> >> >> >> > and ownership handoffs that are simply not proceeding,
>> >> >> >> > because the same vnodes keep coming up for transfer and
>> >> >> >> > failing. Perhaps there is a manual way to forcibly repair
>> >> >> >> > and push the vnodes around.
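>> >> >> >> >
>> >> >> >> > (Is riak_kv_vnode:repair/1 from riak attach the sort of
>> >> >> >> > thing that would help here? e.g., for one of the stuck
>> >> >> >> > partitions from the logs below:
>> >> >> >> >
>> >> >> >> >     riak_kv_vnode:repair(148433760041419827630061740822747494183805648896).
>> >> >> >> >
>> >> >> >> > I'm not sure whether that's safe to run in this state.)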
>> >> >> >> >
>> >> >> >> > -giri
>> >> >> >> >
>> >> >> >> > On Thu, Mar 28, 2013 at 1:19 PM, Godefroy de Compreignac <godef...@eklablog.com> wrote:
>> >> >> >> >> I have exactly the same problem with my cluster. If
>> >> >> >> >> anyone knows what those errors mean... :-)
>> >> >> >> >>
>> >> >> >> >> Godefroy
>> >> >> >> >>
>> >> >> >> >> 2013/3/28 Giri Iyengar <giri.iyen...@sociocast.com>
>> >> >> >> >>> Hello,
>> >> >> >> >>>
>> >> >> >> >>> We are running a 6-node Riak 1.3.0 cluster in
>> >> >> >> >>> production. We recently upgraded to 1.3; prior to this,
>> >> >> >> >>> we were running Riak 1.2 on the same 6-node cluster.
>> >> >> >> >>>
>> >> >> >> >>> We are finding that the nodes are not balanced. For
>> >> >> >> >>> instance:
>> >> >> >> >>>
>> >> >> >> >>> ================================= Membership ==================================
>> >> >> >> >>> Status     Ring    Pending    Node
>> >> >> >> >>> -------------------------------------------------------------------------------
>> >> >> >> >>> valid       0.0%      0.0%    'riak@172.16.25.106'
>> >> >> >> >>> valid      34.4%     20.3%    'riak@172.16.25.107'
>> >> >> >> >>> valid      21.9%     20.3%    'riak@172.16.25.113'
>> >> >> >> >>> valid      19.5%     20.3%    'riak@172.16.25.114'
>> >> >> >> >>> valid       8.6%     19.5%    'riak@172.16.25.121'
>> >> >> >> >>> valid      15.6%     19.5%    'riak@172.16.25.122'
>> >> >> >> >>> -------------------------------------------------------------------------------
>> >> >> >> >>> Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>> >> >> >> >>>
>> >> >> >> >>> When we look at the logs on the largest node
>> >> >> >> >>> (riak@172.16.25.107), we see error messages that look
>> >> >> >> >>> like this:
>> >> >> >> >>>
>> >> >> >> >>> 2013-03-28 13:04:16.957 [error] <0.10957.1462>@riak_core_handoff_sender:start_fold:226 hinted_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 148433760041419827630061740822747494183805648896 to 'riak@172.16.25.121' 148433760041419827630061740822747494183805648896 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,476737,222223}},{{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,51,56,100,56,102,50,52,49,52,99,97,97,54,102,99,52,56,53,52,99,99,101,51,98,50,48,102,53,98,52,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,51,97,120,97,105,97,101,97,120,97,66,97,120,97,107,97,119,97,101,97,75,97,117,97,122,97,111,97,55,97,85,97,104,97,85,97,107,97,112,97,120,97,107,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,70,65,98,0,3,99,115,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,52,54,55,98,54,51,98,50,45,50,99,56,52,45,52,56,50,99,45,97,48,99,54,45,56,53,50,100,53,99,57,97,98,98,53,101,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,84,63,31,104,2,97,1,110,5,0,65,191,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,...},...]},...}}}
>> >> >> >> >>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >> >> >>> 2013-03-28 13:04:16.961 [error] <0.29352.909> CRASH REPORT Process <0.29352.909> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,476737,222223}}, {{ts,{1364,476737,222223}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747
>> >> >> >> >>>
>> >> >> >> >>> 2013-03-28 13:04:13.888 [error] <0.12680.1435>@riak_core_handoff_sender:start_fold:226 ownership_handoff transfer of riak_kv_vnode from 'riak@172.16.25.107' 11417981541647679048466287755595961091061972992 to 'riak@172.16.25.113' 11417981541647679048466287755595961091061972992 failed because of error:{badmatch,{error,{worker_crash,{function_clause,[{riak_core_pb,encode,[{ts,{1364,458917,232318}},{{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,84,73,84,89,95,83,69,83,83,109,0,0,0,36,67,54,57,95,48,48,48,54,52,98,99,52,53,51,49,52,55,101,50,101,53,97,102,101,102,49,57,99,50,55,99,97,49,53,54,99,108,0,0,0,1,104,3,100,0,9,114,95,99,111,110,116,101,110,116,104,9,100,0,4,100,105,99,116,97,5,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,108,0,0,0,2,108,0,0,0,11,109,0,0,0,12,99,111,110,116,101,110,116,45,116,121,112,101,97,116,97,101,97,120,97,116,97,47,97,112,97,108,97,97,97,105,97,110,106,108,0,0,0,23,109,0,0,0,11,88,45,82,105,97,107,45,86,84,97,103,97,54,97,88,97,76,97,66,97,69,97,69,97,116,97,73,97,104,97,118,97,77,97,86,97,48,97,81,97,103,97,110,97,119,97,73,97,51,97,85,97,72,97,53,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,5,105,110,100,101,120,106,106,106,108,0,0,0,1,108,0,0,0,1,109,0,0,0,20,88,45,82,105,97,107,45,76,97,115,116,45,77,111,100,105,102,105,101,100,104,3,98,0,0,5,84,98,0,7,0,165,98,0,3,138,179,106,106,108,0,0,0,1,108,0,0,0,6,109,0,0,0,7,99,104,97,114,115,101,116,97,85,97,84,97,70,97,45,97,56,106,106,109,0,0,0,36,55,102,98,52,50,54,54,53,45,57,100,56,48,45,52,54,98,97,45,98,53,97,100,45,56,55,52,52,54,54,97,97,50,56,53,99,106,108,0,0,0,1,104,2,109,0,0,0,8,0,69,155,215,81,59,179,219,104,2,97,1,110,5,0,165,121,200,202,14,106,104,9,100,0,4,100,105,99,116,97,1,97,16,97,16,97,8,97,80,97,48,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,106,104,1,104,16,106,106,106,106,106,106,106,106,106,106,106,106,106,106,108,0,0,0,1,108,0,0,0,1,100,0,5,99,108,101,97,110,100,0,4,116,114,117,101,106,106,100,0,9,117,110,100,101,102,105,110,101,100>>}],[{file,"src/riak_core_pb.erl"},{line,40}]},{riak_core_pb,pack,5,[{...},...]},...]},...}}}
>> >> >> >> >>> [{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,170}]}]
>> >> >> >> >>> 2013-03-28 13:04:14.255 [error] <0.1120.0> CRASH REPORT Process <0.1120.0> with 0 neighbours exited with reason: no function clause matching riak_core_pb:encode({ts,{1364,458917,232318}}, {{ts,{1364,458917,232318}},<<131,104,7,100,0,8,114,95,111,98,106,101,99,116,109,0,0,0,11,69,78,...>>}) line 40 in gen_server:terminate/6 line 747
>> >> >> >> >>>
>> >> >> >> >>> This has been going on for days, and the cluster doesn't
>> >> >> >> >>> seem to be rebalancing itself. We see this issue with
>> >> >> >> >>> both hinted_handoffs and ownership_handoffs. It looks
>> >> >> >> >>> like we have some corrupt data in our cluster. I checked
>> >> >> >> >>> through the leveldb LOGs and did not see any compaction
>> >> >> >> >>> errors.
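>> >> >> >> >>>
>> >> >> >> >>> (Checked with something along these lines -- the
>> >> >> >> >>> data_root path is ours; adjust to your install:
>> >> >> >> >>>
>> >> >> >> >>>     grep -i "compaction error" /var/lib/riak/leveldb/*/LOG
>> >> >> >> >>>
>> >> >> >> >>> which came back empty.)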
>> >> >> >> >>>
>> >> >> >> >>> I was hoping that upgrading to 1.3.0 would slowly start
>> >> >> >> >>> repairing the cluster. However, that doesn't seem to be
>> >> >> >> >>> happening.
>> >> >> >> >>>
>> >> >> >> >>> Any help/hints would be much appreciated.
>> >> >> >> >>>
>> >> >> >> >>> -giri

--
GIRI IYENGAR, CTO
SOCIOCAST
Simple. Powerful. Predictions.

36 WEST 25TH STREET, 7TH FLOOR NEW YORK CITY, NY 10010
O: 917.525.2466x104 M: 914.924.7935 F: 347.943.6281
E: giri.iyen...@sociocast.com W: www.sociocast.com

Facebook's Ad Guru Joins Sociocast - http://bit.ly/NjPQBQ
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com