Responses inline.
On Tue, Aug 11, 2015 at 12:53 PM, changmao wang <wang.chang...@gmail.com> wrote:

> 1. About backing up the four new nodes and then using 'riak-admin
> force-replace': what's the status of the newly added nodes?
> As you know, we want to replace one of the leaving nodes.

I don't understand the question. Doing 'riak-admin force-replace' on one of
the nodes that's leaving should overwrite the leave request and tell it to
change its node id / ip address. (If that doesn't work, stop the leaving
node and do a 'riak-admin reip' instead.)

> 2. What's the risk of running 'riak-admin force-remove riak@10.21.136.91'
> without a backup?
> As you know, the node (riak@10.21.136.91) is currently a member of the
> cluster and holds almost 2.5TB of data, maybe 10 percent of the whole
> cluster.

The only reason I asked about backup is because it sounded like you cleared
the disk on it. If it currently has the data, then it'll be fine.
Force-replace just changes the node id / IP address, and doesn't delete the
data or anything.

On Tue, Aug 11, 2015 at 7:32 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:

> 1. How do I force the "leaving" nodes to leave without data loss?
>
> This depends on - did you back up the data directory of the 4 new nodes
> before you reformatted them?
> If you backed them up (and then restored the data directory once you
> reformatted them), you can try:
>
> riak-admin force-replace 'riak@10.21.136.91' 'riak@<whatever your new ip address is for that node>'
> (same for the other 3)
>
> If you did not back up those nodes, the only thing you can do is force
> them to leave, and then join the new ones. So, for each of the 4:
>
> riak-admin force-remove 'riak@10.21.136.91' 'riak@10.21.136.66'
> (same for the other 3)
>
> In either case, after force-replacing or force-removing, you have to join
> the new nodes to the cluster before you commit:
>
> riak-admin join 'riak@new node' 'riak@10.21.136.66'
> (same for the other 3)
>
> and finally:
>
> riak-admin cluster plan
> riak-admin cluster commit
>
> As for the error: the reason you're seeing it is because the other nodes
> can't contact the 4 that are supposed to be leaving (since you wiped them).
> The amount of time that has passed doesn't matter; the cluster will wait
> for those nodes to leave indefinitely unless you force-remove or
> force-replace.
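For reference, a minimal sketch of those two paths using the staged
'riak-admin cluster' commands available in 1.4.x; names in angle brackets
are placeholders, and nothing takes effect until the staged plan is
committed:

    # Path A: the data directories survived -- swap each old name for its
    # replacement (run on any healthy member, e.g. 10.21.136.66; repeat
    # for .92, .93 and .94).
    riak-admin cluster force-replace riak@10.21.136.91 riak@<new-node-91>

    # If a node refuses the replace, it can be renamed while stopped:
    #   riak-admin reip riak@<old-name> riak@<new-name>

    # Path B: the data is gone -- force the old names out, then join the
    # new nodes to an existing member.
    riak-admin cluster force-remove riak@10.21.136.91   # on a healthy member
    riak-admin cluster join riak@10.21.136.66           # on each new node

    # Review and commit the staged changes.
    riak-admin cluster plan
    riak-admin cluster commit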
> On Tue, Aug 11, 2015 at 1:32 AM, changmao wang <wang.chang...@gmail.com> wrote:
>
>> Hi Dmitri,
>>
>> For your question,
>> 3) Re-formatted those four nodes and re-installed Riak. Here is where it
>> gets tricky though. Several questions for you:
>> - Did you attempt to re-join those 4 reinstalled nodes into the cluster?
>> What was the output of the cluster join and cluster plan commands?
>> - Did the IP address change after they were reformatted? If so, you
>> probably need to use something like 'reip' at this point:
>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>
>> I did NOT try to re-join those 4 reinstalled nodes into the cluster.
>> As you know, member-status shows they are "leaving", as below:
>>
>> riak-admin member-status
>> ================================= Membership ==================================
>> Status     Ring    Pending    Node
>> -------------------------------------------------------------------------------
>> leaving    10.9%     10.9%    'riak@10.21.136.91'
>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>> leaving     7.8%     10.9%    'riak@10.21.136.93'
>> leaving     7.8%     10.9%    'riak@10.21.136.94'
>> valid      10.9%     10.9%    'riak@10.21.136.66'
>> valid      10.9%     10.9%    'riak@10.21.136.71'
>> valid      14.1%     10.9%    'riak@10.21.136.76'
>> valid      17.2%     12.5%    'riak@10.21.136.81'
>> valid      10.9%     10.9%    'riak@10.21.136.86'
>> -------------------------------------------------------------------------------
>> Valid:5 / Leaving:4 / Exiting:0 / Joining:0 / Down:0
>>
>> Two weeks have elapsed and 'riak-admin member-status' shows the same
>> result. I don't know at which step the ring hand-off is stuck.
>>
>> I did not change the IP addresses of the four newly added nodes.
>>
>> My questions:
>>
>> 1. How do I force the "leaving" nodes to leave without data loss?
>> 2. I have found some errors related to the hand-off of partitions in
>> /etc/riak/log/errors. Details are as below:
>>
>> 2015-07-30 16:04:33.643 [error] <0.12872.15>@riak_core_handoff_sender:start_fold:262 ownership_transfer transfer of riak_kv_vnode from 'riak@10.21.136.76' 45671926166590716193865151022383844364247891968 to 'riak@10.21.136.93' 45671926166590716193865151022383844364247891968 failed because of enotconn
>> 2015-07-30 16:04:33.643 [error] <0.197.0>@riak_core_handoff_manager:handle_info:289 An outbound handoff of partition riak_kv_vnode 45671926166590716193865151022383844364247891968 was terminated for reason: {shutdown,{error,enotconn}}
>>
>> I have searched for it with Google and found related articles, but there
>> is no solution:
>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2014-October/016052.html
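A quick sketch of the hand-off diagnostics worth running on any live member;
these are standard riak-admin commands and should point at the unreachable
"leaving" nodes behind the enotconn errors above:

    # Run on any running member of the cluster.
    riak-admin member-status   # ring ownership and pending percentages
    riak-admin ring-status     # unreachable nodes and pending ownership hand-offs
    riak-admin transfers       # partition transfers that are active or waiting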
>> On Mon, Aug 10, 2015 at 10:09 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>
>>> Hi Changmao,
>>>
>>> The state of the cluster can be determined from running 'riak-admin
>>> member-status' and 'riak-admin ring-status'.
>>> If I understand the sequence of events, you:
>>> 1) Joined four new nodes to the cluster. (Which crashed due to not
>>> enough disk space.)
>>> 2) Removed them from the cluster via 'riak-admin cluster leave'. This
>>> is a "planned remove" command, and it expects the nodes to gradually
>>> hand off their partitions (to transfer ownership) before actually
>>> leaving. So this is probably the main problem - the ring is stuck
>>> waiting for those nodes to properly hand off.
>>> 3) Re-formatted those four nodes and re-installed Riak. Here is where it
>>> gets tricky though. Several questions for you:
>>> - Did you attempt to re-join those 4 reinstalled nodes into the cluster?
>>> What was the output of the cluster join and cluster plan commands?
>>> - Did the IP address change after they were reformatted? If so, you
>>> probably need to use something like 'reip' at this point:
>>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>>
>>> The 'failed because of enotconn' error message is happening because the
>>> cluster is waiting to hand off partitions to .94, but cannot connect to it.
>>>
>>> Anyway, here's what I recommend. If you can lose the data, it's probably
>>> easier to format and reinstall the whole cluster.
>>> If not, you can force-remove those four nodes, one by one (see
>>> http://docs.basho.com/riak/latest/ops/running/cluster-admin/#force-remove).
>>>
>>> On Thu, Aug 6, 2015 at 11:55 PM, changmao wang <wang.chang...@gmail.com> wrote:
>>>
>>>> Dmitri,
>>>>
>>>> Thanks for your quick reply. My questions are as below:
>>>> 1. What's the current status of the whole cluster? Is it still
>>>> rebalancing data?
>>>> 2. There are many errors like these in one node's error log. How should
>>>> I handle them?
>>>>
>>>> 2015-08-05 01:38:59.717 [error] <0.23000.298>@riak_core_handoff_sender:start_fold:262 ownership_transfer transfer of riak_kv_vnode from 'riak@10.21.136.81' 525227150915793236229449236757414210188850757632 to 'riak@10.21.136.94' 525227150915793236229449236757414210188850757632 failed because of enotconn
>>>> 2015-08-05 01:38:59.718 [error] <0.195.0>@riak_core_handoff_manager:handle_info:289 An outbound handoff of partition riak_kv_vnode 525227150915793236229449236757414210188850757632 was terminated for reason: {shutdown,{error,enotconn}}
>>>>
>>>> During the last 5 days, there have been no changes in the 'riak-admin
>>>> member-status' output.
>>>> 3. How can I accelerate the data rebalancing?
>>>>
>>>> On Fri, Aug 7, 2015 at 6:41 AM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>>
>>>>> Ok, I think I understand so far. So what's the question?
>>>>>
>>>>> On Thursday, August 6, 2015, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>
>>>>>> Hi Riak users,
>>>>>>
>>>>>> Before adding the new nodes, the cluster had only five members:
>>>>>> 10.21.136.66, 10.21.136.71, 10.21.136.76, 10.21.136.81 and 10.21.136.86.
>>>>>> We did not set up an HTTP proxy for the cluster; only one node of the
>>>>>> cluster provides the HTTP service, so the CPU load is always high on
>>>>>> that node.
>>>>>>
>>>>>> After that, I added four nodes (10.21.136.[91-94]) to the cluster.
>>>>>> During the ring/data rebalancing, each new node failed (riak stopped)
>>>>>> because a disk filled up to 100%.
>>>>>> I had passed a multi-disk path to the "data_root" parameter in
>>>>>> '/etc/riak/app.config', and each disk is only 580MB in size.
>>>>>> As you know, the bitcask storage engine does not support multiple
>>>>>> data paths: once one of the disks is 100% full, it cannot switch to
>>>>>> the next idle disk, so the "riak" service goes down.
>>>>>>
>>>>>> After that, I removed the four newly added nodes from the active
>>>>>> nodes with "riak-admin cluster leave riak@'10.21.136.91'",
>>>>>> then stopped the "riak" service on the other new nodes and
>>>>>> reformatted them with LVM disk management (binding the 6 disks into
>>>>>> one volume group).
>>>>>> I replaced the "data_root" parameter with a single folder and started
>>>>>> the "riak" service again. After that, the cluster began rebalancing
>>>>>> again.
>>>>>> That's the whole story.
>>>>>>
>>>>>> Amao
>>>>>>
>>>>>> ------------------------------
>>>>>> From: "Dmitri Zagidulin" <dzagidu...@basho.com>
>>>>>> To: "Changmao.Wang" <changmao.w...@datayes.com>
>>>>>> Sent: Thursday, August 6, 2015 10:46:59 PM
>>>>>> Subject: Re: why leaving riak cluster so slowly and how to accelerate the speed
>>>>>>
>>>>>> Hi Amao,
>>>>>>
>>>>>> Can you explain a bit more which steps you've taken, and what the
>>>>>> problem is?
>>>>>>
>>>>>> Which nodes have been added, and which nodes are leaving the cluster?
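For what it's worth, a rough sketch of the single-volume rebuild described
above (binding the six small disks into one LVM volume group so bitcask sees
a single data_root); the device names and paths here are only placeholders:

    # Placeholders -- substitute the real device names and mount point.
    pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
    vgcreate riak_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
    lvcreate -l 100%FREE -n riak_lv riak_vg
    mkfs.ext4 /dev/riak_vg/riak_lv
    mount /dev/riak_vg/riak_lv /var/lib/riak
    # app.config then points bitcask at one directory on that volume, e.g.
    #   {bitcask, [{data_root, "/var/lib/riak/bitcask"}]}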
>>>>>> On Tue, Jul 28, 2015 at 11:03 PM, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>>
>>>>>>> Hi Riak user group,
>>>>>>>
>>>>>>> I'm using riak and riak-cs 1.4.2. Last weekend, I added four nodes
>>>>>>> to a cluster of 5 nodes. However, it failed with one of the disks
>>>>>>> 100% full. As you know, the bitcask storage engine cannot support
>>>>>>> multiple folders.
>>>>>>>
>>>>>>> After that, I restarted "riak" and made the nodes leave the cluster
>>>>>>> with "riak-admin cluster leave", "riak-admin cluster plan" and then
>>>>>>> the commit.
>>>>>>> However, riak has been doing KV rebalancing ever since I submitted
>>>>>>> the leave command. I guess it is still working through the earlier
>>>>>>> join.
>>>>>>>
>>>>>>> Could you show us how to accelerate the leaving process? I have
>>>>>>> tuned the "transfer-limit" parameter on the 9 nodes.
>>>>>>>
>>>>>>> Below is the output of some commands:
>>>>>>>
>>>>>>> riak-admin member-status
>>>>>>> ================================= Membership ==================================
>>>>>>> Status     Ring    Pending    Node
>>>>>>> -------------------------------------------------------------------------------
>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.91'
>>>>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.93'
>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.94'
>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>>>>> valid      12.5%     10.9%    'riak@10.21.136.71'
>>>>>>> valid      18.8%     10.9%    'riak@10.21.136.76'
>>>>>>> valid      18.8%     12.5%    'riak@10.21.136.81'
>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>>>>>
>>>>>>> riak-admin transfer_limit
>>>>>>> =============================== Transfer Limit ================================
>>>>>>> Limit    Node
>>>>>>> -------------------------------------------------------------------------------
>>>>>>>   200    'riak@10.21.136.66'
>>>>>>>   200    'riak@10.21.136.71'
>>>>>>>   100    'riak@10.21.136.76'
>>>>>>>   100    'riak@10.21.136.81'
>>>>>>>   200    'riak@10.21.136.86'
>>>>>>>   500    'riak@10.21.136.91'
>>>>>>>   500    'riak@10.21.136.92'
>>>>>>>   500    'riak@10.21.136.93'
>>>>>>>   500    'riak@10.21.136.94'
>>>>>>>
>>>>>>> Do you need any more details to diagnose the problem?
>>>>>>>
>>>>>>> Amao
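As a side note on the transfer-limit tuning mentioned above, a small sketch
of how that limit (the same setting shown in the table) is usually inspected,
set and watched; the values here are only examples, and very high limits
mostly move the bottleneck to disk and network:

    # Show the current hand-off concurrency limits (the table above).
    riak-admin transfer-limit
    # Set the limit on a single node, or on every node at once
    # (example values; the default is 2).
    riak-admin transfer-limit riak@10.21.136.66 4
    riak-admin transfer-limit 4
    # Watch the transfers drain.
    riak-admin transfers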
>>>> --
>>>> Amao Wang
>>>> Best & Regards
>>
>> --
>> Amao Wang
>> Best & Regards
>
> --
> Amao Wang
> Best & Regards

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com