Pending 0% just means there are no pending transfers; the cluster state is stable. If you've successfully tested the process on a test cluster, there's no reason it would behave differently in production.
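As a quick sanity check before and after the production change, something like the following (run from any cluster member) should confirm the ring has settled; this is only a sketch of the checks already used in this thread:

  riak-admin member-status   # remaining nodes 'valid', Pending column at 0.0%
  riak-admin transfers       # no partitions waiting to hand off
  riak-admin ring-status     # ring ready, no unreachable nodes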
On Friday, August 14, 2015, changmao wang <wang.chang...@gmail.com> wrote:

> During the last three days, I set up a development Riak cluster with five
> nodes and used "s3cmd" to upload 18GB of test data (roughly 20,000 files).
> After that, I made one node leave the cluster, then shut it down and marked
> it down, replaced its IP address, and joined it to the cluster again. The
> whole process was successful. However, I'm not sure whether or not it can
> be done in the production environment.
>
> I followed the docs below for the steps above:
> http://docs.basho.com/riak/latest/ops/running/nodes/renaming/
>
> After I ran "riak-admin cluster leave riak@'x.x.x.x'", "riak-admin cluster
> plan" and "riak-admin cluster commit", I checked the member-status. The
> main difference between leaving the cluster in production and in the
> development environment is shown below:
>
> root@cluster-s3-dev-hd1:~# riak-admin member-status
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> leaving    18.8%      0.0%    'riak@10.21.236.185'
> valid      21.9%     25.0%    'riak@10.21.236.181'
> valid      21.9%     25.0%    'riak@10.21.236.182'
> valid      18.8%     25.0%    'riak@10.21.236.183'
> valid      18.8%     25.0%    'riak@10.21.236.184'
> -------------------------------------------------------------------------------
>
> Several minutes later, I checked the status again:
>
> root@cluster-s3-dev-hd1:~# riak-admin member-status
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> leaving    12.5%      0.0%    'riak@10.21.236.185'
> valid      21.9%     25.0%    'riak@10.21.236.181'
> valid      28.1%     25.0%    'riak@10.21.236.182'
> valid      18.8%     25.0%    'riak@10.21.236.183'
> valid      18.8%     25.0%    'riak@10.21.236.184'
> -------------------------------------------------------------------------------
> Valid:4 / Leaving:1 / Exiting:0 / Joining:0 / Down:0
>
> After that, I shut down Riak with "riak stop" and marked it down from the
> active nodes.
> My question is: what is the meaning of "Pending 0.0%"?
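For reference, the development procedure described above boils down to something like this (a rough sketch of the same steps; the node name is the one from the output above, and the "down" step is run from a node that is still up):

  # from any cluster member
  riak-admin cluster leave 'riak@10.21.236.185'
  riak-admin cluster plan
  riak-admin cluster commit
  # wait for ownership handoff to finish, then on the leaving node:
  riak stop
  # and finally, from a node that is still running:
  riak-admin down 'riak@10.21.236.185'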
> On the production cluster, the status is as below:
> root@cluster1-hd12:/root/scripts# riak-admin transfers
> 'riak@10.21.136.94' waiting to handoff 5 partitions
> 'riak@10.21.136.93' waiting to handoff 5 partitions
> 'riak@10.21.136.92' waiting to handoff 5 partitions
> 'riak@10.21.136.91' waiting to handoff 5 partitions
> 'riak@10.21.136.86' waiting to handoff 5 partitions
> 'riak@10.21.136.81' waiting to handoff 2 partitions
> 'riak@10.21.136.76' waiting to handoff 3 partitions
> 'riak@10.21.136.71' waiting to handoff 5 partitions
> 'riak@10.21.136.66' waiting to handoff 5 partitions
>
> And there are active transfers. In the development environment there were
> no active transfers after I ran "riak-admin cluster commit".
> Can I follow the same steps on the production cluster as I did in the
> development environment?
>
> On Wed, Aug 12, 2015 at 10:39 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>
>> Responses inline.
>>
>> On Tue, Aug 11, 2015 at 12:53 PM, changmao wang <wang.chang...@gmail.com> wrote:
>>
>>> 1. About backing up the four new nodes and then using 'riak-admin
>>> force-replace': what is the status of the newly added nodes?
>>> As you know, we want to replace one of the leaving nodes.
>>
>> I don't understand the question. Doing 'riak-admin force-replace' on one
>> of the nodes that's leaving should overwrite the leave request and tell it
>> to change its node id / ip address. (If that doesn't work, stop the leaving
>> node, and do a 'riak-admin reip' command instead.)
>>
>>> 2. What's the risk of 'riak-admin force-remove' on 'riak@10.21.136.91'
>>> without a backup? As you know, that node is currently a member of the
>>> cluster and holds almost 2.5TB of data, maybe 10 percent of the whole
>>> cluster.
>>
>> The only reason I asked about backup is because it sounded like you
>> cleared the disk on it. If it currently has the data, then it'll be fine.
>> Force-remove just changes the IP address, and doesn't delete the data or
>> anything.
>>
>> On Tue, Aug 11, 2015 at 7:32 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>
>>> 1. How to force leave the "leaving" nodes without data loss?
>>>
>>> This depends on whether you backed up the data directory of the 4 new
>>> nodes before you reformatted them.
>>> If you backed them up (and then restored the data directory once you
>>> reformatted them), you can try:
>>>
>>> riak-admin force-replace 'riak@10.21.136.91' 'riak@<whatever your new ip address is for that node>'
>>> (same for the other 3)
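If force-replace is not accepted for a leaving node, the 'riak-admin reip' route mentioned above looks roughly like this; it has to be run while the node is stopped, and both node names are placeholders:

  # on the node being renamed
  riak stop
  # (set the new -name in /etc/riak/vm.args before restarting)
  riak-admin reip 'riak@<old ip address>' 'riak@<new ip address>'
  riak start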
>>> If you did not back up those nodes, the only thing you can do is force
>>> them to leave. So, for each of the 4:
>>>
>>> riak-admin force-remove 'riak@10.21.136.91' 'riak@10.21.136.66'
>>> (same for the other 3)
>>>
>>> In either case, after force-replacing or force-removing, you have to
>>> join the new nodes to the cluster before you commit:
>>>
>>> riak-admin join 'riak@<new node>' 'riak@10.21.136.66'
>>> (same for the other 3)
>>> and finally:
>>> riak-admin cluster plan
>>> riak-admin cluster commit
>>>
>>> As for the error: the reason you're seeing it is that the other nodes
>>> can't contact the 4 that are supposed to be leaving (since you wiped
>>> them).
>>> The amount of time that has passed doesn't matter; the cluster will wait
>>> for those nodes to leave indefinitely unless you force-remove or
>>> force-replace.
>>>
>>> On Tue, Aug 11, 2015 at 1:32 AM, changmao wang <wang.chang...@gmail.com> wrote:
>>>
>>>> Hi Dmitri,
>>>>
>>>> For your question:
>>>> 3) Re-formatted those four nodes and re-installed Riak. Here is where
>>>> it gets tricky though. Several questions for you:
>>>> - Did you attempt to re-join those 4 reinstalled nodes into the
>>>> cluster? What was the output of the cluster join and cluster plan commands?
>>>> - Did the IP address change after they were reformatted? If so, you
>>>> probably need to use something like 'reip' at this point:
>>>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>>>
>>>> I did NOT try to re-join those 4 reinstalled nodes into the cluster.
>>>> As you know, member-status shows they are "leaving", as below:
>>>>
>>>> riak-admin member-status
>>>> ================================= Membership ==================================
>>>> Status     Ring    Pending    Node
>>>> -------------------------------------------------------------------------------
>>>> leaving    10.9%     10.9%    'riak@10.21.136.91'
>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>> leaving     7.8%     10.9%    'riak@10.21.136.93'
>>>> leaving     7.8%     10.9%    'riak@10.21.136.94'
>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>> valid      10.9%     10.9%    'riak@10.21.136.71'
>>>> valid      14.1%     10.9%    'riak@10.21.136.76'
>>>> valid      17.2%     12.5%    'riak@10.21.136.81'
>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>> -------------------------------------------------------------------------------
>>>> Valid:5 / Leaving:4 / Exiting:0 / Joining:0 / Down:0
>>>>
>>>> Two weeks have elapsed and 'riak-admin member-status' still shows the
>>>> same result. I don't know at which step the ring hands off.
>>>>
>>>> I did not change the IP addresses of the four newly added nodes.
>>>>
>>>> My questions:
>>>>
>>>> 1. How do I force the "leaving" nodes to leave without data loss?
>>>> 2. I have found some errors related to handoff of partitions in
>>>> /etc/riak/log/errors. Details are as below:
>>>>
>>>> 2015-07-30 16:04:33.643 [error] <0.12872.15>@riak_core_handoff_sender:start_fold:262
>>>> ownership_transfer transfer of riak_kv_vnode from 'riak@10.21.136.76'
>>>> 45671926166590716193865151022383844364247891968 to 'riak@10.21.136.93'
>>>> 45671926166590716193865151022383844364247891968 failed because of enotconn
>>>> 2015-07-30 16:04:33.643 [error] <0.197.0>@riak_core_handoff_manager:handle_info:289
>>>> An outbound handoff of partition riak_kv_vnode
>>>> 45671926166590716193865151022383844364247891968 was terminated for
>>>> reason: {shutdown,{error,enotconn}}
>>>>
>>>> I searched for this with Google and found related threads, but no
>>>> solution:
>>>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2014-October/016052.html
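As explained in the reply quoted below, the enotconn failures mean the sender cannot hold a handoff connection to the receiving node. A minimal way to check connectivity, assuming the default handoff port of 8099 (set by handoff_port in the riak_core section of app.config):

  # from the node reporting the failed handoff, e.g. 10.21.136.76
  riak-admin ring-status          # shows whether any nodes are unreachable
  nc -zv 10.21.136.93 8099        # TCP check against the receiver's handoff port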
>>>> On Mon, Aug 10, 2015 at 10:09 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>>
>>>>> Hi Changmao,
>>>>>
>>>>> The state of the cluster can be determined by running 'riak-admin
>>>>> member-status' and 'riak-admin ring-status'.
>>>>> If I understand the sequence of events, you:
>>>>> 1) Joined four new nodes to the cluster (which crashed due to not
>>>>> enough disk space).
>>>>> 2) Removed them from the cluster via 'riak-admin cluster leave'. This
>>>>> is a "planned remove" command, and it expects the nodes to gradually
>>>>> hand off their partitions (to transfer ownership) before actually
>>>>> leaving. So this is probably the main problem - the ring is stuck
>>>>> waiting for those nodes to properly hand off.
>>>>> 3) Re-formatted those four nodes and re-installed Riak. Here is where
>>>>> it gets tricky though. Several questions for you:
>>>>> - Did you attempt to re-join those 4 reinstalled nodes into the
>>>>> cluster? What was the output of the cluster join and cluster plan commands?
>>>>> - Did the IP address change after they were reformatted? If so, you
>>>>> probably need to use something like 'reip' at this point:
>>>>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>>>>
>>>>> The 'failed because of enotconn' error message is happening because
>>>>> the cluster is waiting to hand off partitions to .94, but cannot
>>>>> connect to it.
>>>>>
>>>>> Anyway, here's what I recommend. If you can lose the data, it's
>>>>> probably easier to format and reinstall the whole cluster.
>>>>> If not, you can 'force-remove' those four nodes, one by one (see
>>>>> http://docs.basho.com/riak/latest/ops/running/cluster-admin/#force-remove).
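Pulling the advice in this thread together, the recovery path looks roughly like the following; node names are placeholders, the cluster-style syntax is assumed for 1.4.x, and force-replace expects the replacement node to have been started and joined first:

  # from any running member, for each of the four stuck nodes
  riak-admin cluster force-replace 'riak@10.21.136.91' 'riak@<replacement node>'   # data directory preserved
  # or, if the data on the old node is gone:
  riak-admin cluster force-remove 'riak@10.21.136.91'
  # on each replacement node, join it to an existing member
  riak-admin cluster join 'riak@10.21.136.66'
  # then review and commit the staged changes from any member
  riak-admin cluster plan
  riak-admin cluster commit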
>>>>> On Thu, Aug 6, 2015 at 11:55 PM, changmao wang <wang.chang...@gmail.com> wrote:
>>>>>
>>>>>> Dmitri,
>>>>>>
>>>>>> Thanks for your quick reply. My questions are as below:
>>>>>> 1. What's the current status of the whole cluster? Is it still
>>>>>> rebalancing data?
>>>>>> 2. There are many errors in one node's error log. How should I
>>>>>> handle them?
>>>>>>
>>>>>> 2015-08-05 01:38:59.717 [error] <0.23000.298>@riak_core_handoff_sender:start_fold:262
>>>>>> ownership_transfer transfer of riak_kv_vnode from 'riak@10.21.136.81'
>>>>>> 525227150915793236229449236757414210188850757632 to 'riak@10.21.136.94'
>>>>>> 525227150915793236229449236757414210188850757632 failed because of enotconn
>>>>>> 2015-08-05 01:38:59.718 [error] <0.195.0>@riak_core_handoff_manager:handle_info:289
>>>>>> An outbound handoff of partition riak_kv_vnode
>>>>>> 525227150915793236229449236757414210188850757632 was terminated for
>>>>>> reason: {shutdown,{error,enotconn}}
>>>>>>
>>>>>> During the last 5 days there has been no change in the "riak-admin
>>>>>> member-status" output.
>>>>>> 3. How can I accelerate the data rebalancing?
>>>>>>
>>>>>> On Fri, Aug 7, 2015 at 6:41 AM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>>>>
>>>>>>> Ok, I think I understand so far. So what's the question?
>>>>>>>
>>>>>>> On Thursday, August 6, 2015, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>>>
>>>>>>>> Hi Riak users,
>>>>>>>>
>>>>>>>> Before adding the new nodes, the cluster had only five nodes. The
>>>>>>>> member list was:
>>>>>>>> 10.21.136.66, 10.21.136.71, 10.21.136.76, 10.21.136.81, 10.21.136.86.
>>>>>>>> We did not set up an HTTP proxy for the cluster; only one node of
>>>>>>>> the cluster provides the HTTP service, so the CPU load is always
>>>>>>>> high on that node.
>>>>>>>>
>>>>>>>> After that, I added four nodes (10.21.136.[91-94]) to the cluster.
>>>>>>>> During the ring/data rebalancing, each of them failed (riak
>>>>>>>> stopped) because a disk hit 100% full.
>>>>>>>> I had used a multi-disk path for the "data_root" parameter in
>>>>>>>> '/etc/riak/app.config'. Each disk is only 580GB in size.
>>>>>>>> As you know, the bitcask storage engine does not support multiple
>>>>>>>> data paths. After one of the disks is 100% full, it cannot switch
>>>>>>>> to the next idle disk, so the "riak" service goes down.
>>>>>>>>
>>>>>>>> After that, I removed the four newly added nodes from the active
>>>>>>>> nodes with "riak-admin cluster leave riak@'10.21.136.91'",
>>>>>>>> then stopped the "riak" service on the other new nodes and
>>>>>>>> reformatted them with LVM (binding the 6 disks into one volume
>>>>>>>> group).
>>>>>>>> I replaced the "data_root" parameter with a single folder and then
>>>>>>>> started the "riak" service again. After that, the cluster began
>>>>>>>> rebalancing data again.
>>>>>>>> That's the whole story.
>>>>>>>>
>>>>>>>> Amao
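Since bitcask takes a single data_root, consolidating the small disks into one logical volume, as described above, is the usual workaround. A minimal sketch, with device names, volume names and the mount point as placeholders:

  # combine the six small disks into one volume group and a single logical volume
  pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  vgcreate riak_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
  lvcreate -l 100%FREE -n riak_lv riak_vg
  mkfs.ext4 /dev/riak_vg/riak_lv
  mount /dev/riak_vg/riak_lv /var/lib/riak/bitcask
  # then point bitcask at the single mount in /etc/riak/app.config:
  #   {bitcask, [{data_root, "/var/lib/riak/bitcask"}, ...]}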
>>>>>>>> ------------------------------
>>>>>>>> From: "Dmitri Zagidulin" <dzagidu...@basho.com>
>>>>>>>> To: "Changmao.Wang" <changmao.w...@datayes.com>
>>>>>>>> Sent: Thursday, August 6, 2015 10:46:59 PM
>>>>>>>> Subject: Re: why leaving riak cluster so slowly and how to accelerate the speed
>>>>>>>>
>>>>>>>> Hi Amao,
>>>>>>>>
>>>>>>>> Can you explain a bit more which steps you've taken, and what the
>>>>>>>> problem is?
>>>>>>>>
>>>>>>>> Which nodes have been added, and which nodes are leaving the
>>>>>>>> cluster?
>>>>>>>>
>>>>>>>> On Tue, Jul 28, 2015 at 11:03 PM, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Riak user group,
>>>>>>>>>
>>>>>>>>> I'm using Riak and Riak CS 1.4.2. Last weekend, I added four
>>>>>>>>> nodes to a cluster of 5 nodes. However, it failed with one of the
>>>>>>>>> disks 100% full.
>>>>>>>>> As you know, the bitcask storage engine cannot span multiple
>>>>>>>>> folders.
>>>>>>>>>
>>>>>>>>> After that, I restarted "riak" and made the nodes leave the
>>>>>>>>> cluster with "riak-admin cluster leave" and "riak-admin cluster
>>>>>>>>> plan", followed by "riak-admin cluster commit".
>>>>>>>>> However, Riak has been rebalancing KV data ever since I committed
>>>>>>>>> the leave. I guess it is still working through the earlier join.
>>>>>>>>>
>>>>>>>>> Could you show us how to accelerate the leaving process? I have
>>>>>>>>> already tuned the "transfer-limit" parameter on all 9 nodes.
>>>>>>>>>
>>>>>>>>> Below is the output of some commands:
>>>>>>>>>
>>>>>>>>> riak-admin member-status
>>>>>>>>> ================================= Membership ==================================
>>>>>>>>> Status     Ring    Pending    Node
>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.91'
>>>>>>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.93'
>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.94'
>>>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>>>>>>> valid      12.5%     10.9%    'riak@10.21.136.71'
>>>>>>>>> valid      18.8%     10.9%    'riak@10.21.136.76'
>>>>>>>>> valid      18.8%     12.5%    'riak@10.21.136.81'
>>>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>>>>>>>
>>>>>>>>> riak-admin transfer_limit
>>>>>>>>> =============================== Transfer Limit ================================
>>>>>>>>> Limit    Node
>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>   200    'riak@10.21.136.66'
>>>>>>>>>   200    'riak@10.21.136.71'
>>>>>>>>>   100    'riak@10.21.136.76'
>>>>>>>>>   100    'riak@10.21.136.81'
>>>>>>>>>   200    'riak@10.21.136.86'
>>>>>>>>>   500    'riak@10.21.136.91'
>>>>>>>>>   500    'riak@10.21.136.92'
>>>>>>>>>   500    'riak@10.21.136.93'
>>>>>>>>>   500    'riak@10.21.136.94'
>>>>>>>>>
>>>>>>>>> Any more details needed for diagnosing the problem?
>>>>>>>>>
>>>>>>>>> Amao
>
> --
> Amao Wang
> Best & Regards
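On the question of speeding up handoff: the per-node concurrency can be changed at runtime with the same command that produced the table above. A sketch, with the limit value of 8 purely as an example:

  riak-admin transfer_limit                         # show current limits
  riak-admin transfer_limit 'riak@10.21.136.76' 8   # change the limit on one node
  riak-admin transfer_limit 8                       # or set it cluster-wide
  riak-admin transfers                              # watch handoff progress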
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com