have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Jesse Newland
I'm running through some disaster scenarios before bringing a riak cluster into 
production, and have run into a scenario whose proper resolution I can't work 
out just yet:

Say an ec2 instance that was a part of a ring went away quickly, and data from 
it was unrecoverable.

How might I go about telling the rest of the ring that a new instance that I've 
brought up should take over the vnodes that were on that old instance? This 
sounds like a job for `riak-admin reip`, but after running `reip downed_node 
new_node`, `riak-admin ringready` still shows that the old nodes are a part of 
the ring and down. I guess what I'd like to do is a posthumous `leave`?

Thoughts?

Regards -

Jesse Newland
---
je...@railsmachine.com
404.216.1093

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Sean Cribbs
`leave` is exactly what you want to do then.  Once the old node has left (use 
`ringready` to track its exit), add the new node.

If the EBS volume containing the node's data was not lost, you could mount it 
onto the new node to save some recovery time, and then reip.  However, you'll 
need to reip on all machines.

Sean Cribbs 
Developer Advocate
Basho Technologies, Inc.
http://basho.com/



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Jesse Newland
The description of leave on the wiki mentions that it "causes the node to leave 
the cluster it participates in" - I assume "the node" refers to the node this 
command is run on? How would I "leave" a node that I can't run this command on 
anymore?

Regards -

Jesse Newland
---
je...@railsmachine.com
404.216.1093



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Alexander Sicular
This has come up before. "Leave" is what is currently available and
needs to be run on the node that wants to leave. This, of course,
means the node needs to be available. What you really want is a kick
like "remove" or something that doesn't exist yet, afaik. I think
there is a ticket open.

-alexander



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Sean Cribbs
Sorry, I wasn't completely clear. You can make any node "leave" from the 
console. e.g.

riak_core_gossip:remove_from_cluster('r...@some-host.com').

Sean Cribbs 
Developer Advocate
Basho Technologies, Inc.
http://basho.com/



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Jesse Newland
Thanks Sean!

Regards -

Jesse Newland
---
je...@railsmachine.com
404.216.1093



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Leonid Riaboshtan
>> However, you'll need to reip on all machines.

Hmm, shouldn't stuff like that be handled automatically by Riak? I mean, I
have a cluster where nodes leave and nodes come. After each join or leave, do
I need to do something on every node in the cluster to introduce or remove the
new or old node and repartition the data?

And the question sounds rather strange to me: what is a node's role in a
system where all nodes are equal? It's said everywhere that Riak will
automatically re-balance data as nodes join and leave the cluster. Isn't that
also the case when a node becomes unreachable, so that the cluster would
repartition the data to keep it solid (like keeping n_val for keys)?

Or should something else watch the nodes' states and tell the cluster that a
node is down?

It's also said that:
The ring state is shared around the cluster by means of a "gossip protocol".
Whenever a node changes its claim on the ring, it announces its change via
this protocol. It also periodically re-announces what it knows about the
ring, in case any nodes missed previous updates.

Isn't the cluster checking for unavailable nodes that way too?

I'm not trying to offend anyone, just trying to make things clearer for myself.



Re: have a new node take over the role of a downed, unrecoverable node?

2010-10-16 Thread Sean Cribbs

On Oct 16, 2010, at 9:52 PM, Leonid Riaboshtan wrote:

> >> However, you'll need to reip on all machines.
> 
> Hmm, shouldn't stuff like that be handled automatically by Riak? I mean, I 
> have a cluster where nodes leave and nodes come. After each join or leave, do 
> I need to do something on every node in the cluster to introduce or remove 
> the new or old node and repartition the data?
> 

There are two kinds of "coming" and "leaving" -- temporary and permanent.  
Temporary absences, yes, are handled automatically with things like hinted 
handoff and read repair.  Permanent absences -- as in the case of an EC2 
instance outage -- are things that need to be handled by a competent system 
administrator or developer.  The truth is, changes to the ring state are 
expensive: the new state needs to be gossiped around, and data needs to shift 
around the cluster.  If Riak were automatically making those operational 
decisions for you, performance and stability would suffer.
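
To make the cost of a permanent membership change concrete, here is a toy 
model in Python. It is only a sketch under loud assumptions: the round-robin 
ownership, the node names, and the functions `build_ring`, `remove_node` and 
`moved_partitions` are all hypothetical and are not Riak's actual claim code. 
It counts how many partitions get a new owner, and so how much data has to 
move, when one node leaves for good.

```python
from typing import Dict, List


def build_ring(nodes: List[str], num_partitions: int) -> Dict[int, str]:
    """Assign partitions to nodes round-robin (a toy claim function)."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def remove_node(ring: Dict[int, str], gone: str) -> Dict[int, str]:
    """Return a new ring in which `gone` owns nothing.

    Partitions owned by `gone` are handed round-robin to the remaining
    nodes; every other partition keeps its owner.
    """
    remaining = sorted({owner for owner in ring.values() if owner != gone})
    if not remaining:
        raise ValueError("cannot remove the last node")
    new_ring = dict(ring)
    orphaned = [p for p in sorted(ring) if ring[p] == gone]
    for i, p in enumerate(orphaned):
        new_ring[p] = remaining[i % len(remaining)]
    return new_ring


def moved_partitions(old: Dict[int, str], new: Dict[int, str]) -> List[int]:
    """List the partitions whose owner differs between the two rings."""
    return [p for p in sorted(old) if old[p] != new[p]]


ring = build_ring(["node_a", "node_b", "node_c", "node_d"], 16)
after = remove_node(ring, "node_d")
moved = moved_partitions(ring, after)
print(f"{len(moved)} of {len(ring)} partitions changed owner: {moved}")
# prints: 4 of 16 partitions changed owner: [3, 7, 11, 15]
```

Every partition in `moved` stands for data that has to be transferred to its 
new owner, on top of gossiping the new ring to all members.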

As a side note, having to reip on all machines is a limitation of the 
`reip` command, not of the core gossip functionality.

> And the question sounds rather strange to me: what is a node's role in a 
> system where all nodes are equal? It's said everywhere that Riak will 
> automatically re-balance data as nodes join and leave the cluster. Isn't that 
> also the case when a node becomes unreachable, so that the cluster would 
> repartition the data to keep it solid (like keeping n_val for keys)?
> 

Yes, n_val is still respected -- fallbacks take over for the missing node(s), 
even in an extended outage.  In the quoted sentence, we're talking about more 
permanent membership changes.

> Or should something else watch the nodes' states and tell the cluster that a 
> node is down?
> 
> It's also said that:
> The ring state is shared around the cluster by means of a "gossip protocol". 
> Whenever a node changes its claim on the ring, it announces its change via 
> this protocol. It also periodically re-announces what it knows about the 
> ring, in case any nodes missed previous updates.
> 

Actually, Erlang's built-in networking takes care of a lot of the checking for 
node availability; connected nodes will periodically send heartbeat messages to 
one another.  If a node becomes unavailable, it is removed from preflists and 
fallbacks take over.  The gossip protocol is for propagating changes to the 
ring state around the cluster. If the ring were frequently unstable (temporary 
partitions/failures affecting membership), you'd have a lot of trouble 
performing normal operations.

As I said above, the key difference here is between temporary and permanent 
failures.
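
The temporary case can be sketched the same way. The Python below is a toy 
model, not Riak's preflist code: the ring layout, the node names and the 
`preflist` function are assumptions made for illustration. It shows the ring 
staying untouched while a reachable node further around the ring stands in 
as a fallback for a primary that is down, so n_val entries are still served.

```python
from typing import Dict, List, Set, Tuple

Entry = Tuple[int, str, str]  # (partition, serving node, role)


def preflist(ring: Dict[int, str], start: int, n_val: int,
             down: Set[str]) -> List[Entry]:
    """Build a toy preference list of n_val entries starting at `start`.

    Primaries are the owners of the next n_val partitions. If a primary is
    in `down`, the next reachable node found by walking further around the
    ring serves that partition as a fallback. The ring is never modified.
    """
    size = len(ring)
    primaries = [(start + i) % size for i in range(n_val)]
    in_use = {ring[p] for p in primaries if ring[p] not in down}

    # Reachable owners beyond the primaries, in ring order, not yet serving.
    spares: List[str] = []
    for j in range(n_val, size):
        owner = ring[(start + j) % size]
        if owner not in down and owner not in in_use and owner not in spares:
            spares.append(owner)

    result: List[Entry] = []
    for p in primaries:
        owner = ring[p]
        if owner not in down:
            result.append((p, owner, "primary"))
        elif spares:
            result.append((p, spares.pop(0), "fallback"))
        # With no spare node left, the entry is simply missing in this model.
    return result


nodes = ["node_a", "node_b", "node_c", "node_d"]
ring = {p: nodes[p % len(nodes)] for p in range(8)}
ring_before = dict(ring)

healthy = preflist(ring, start=0, n_val=3, down=set())
degraded = preflist(ring, start=0, n_val=3, down={"node_b"})
print(healthy)
print(degraded)
```

In this model a down node changes only who answers requests. Ownership in 
`ring` changes only when an operator makes a permanent membership change.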

> Isn't the cluster checking for unavailable nodes that way too?
> 
> I'm not trying to offend anyone, just trying to make things clearer for myself.
> 

Not offended; your questions are important to answer, and they demonstrate a 
lack of clarity in our documentation.  Any suggestions you have on how to clear 
that up would be appreciated!


Sean Cribbs 
Developer Advocate
Basho Technologies, Inc.
http://basho.com/