Re: Replacing node without shutting down the old node

2023-05-16 Thread Jeff Jirsa
In-line

On May 15, 2023, at 5:26 PM, Runtian Liu  wrote:

> Hi Jeff,
>
> I tried the setup with 16 vnodes and the NetworkTopologyStrategy
> replication strategy with replication factor 3 and 3 racks in one cluster.
> When using the new node token as the old node token - 1
>
I had said +1 but you’re right that it’s actually -1, sorry about that. You
want the new node to be lower than the existing host. The lower token will
take most of the data.

> I see the new node is streaming from the old node only. And the decom
> phase of the old node is extremely fast. Does this mean the new node will
> only take data ownership from the old node?
>
With exactly three racks, yes. With more racks or fewer racks, no.

> I also did some cleanups after replacing the node with old token - 1, and
> the cleanup sstable count was not increasing. It looks like adding a node
> with old_token - 1 and decommissioning the old node will not generate
> stale data on the rest of the cluster. Do you know of any edge cases in
> this replacement process that could generate stale data on other nodes of
> the cluster with the setup I mentioned?
>
Should do exactly what you want. I’d still run cleanup but it should be a
no-op.

> Thanks,
> Runtian
>
> On Mon, May 8, 2023 at 9:59 PM Runtian Liu  wrote:
>
>> I thought the joining node would not participate in quorum? How are we
>> counting things like how many replicas ACK a write when we are adding a
>> new node for expansion? The token ownership won't change until the new
>> node is fully joined, right?
>>
>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa  wrote:
>>
>>> You can't have two nodes with the same token (in the current metadata
>>> implementation) - it causes problems counting things like how many
>>> replicas ACK a write, and what happens if the one you're replacing ACKs
>>> a write but the joining host doesn't? It's harder than it seems to
>>> maintain consistency guarantees in that model, because you have 2 nodes
>>> where either may end up becoming the sole true owner of the token, and
>>> you have to handle both cases where one of them fails.
>>>
>>> An easier option is to add it with new token set to old token +1 (as an
>>> expansion), then decom the leaving node (shrink). That'll minimize
>>> streaming when you decommission that node.
>>>
>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Sometimes we want to replace a node for various reasons. We can replace
>>>> a node by shutting down the old node and letting the new node stream
>>>> data from other replicas, but this approach may have availability or
>>>> data consistency issues if one more node in the same cluster goes down.
>>>> Why doesn't Cassandra support replacing a node without shutting down
>>>> the old one? Can we treat the new node as a normal node addition while
>>>> having exactly the same token ranges as the node to be replaced? After
>>>> the new node's joining process is complete, we just need to cut off the
>>>> old node. With this, we don't lose any availability and the token range
>>>> is not moved, so no cleanup is needed. Is there any downside to doing
>>>> this?
>>>>
>>>> Thanks,
>>>> Runtian
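To make the "old token - 1" arithmetic above concrete: with
Murmur3Partitioner, tokens live in the signed 64-bit range
[-2^63, 2^63 - 1], so subtracting 1 from the minimum token has to wrap
around to the maximum. A minimal sketch, assuming Murmur3 and vnodes; the
example token values are made up for illustration:

    # Derive initial_token values for the new node as (old_token - 1) for
    # each of the old node's vnode tokens, wrapping at the ring minimum.
    MIN_TOKEN = -(2**63)     # Murmur3Partitioner's minimum token
    MAX_TOKEN = 2**63 - 1    # ...and its maximum

    def token_minus_one(token: int) -> int:
        """Return token - 1 in Murmur3 token space, wrapping at the minimum."""
        return MAX_TOKEN if token == MIN_TOKEN else token - 1

    # Tokens as they might be copied from `nodetool info -T` on the old
    # node; these values are invented for the example.
    old_tokens = [-9223372036854775808, -42, 7331, 9000000000000000000]

    # cassandra.yaml accepts a comma-separated list for initial_token.
    new_tokens = sorted(token_minus_one(t) for t in old_tokens)
    print("initial_token: " + ",".join(map(str, new_tokens)))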





Re: Replacing node without shutting down the old node

2023-05-16 Thread Runtian Liu
Cool, thank you. This looks like a very good setup for us, and cleanup
should be very fast in this case.

On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa  wrote:

>
> In-line
>
> On May 15, 2023, at 5:26 PM, Runtian Liu  wrote:
>
> 
> Hi Jeff,
>
> I tried the setup with 16 vnodes and the NetworkTopologyStrategy
> replication strategy with replication factor 3 and 3 racks in one cluster.
> When using the new node token as the old node token - 1
>
>
> I had said +1 but you’re right that it’s actually -1, sorry about that.
> You want the new node to be lower than the existing host. The lower token
> will take most of the data.
>
> I see the new node is streaming from the old node only. And the decom
> phase of the old node is extremely fast. Does this mean the new node will
> only take data ownership from the old node?
>
>
> With exactly three racks, yes. With more racks or fewer racks, no.
>
> I also did some cleanups after replacing the node with old token - 1, and
> the cleanup sstable count was not increasing. It looks like adding a node
> with old_token - 1 and decommissioning the old node will not generate
> stale data on the rest of the cluster. Do you know of any edge cases in
> this replacement process that could generate stale data on other nodes of
> the cluster with the setup I mentioned?
>
>
> Should do exactly what you want. I’d still run cleanup but it should be a
> no-op.
>
>
> Thanks,
> Runtian
>
> On Mon, May 8, 2023 at 9:59 PM Runtian Liu  wrote:
>
>> I thought the joining node would not participate in quorum? How are we
>> counting things like how many replicas ACK a write when we are adding a
>> new node for expansion? The token ownership won't change until the new
>> node is fully joined, right?
>>
>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa  wrote:
>>
>>> You can't have two nodes with the same token (in the current metadata
>>> implementation) - it causes problems counting things like how many replicas
>>> ACK a write, and what happens if the one you're replacing ACKs a write but
>>> the joining host doesn't? It's harder than it seems to maintain consistency
>>> guarantees in that model, because you have 2 nodes where either may end up
>>> becoming the sole true owner of the token, and you have to handle both
>>> cases where one of them fails.
>>>
>>> An easier option is to add it with new token set to old token +1 (as an
>>> expansion), then decom the leaving node (shrink). That'll minimize
>>> streaming when you decommission that node.
>>>
>>>
>>>
>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu  wrote:
>>>
>>>> Hi all,
>>>>
>>>> Sometimes we want to replace a node for various reasons. We can replace
>>>> a node by shutting down the old node and letting the new node stream
>>>> data from other replicas, but this approach may have availability or
>>>> data consistency issues if one more node in the same cluster goes down.
>>>> Why doesn't Cassandra support replacing a node without shutting down
>>>> the old one? Can we treat the new node as a normal node addition while
>>>> having exactly the same token ranges as the node to be replaced? After
>>>> the new node's joining process is complete, we just need to cut off the
>>>> old node. With this, we don't lose any availability and the token range
>>>> is not moved, so no cleanup is needed. Is there any downside to doing
>>>> this?
>>>>
>>>> Thanks,
>>>> Runtian
>>>>
>>>
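Putting the sequence quoted above together (expand with old_token - 1
tokens, wait for the join, then shrink), here is a hedged sketch of the
operator-side steps. The subcommands `nodetool info -T`,
`nodetool decommission`, and `nodetool cleanup` are standard; the host
address and the output parsing are illustrative assumptions, not values
from this thread:

    import subprocess

    # Sketch of the expand-then-shrink replacement discussed in this thread.
    OLD_NODE = "10.0.0.1"  # hypothetical address of the node being replaced
    MIN_TOKEN, MAX_TOKEN = -(2**63), 2**63 - 1

    def nodetool(host, *args):
        return subprocess.check_output(["nodetool", "-h", host, *args],
                                       text=True)

    # 1. Collect the old node's vnode tokens from `nodetool info -T`
    #    (one "Token : <value>" line per vnode; parsing is approximate).
    tokens = [int(line.split()[-1])
              for line in nodetool(OLD_NODE, "info", "-T").splitlines()
              if line.strip().startswith("Token")]

    # 2. Shift each token by -1, wrapping at the bottom of the ring, and
    #    configure the new node with the result before it bootstraps.
    new_tokens = sorted(MAX_TOKEN if t == MIN_TOKEN else t - 1
                        for t in tokens)
    print("add to the new node's cassandra.yaml, then start it:")
    print("initial_token: " + ",".join(map(str, new_tokens)))

    # 3. Once the new node shows UN in `nodetool status`, shrink and verify:
    #      nodetool -h 10.0.0.1 decommission
    #      nodetool cleanup   # on remaining nodes; expected to be a no-op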


Re: Replacing node without shutting down the old node

2023-05-16 Thread Jaydeep Chovatia
Hi Jeff,

Do you think this is a good workaround to have in Cassandra itself until we
have CEP-21 available, along with cleanup as part of compaction in Cassandra
itself? It can work as follows in Cassandra:

Step.1: Add a new flag in bootstrap, say
*-Dcopy_tokens_from=src_ip_address*. If set, then the newly joining node
will copy the tokens from *src_ip_address* and add -1 to each of them.
Step.2: Continue with the remaining bootstrap as is.
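As a rough illustration of Step.1 (purely hypothetical; no such flag exists
in Cassandra today), the option parsing and token shift could look roughly
like this, reusing the -1-with-wraparound rule from the sketches above. The
helper names are invented:

    import re
    import sys

    # Hypothetical handling of the proposed -Dcopy_tokens_from flag; not an
    # existing Cassandra option, and all names here are invented.
    def parse_copy_tokens_from(jvm_args):
        """Extract the source address from -Dcopy_tokens_from=..., if set."""
        for arg in jvm_args:
            m = re.fullmatch(r"-Dcopy_tokens_from=(.+)", arg)
            if m:
                return m.group(1)
        return None

    def shifted_tokens(src_tokens):
        """Step.1: copy the source node's tokens and add -1 to each,
        wrapping at the bottom of the Murmur3 token range."""
        lo, hi = -(2**63), 2**63 - 1
        return sorted(hi if t == lo else t - 1 for t in src_tokens)

    if __name__ == "__main__":
        src = parse_copy_tokens_from(sys.argv[1:])
        if src is None:
            print("flag not set; bootstrap proceeds as usual (Step.2)")
        else:
            # A real implementation would fetch the tokens from the source
            # node; here we only demonstrate the shift on sample values.
            print(src, shifted_tokens([-2**63, 0, 12345]))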

Thoughts?

Jaydeep

On Tue, May 16, 2023 at 10:23 AM Runtian Liu  wrote:

> Cool, thank you. This looks like a very good setup for us, and cleanup
> should be very fast in this case.
>
> On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa  wrote:
>
>>
>> In-line
>>
>> On May 15, 2023, at 5:26 PM, Runtian Liu  wrote:
>>
>> 
>> Hi Jeff,
>>
>> I tried the setup with 16 vnodes and the NetworkTopologyStrategy
>> replication strategy with replication factor 3 and 3 racks in one cluster.
>> When using the new node token as the old node token - 1
>>
>>
>> I had said +1 but you’re right that it’s actually -1, sorry about that.
>> You want the new node to be lower than the existing host. The lower token
>> will take most of the data.
>>
>> I see the new node is streaming from the old node only. And the decom
>> phase of the old node is extremely fast. Does this mean the new node will
>> only take data ownership from the old node?
>>
>>
>> With exactly three racks, yes. With more racks or fewer racks, no.
>>
>> I also did some cleanups after replacing the node with old token - 1, and
>> the cleanup sstable count was not increasing. It looks like adding a node
>> with old_token - 1 and decommissioning the old node will not generate
>> stale data on the rest of the cluster. Do you know of any edge cases in
>> this replacement process that could generate stale data on other nodes of
>> the cluster with the setup I mentioned?
>>
>>
>> Should do exactly what you want. I’d still run cleanup but it should be a
>> no-op.
>>
>>
>> Thanks,
>> Runtian
>>
>> On Mon, May 8, 2023 at 9:59 PM Runtian Liu  wrote:
>>
>>> I thought the joining node would not participate in quorum? How are we
>>> counting things like how many replicas ACK a write when we are adding a
>>> new node for expansion? The token ownership won't change until the new
>>> node is fully joined, right?
>>>
>>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa  wrote:
>>>
>>>> You can't have two nodes with the same token (in the current metadata
>>>> implementation) - it causes problems counting things like how many
>>>> replicas ACK a write, and what happens if the one you're replacing ACKs
>>>> a write but the joining host doesn't? It's harder than it seems to
>>>> maintain consistency guarantees in that model, because you have 2 nodes
>>>> where either may end up becoming the sole true owner of the token, and
>>>> you have to handle both cases where one of them fails.
>>>>
>>>> An easier option is to add it with new token set to old token +1 (as an
>>>> expansion), then decom the leaving node (shrink). That'll minimize
>>>> streaming when you decommission that node.
>>>>
>>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Sometimes we want to replace a node for various reasons. We can
>>>>> replace a node by shutting down the old node and letting the new node
>>>>> stream data from other replicas, but this approach may have
>>>>> availability or data consistency issues if one more node in the same
>>>>> cluster goes down. Why doesn't Cassandra support replacing a node
>>>>> without shutting down the old one? Can we treat the new node as a
>>>>> normal node addition while having exactly the same token ranges as the
>>>>> node to be replaced? After the new node's joining process is complete,
>>>>> we just need to cut off the old node. With this, we don't lose any
>>>>> availability and the token range is not moved, so no cleanup is
>>>>> needed. Is there any downside to doing this?
>>>>>
>>>>> Thanks,
>>>>> Runtian