Hi Jeff,

I tried the setup with 16 vnodes and NetworkTopologyStrategy with replication factor 3, using 3 racks in one cluster. When I set each token of the new node to the corresponding old node token - 1, the new node streams from the old node only, and the decom phase of the old node is extremely fast. Does this mean the new node only takes data ownership from the old node?

I also ran cleanup after replacing the node with old token - 1, and the cleanup SSTable count did not increase. So it looks like adding a node with old_token - 1 and then decommissioning the old node does not generate stale data on the rest of the cluster. Do you know of any edge cases in this replacement process that could generate stale data on other nodes of the cluster with the setup I mentioned?
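To make the question concrete, here is a rough sketch of the token arithmetic in Python. It is a toy model, not Cassandra code: it assumes a node owns the range (previous_token, token], looks at primary ownership only (it ignores replication factor 3 and rack placement), and uses made-up tokens and node names. The `initial_token_line` helper assumes the new node's tokens are supplied through the `initial_token` setting in cassandra.yaml, which is only one way this could be done.

```python
from bisect import bisect_left


def shifted_tokens(old_tokens, offset=-1):
    """Tokens for the replacement node: each old token plus offset."""
    return sorted(t + offset for t in old_tokens)


def initial_token_line(tokens):
    """Format tokens as a comma-separated initial_token line (assumed config path)."""
    return "initial_token: " + ",".join(str(t) for t in tokens)


def primary_owner(ring, key_token):
    """Return the node owning key_token, where a node owns (previous_token, token]."""
    ring = sorted(ring)
    tokens = [t for t, _ in ring]
    # First token >= key_token owns it; wrap to the first node past the last token.
    i = bisect_left(tokens, key_token)
    return ring[i % len(ring)][1]


# Hypothetical three-node ring with one token per node, to keep the output short.
before = [(0, "a"), (100, "old"), (200, "c")]
new_tokens = shifted_tokens([100])                      # [99]
joined = before + [(t, "new") for t in new_tokens]      # new node has joined
after = [(t, n) for t, n in joined if n != "old"]       # old node decommissioned

print(initial_token_line(new_tokens))
for key in (50, 99, 100, 150):
    print(key, primary_owner(before, key), primary_owner(joined, key),
          primary_owner(after, key))
```

In this toy model the new node takes over everything the old node owned except the single-token range (old_token - 1, old_token], and that one-token range moves to the next node on the ring when the old node is decommissioned. Whether that matters in practice with RF 3 and rack-aware placement is exactly the edge case I am asking about.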
Thanks,
Runtian

On Mon, May 8, 2023 at 9:59 PM Runtian Liu <curly...@gmail.com> wrote:

> I thought the joining node would not participate in quorum? How are we
> counting things like how many replicas ACK a write when we are adding a new
> node for expansion? The token ownership won't change until the new node is
> fully joined right?
>
> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> You can't have two nodes with the same token (in the current metadata
>> implementation) - it causes problems counting things like how many replicas
>> ACK a write, and what happens if the one you're replacing ACKs a write but
>> the joining host doesn't? It's harder than it seems to maintain consistency
>> guarantees in that model, because you have 2 nodes where either may end up
>> becoming the sole true owner of the token, and you have to handle both
>> cases where one of them fails.
>>
>> An easier option is to add it with new token set to old token +1 (as an
>> expansion), then decom the leaving node (shrink). That'll minimize
>> streaming when you decommission that node.
>>
>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu <curly...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Sometimes we want to replace a node for various reasons, we can replace
>>> a node by shutting down the old node and letting the new node stream data
>>> from other replicas, but this approach may have availability issues or data
>>> consistency issues if one more node in the same cluster went down. Why
>>> Cassandra doesn't support replacing a node without shutting down the old
>>> one? Can we treat the new node as normal node addition while having exactly
>>> the same token ranges as the node to be replaced. After the new node's
>>> joining process is complete, we just need to cut off the old node. With
>>> this, we don't lose any availability and the token range is not moved so no
>>> clean up is needed. Is there any downside of doing this?
>>>
>>> Thanks,
>>> Runtian