Cool, thank you. This looks like a very good setup for us, and cleanup should be very fast in this case.
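
For anyone following along, here is a rough Python sketch of why the new node only takes ownership from the old node in this setup. It is a simplified model of rack-aware placement (walk the ring clockwise and take the first node seen in each rack), not Cassandra's actual NetworkTopologyStrategy code, and it assumes RF 3 with exactly three racks. The node names, rack names, token values, and the `replicas` and `check_replacement` helpers are all made up for illustration.

```python
import random
from bisect import bisect_left

RF = 3  # assumed: replication factor equals the number of racks


def replicas(ring, key_token, rf=RF):
    """Return the replica set for a key in a simplified rack-aware model.

    ring is a list of (token, node, rack) sorted by token. A key belongs to
    the first token >= key_token (wrapping around), and we walk clockwise,
    keeping the first node seen in each rack until rf nodes are chosen.
    """
    tokens = [t for t, _, _ in ring]
    start = bisect_left(tokens, key_token) % len(ring)
    chosen, racks_seen = set(), set()
    for i in range(len(ring)):
        _, node, rack = ring[(start + i) % len(ring)]
        if rack not in racks_seen:
            racks_seen.add(rack)
            chosen.add(node)
            if len(chosen) == rf:
                break
    return chosen


def check_replacement(vnodes=16, seed=7):
    """Add a node at old_token - 1 for every vnode of n1, then compare replicas."""
    rng = random.Random(seed)
    nodes = [("n1", "r1"), ("n2", "r2"), ("n3", "r3"),
             ("n4", "r1"), ("n5", "r2"), ("n6", "r3")]
    # Tokens are multiples of 10, so old_token - 1 never collides with one.
    tokens = rng.sample(range(0, 100_000, 10), len(nodes) * vnodes)
    before = sorted((tokens[i * vnodes + j], node, rack)
                    for i, (node, rack) in enumerate(nodes)
                    for j in range(vnodes))
    old, new = "n1", "n1-new"
    old_tokens = {t for t, n, _ in before if n == old}
    # The new node joins in the same rack as the old node.
    after = sorted(before + [(t - 1, new, "r1") for t in old_tokens])

    moved = mismatches = 0
    for key in set(range(0, 100_000, 3)) | old_tokens:
        b, a = replicas(before, key), replicas(after, key)
        if key in old_tokens:
            expected = b  # the old node keeps only the single token it sits on
        else:
            expected = {new if n == old else n for n in b}
        moved += a != b
        mismatches += a != expected
    return moved, mismatches


if __name__ == "__main__":
    moved, mismatches = check_replacement()
    print(f"keys whose replica set changed: {moved}, unexpected changes: {mismatches}")
```

In this model every change to a replica set is the old node being swapped for the new one, so no other node gains or loses a range. That is consistent with the new node streaming only from the old node and with cleanup elsewhere being a no-op.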

On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa <jji...@gmail.com> wrote:

> In-line
>
> On May 15, 2023, at 5:26 PM, Runtian Liu <curly...@gmail.com> wrote:
>
> Hi Jeff,
>
> I tried the setup with vnode 16 and NetworkTopologyStrategy replication
> strategy with replication factor 3 with 3 racks in one cluster. When using
> the new node token as the old node token - 1
>
> I had said +1 but you’re right that it’s actually -1 , sorry about that.
> You want the new node to be lower than the existing host. The lower token
> will take most of the data.
>
> I see the new node is streaming from the old node only. And the decom
> phase of the old node is extremely fast. Does this mean the new node will
> only take data ownership from the old node?
>
> With exactly three racks, yes. With more racks or fewer racks, no.
>
> I also did some cleanups after replacing node with old token - 1 and the
> cleanup sstable count was not increasing. Looks like adding a node with
> old_token - 1 and decom the old node will not generate stale data on the
> rest of the cluster. Do you know if there are any edge cases that in this
> replacement process can generate any stale data on other nodes of the
> cluster with the setup I mentioned?
>
> Should do exactly what you want. I’d still run cleanup but it should be a
> no-op.
>
> Thanks,
> Runtian
>
> On Mon, May 8, 2023 at 9:59 PM Runtian Liu <curly...@gmail.com> wrote:
>
>> I thought the joining node would not participate in quorum? How are we
>> counting things like how many replicas ACK a write when we are adding a new
>> node for expansion? The token ownership won't change until the new node is
>> fully joined right?
>>
>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> You can't have two nodes with the same token (in the current metadata
>>> implementation) - it causes problems counting things like how many replicas
>>> ACK a write, and what happens if the one you're replacing ACKs a write but
>>> the joining host doesn't? It's harder than it seems to maintain consistency
>>> guarantees in that model, because you have 2 nodes where either may end up
>>> becoming the sole true owner of the token, and you have to handle both
>>> cases where one of them fails.
>>>
>>> An easier option is to add it with new token set to old token +1 (as an
>>> expansion), then decom the leaving node (shrink). That'll minimize
>>> streaming when you decommission that node.
>>>
>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu <curly...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Sometimes we want to replace a node for various reasons, we can replace
>>>> a node by shutting down the old node and letting the new node stream data
>>>> from other replicas, but this approach may have availability issues or data
>>>> consistency issues if one more node in the same cluster went down. Why
>>>> Cassandra doesn't support replacing a node without shutting down the old
>>>> one? Can we treat the new node as normal node addition while having exactly
>>>> the same token ranges as the node to be replaced. After the new node's
>>>> joining process is complete, we just need to cut off the old node. With
>>>> this, we don't lose any availability and the token range is not moved so no
>>>> clean up is needed. Is there any downside of doing this?
>>>>
>>>> Thanks,
>>>> Runtian