Hi Jeff,

Do you think this would be a good workaround to have in Cassandra itself until we have CEP-21 <https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata> and cleanup as part of compaction available?

It can work as follows in Cassandra:

Step 1: Add a new flag in bootstrap, say *-Dcopy_tokens_from=<src_ip_address>*. If set, the newly joining node copies the tokens from *src_ip_address* and subtracts 1 from each of them.
Step 2: Continue with the remaining bootstrap as is.
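To make Step 1 concrete, here is a minimal sketch of the token derivation only. It is not Cassandra code: the function name and the collision check are hypothetical, and it assumes Murmur3Partitioner-style tokens (signed 64-bit integers) with wrap-around at the bottom of the range, which the real implementation might handle differently.

```python
# Assumption: tokens are signed 64-bit integers, as with Murmur3Partitioner.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def derive_replacement_tokens(src_tokens, ring_tokens=frozenset()):
    """Return old_token - 1 for every token owned by the source node.

    src_tokens:  tokens of the node being replaced (hypothetical input).
    ring_tokens: tokens already owned by other nodes, used only to reject
                 collisions, since two nodes cannot share a token.
    """
    taken = set(src_tokens) | set(ring_tokens)
    new_tokens = []
    for token in src_tokens:
        # Assumption: wrap to the top of the ring instead of underflowing.
        candidate = MAX_TOKEN if token == MIN_TOKEN else token - 1
        if candidate in taken:
            raise ValueError(f"token {candidate} is already owned")
        taken.add(candidate)
        new_tokens.append(candidate)
    return new_tokens


if __name__ == "__main__":
    # Made-up tokens for illustration; printed as a comma-separated list,
    # the same shape that initial_token in cassandra.yaml accepts.
    print(",".join(str(t) for t in derive_replacement_tokens([100, -5])))
```

The collision check is there because a source token that sits directly above another owned token would otherwise produce a duplicate, which the current metadata implementation cannot represent.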
Thoughts?

Jaydeep

On Tue, May 16, 2023 at 10:23 AM Runtian Liu <curly...@gmail.com> wrote:

> cool, thank you. This looks like a very good setup for us and cleanup
> should be very fast for this case.
>
> On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> In-line
>>
>> On May 15, 2023, at 5:26 PM, Runtian Liu <curly...@gmail.com> wrote:
>>
>> Hi Jeff,
>>
>> I tried the setup with vnode 16 and NetworkTopologyStrategy replication
>> strategy with replication factor 3 with 3 racks in one cluster. When using
>> the new node token as the old node token - 1
>>
>> I had said +1 but you’re right that it’s actually -1 , sorry about that.
>> You want the new node to be lower than the existing host. The lower token
>> will take most of the data.
>>
>> I see the new node is streaming from the old node only. And the decom
>> phase of the old node is extremely fast. Does this mean the new node will
>> only take data ownership from the old node?
>>
>> With exactly three racks, yes. With more racks or fewer racks, no.
>>
>> I also did some cleanups after replacing node with old token - 1 and the
>> cleanup sstable count was not increasing. Looks like adding a node with
>> old_token - 1 and decom the old node will not generate stale data on the
>> rest of the cluster. Do you know if there are any edge cases that in this
>> replacement process can generate any stale data on other nodes of the
>> cluster with the setup I mentioned?
>>
>> Should do exactly what you want. I’d still run cleanup but it should be a
>> no-op.
>>
>> Thanks,
>> Runtian
>>
>> On Mon, May 8, 2023 at 9:59 PM Runtian Liu <curly...@gmail.com> wrote:
>>
>>> I thought the joining node would not participate in quorum? How are we
>>> counting things like how many replicas ACK a write when we are adding a new
>>> node for expansion? The token ownership won't change until the new node is
>>> fully joined right?
>>>
>>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> You can't have two nodes with the same token (in the current metadata
>>>> implementation) - it causes problems counting things like how many replicas
>>>> ACK a write, and what happens if the one you're replacing ACKs a write but
>>>> the joining host doesn't? It's harder than it seems to maintain consistency
>>>> guarantees in that model, because you have 2 nodes where either may end up
>>>> becoming the sole true owner of the token, and you have to handle both
>>>> cases where one of them fails.
>>>>
>>>> An easier option is to add it with new token set to old token +1 (as an
>>>> expansion), then decom the leaving node (shrink). That'll minimize
>>>> streaming when you decommission that node.
>>>>
>>>> On Mon, May 8, 2023 at 7:19 PM Runtian Liu <curly...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Sometimes we want to replace a node for various reasons, we can
>>>>> replace a node by shutting down the old node and letting the new node
>>>>> stream data from other replicas, but this approach may have availability
>>>>> issues or data consistency issues if one more node in the same cluster
>>>>> went down. Why Cassandra doesn't support replacing a node without
>>>>> shutting down the old one? Can we treat the new node as normal node
>>>>> addition while having exactly the same token ranges as the node to be
>>>>> replaced. After the new node's joining process is complete, we just need
>>>>> to cut off the old node. With this, we don't lose any availability and
>>>>> the token range is not moved so no clean up is needed. Is there any
>>>>> downside of doing this?
>>>>>
>>>>> Thanks,
>>>>> Runtian