In case that came across differently: most of the things I wrote about are 
things that CEP-21 _enables_ us to do. 

But CEP-21 itself will just (more or less) make cluster operations 
consistent; the rest are features that we will implement on top of it. We 
will need people to adopt new tooling to make most of the operations I 
describe available.

On Thu, Oct 20, 2022, at 4:42 PM, Alex Petrov wrote:
> > by default C* does prohibit concurrent bootstraps (behaviour which can be 
> > overridden with the cassandra.consistent.rangemovement system property). 
> > But there's nothing to stop you fully bootstrapping additional nodes in 
> > series, then removing them in the same way.
> 
> I think there are several important areas where CEP-21 actually might be 
> helpful. Right now, in a 5-node cluster with RF=3, each node holds the 
> range between its predecessor's token and its own, along with RF-1 ranges 
> replicated from its neighbours. 
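> 
> To make that concrete, here is a minimal sketch of how replica sets fall 
> out of the ring today (SimpleStrategy-style placement; the node names and 
> the ring itself are illustrative):
> 
> ```java
> import java.util.*;
> 
> // Illustrative only: with 5 nodes and RF=3, each node replicates its own
> // range plus the RF-1 = 2 ranges owned by its predecessors on the ring.
> public class RingSketch {
>     public static void main(String[] args) {
>         List<String> ring = List.of("A", "B", "C", "D", "E"); // token order
>         int rf = 3;
>         for (int i = 0; i < ring.size(); i++) {
>             // The range ending at node i's token is replicated on that
>             // node and the next rf - 1 nodes clockwise.
>             List<String> replicas = new ArrayList<>();
>             for (int j = 0; j < rf; j++)
>                 replicas.add(ring.get((i + j) % ring.size()));
>             System.out.printf("range (%s, %s]: %s%n",
>                     ring.get((i - 1 + ring.size()) % ring.size()),
>                     ring.get(i), replicas);
>         }
>     }
> }
> ```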
> 
> What CEP-21 will allow us to do is to make _some_ RF-sized subset of the 5 
> nodes we have in the cluster the owners of an arbitrary range. That will 
> _also_ mean that you can add a 6th node that owns nothing at first, 
> bootstrap it as a participant in the read/write quorums of the same ranges 
> node A is a read/write replica of, and, in the next step, remove A as a 
> read/write replica.
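> 
> As a rough sketch of the shape of that transition (the placement map and 
> its operations here are hypothetical, not CEP-21 interfaces):
> 
> ```java
> import java.util.*;
> 
> // Hypothetical sketch: once placements are explicit data rather than a
> // function of tokens, a range's replica set can be transitioned directly.
> public class PlacementSketch {
>     public static void main(String[] args) {
>         Map<String, Set<String>> placements = new HashMap<>();
>         placements.put("(t1, t2]", new LinkedHashSet<>(List.of("A", "B", "C")));
> 
>         // Step 1: the new node A' joins the read/write quorums for A's
>         // range, so quorums temporarily span {A, A', B, C}.
>         placements.get("(t1, t2]").add("A'");
> 
>         // Step 2: once A' has streamed the data, A is removed as a
>         // read/write replica, restoring RF=3.
>         placements.get("(t1, t2]").remove("A");
> 
>         System.out.println(placements); // {(t1, t2]=[B, C, A']}
>     }
> }
> ```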
> 
> I believe such an approach would still be incredibly costly (i.e. you 
> would have to re-stream the entire data set), but if other means are 
> available for sharing disks or sstables that would lower the cost for you, 
> this might even work as a lower-risk upgrade option, even though I think 
> most operators won't be using this. What could be widely beneficial is 
> having the ability to test a new version as a canary in write-survey mode, 
> and then add it as a read replica, but only for a small subset of data 
> (effectively decreasing availability of that particular range by extending 
> its RF).
> 
> > What you will be able to do post CEP-21 is run concurrent bootstraps of 
> > nodes which don't share ranges
> 
> I think we can do even better: we can take an arbitrary range and split it 
> into N parts, effectively making all N parts bootstrappable in parallel. I 
> also think (though I haven't checked whether that's truly the case) that 
> we can prepare a plan that allows executing `StartJoin` for all nodes 
> while the range is locked, but blocks execution of `MidJoin` for any of 
> the nodes until `StartJoin` has executed for all of them and, similarly, 
> holds back `FinishJoin` until `MidJoin` has executed for all of the nodes. 
> In other words, I think there is a bit of room for flexibility here; the 
> question is which way will be the most beneficial. 
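> 
> To illustrate the barrier structure I have in mind (a sketch only; 
> `StartJoin`/`MidJoin`/`FinishJoin` are the step names from above, while 
> the executor itself is invented):
> 
> ```java
> import java.util.List;
> import java.util.concurrent.*;
> 
> // Sketch of the phase barriers: every node completes StartJoin before any
> // node runs MidJoin, and every MidJoin before any FinishJoin. Only the
> // barrier structure matters here; the executor is purely illustrative.
> public class PhasedJoinSketch {
>     enum Step { START_JOIN, MID_JOIN, FINISH_JOIN }
> 
>     static void runPhase(ExecutorService pool, List<String> nodes, Step step)
>             throws InterruptedException {
>         CountDownLatch done = new CountDownLatch(nodes.size());
>         for (String node : nodes)
>             pool.submit(() -> {
>                 System.out.println(step + " on " + node); // apply step here
>                 done.countDown();
>             });
>         done.await(); // barrier: no node advances until all finish the phase
>     }
> 
>     public static void main(String[] args) throws InterruptedException {
>         List<String> joining = List.of("n1", "n2", "n3");
>         ExecutorService pool = Executors.newFixedThreadPool(joining.size());
>         for (Step step : Step.values())
>             runPhase(pool, joining, step);
>         pool.shutdown();
>     }
> }
> ```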
> 
> On Thu, Oct 20, 2022, at 3:33 PM, Sam Tunnicliffe wrote:
>> > Add A' to the cluster with the same keyspace as A.
>> 
>> Can you clarify what you mean here?
>> 
>> > Currently these operations have to be performed in sequence.  My 
>> > understanding is that you can't add more than one node at a time.  
>> 
>> To ensure consistency guarantees are honoured, by default C* does prohibit 
>> concurrent bootstraps (behaviour which can be overridden with the 
>> cassandra.consistent.rangemovement system property). But there's nothing to 
>> stop you fully bootstrapping additional nodes in series, then removing them 
>> in the same way.
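>> 
>> For reference, the default behaviour behind that property looks roughly 
>> like this (a sketch, not verbatim source; the property name is real, the 
>> class is illustrative):
>> 
>> ```java
>> // Sketch: consistent range movement is on by default; starting a node
>> // with -Dcassandra.consistent.rangemovement=false disables the check, at
>> // the cost of the guarantees concurrent bootstraps would otherwise have.
>> public class RangeMovementFlag {
>>     public static void main(String[] args) {
>>         boolean consistent = Boolean.parseBoolean(
>>                 System.getProperty("cassandra.consistent.rangemovement", "true"));
>>         System.out.println("consistent range movement: " + consistent);
>>     }
>> }
>> ```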
>> 
>> Why you would want to do this, or use bootstrap and remove for this at 
>> all rather than upgrading in place, isn't clear to me though; doing it 
>> this way just adds a streaming overhead that doesn't otherwise exist.
>> 
>> What you will be able to do post CEP-21 is run concurrent bootstraps of 
>> nodes which don't share ranges. This is a definite improvement on the 
>> status quo, but it's only an initial step. CEP-21 is intended to lay the 
>> foundations for further improvements down the line.
>> 
>> 
>>> On 20 Oct 2022, at 14:04, Claude Warren, Jr via dev 
>>> <dev@cassandra.apache.org> wrote:
>>> 
>>> My understanding of our process is (assuming we have 3 nodes A,B,C):
>>>  * Add A' to the cluster with the same keyspace as A.
>>>  * Remove A from the cluster.
>>>  * Add B' to the cluster
>>>  * Remove B from the cluster
>>>  * Add C' to the cluster
>>>  * Remove C from the cluster.
>>> Currently these operations have to be performed in sequence.  My 
>>> understanding is that you can't add more than one node at a time.  What we 
>>> would like to do is do this in 3 steps:
>>>  * Add A', B', C' to the cluster.
>>>  * Wait for all 3 to be accepted and functioning.
>>>  * Remove A, B, C from the cluster.
>>> Does CEP-21 make this possible?
>>> 
>>> On Thu, Oct 20, 2022 at 1:43 PM Sam Tunnicliffe <s...@beobal.com> wrote:
>>>> I'm not sure I 100% understand the question, but the things covered in 
>>>> CEP-21 won't enable you, as an operator, to bootstrap all your new 
>>>> nodes without fully joining and then perform an atomic CAS to replace 
>>>> the existing members. CEP-21 alone also won't solve all cross-version 
>>>> streaming issues, which is one reason performing topology-modifying 
>>>> operations like bootstrap & decommission during an upgrade is not 
>>>> generally considered a good idea.
>>>> 
>>>> Transactional metadata will make the bootstrapping (and decommissioning) 
>>>> experience a whole lot more stable and predictable, so in the short term 
>>>> I would expect the recommended rolling approach to upgrades to improve 
>>>> significantly. 
>>>> 
>>>> 
>>>> > On 20 Oct 2022, at 12:24, Claude Warren, Jr via dev 
>>>> > <dev@cassandra.apache.org> wrote:
>>>> > 
>>>> > After CEP-21, would it be possible to take a cluster of 6 nodes, spin 
>>>> > up 6 new nodes to duplicate the 6 existing nodes, and then spin down 
>>>> > the original 6 nodes?  Basically, I am thinking of the case where a 
>>>> > cluster is running version x.y.z and wants to run x.y.z+1: can they 
>>>> > spin up an equal number of x.y.z+1 systems and replace the old ones 
>>>> > without shutting down the cluster?
>>>> > 
>>>> > We currently try something like this where we spin up 1 system and then 
>>>> > drop 1 system until all the old nodes are replaced.  This process 
>>>> > frequently runs into streaming failures while bootstrapping.
>>>> > 
>>>> > Any insights would be appreciated.
>>>> > 
>>>> > Claude
>>>> 
> 
