Re: Paxos repairs in CEP-14

Henrik Ingo Sun, 05 Dec 2021 11:24:30 -0800

On Sun, 5 Dec 2021, 18.40 bened...@apache.org, <bened...@apache.org> wrote:


> > And at the end of the repair, this lower bound is known and stored
> > somewhere?
>
> Yes, there is a new system.paxos_repair_history table
>
> > Under good conditions, I assume the result of a paxos repair is that all
> > nodes received all LWT transactions from all other replicas?
>
> All in progress LWTs are flushed, essentially. They are either completed
> or invalidated. So there is a synchronisation point for the range being
> repaired, but there is no impact on any completed transactions. So even if
> paxos repair successfully sync’d all in progress transactions to every
> node, there could still be some past transactions that were persisted only
> to a majority of nodes, and these will be invisible to the paxos repair
> mechanism.


Cool. This clarifies.


> There is no transaction log today in Cassandra to sync, so repair of the
> underlying data table is still the only way to guarantee data is
> synchronised to every node.
>

It's not the transaction log as such that I'm missing. (Or it is, but I
understand there isn't one.) What is hard to wrap my head around is how a
given replica can participate in a successful Paxos transaction even if
it might be completely unaware of the previous transaction to the same
partition. At least this is how I've understood this conversation?


> CEP-15 will change this, so that nodes will be fully consistent up to some
> logical timestamp, but CEP-14 does not change the underlying semantics of
> LWTs and Paxos in Cassandra.
>

Yes, looking forward to that. I just wanted to check whether CEP-14 would
possibly contain some per-partition version of the same ideas.

But even with everything you've explained, did I understand correctly that
(focusing on a single partition and only LWT writes...) I can in any event
stream commit logs from a majority of replicas, merge them, and such a
merged log must contain all committed transactions to that partition? (And
this should have nothing to do with the repair, then?)
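
The quorum-intersection argument behind this question can be sketched as
follows. This is a toy model in Python, not Cassandra code: it assumes each
replica's "log" is simply the set of transaction ids it has persisted, and
that every committed transaction was persisted to at least a majority. The
replica names, transaction ids, and function names are all hypothetical.

```python
from itertools import combinations


def majorities(replicas):
    """Yield every subset of replicas that forms a strict majority."""
    quorum = len(replicas) // 2 + 1
    for size in range(quorum, len(replicas) + 1):
        for subset in combinations(replicas, size):
            yield frozenset(subset)


def build_logs(replicas, committed_on):
    """Per-replica set of transaction ids (assumption: a log is just a set)."""
    return {
        r: {txn for txn, holders in committed_on.items() if r in holders}
        for r in replicas
    }


def every_majority_merge_is_complete(replicas, committed_on):
    """True if merging the logs of ANY majority yields every committed txn."""
    logs = build_logs(replicas, committed_on)
    all_txns = set(committed_on)
    for chosen in majorities(replicas):
        merged = set().union(*(logs[r] for r in chosen))
        if merged != all_txns:
            return False
    return True


# Three replicas; each transaction reached a different majority of two.
replicas = ["r1", "r2", "r3"]
committed_on = {
    "t1": {"r1", "r2"},
    "t2": {"r2", "r3"},
    "t3": {"r1", "r3"},
}

merge_ok = every_majority_merge_is_complete(replicas, committed_on)
some_replica_has_everything = any(
    log == set(committed_on)
    for log in build_logs(replicas, committed_on).values()
)
```

In this toy example every replica is missing one transaction, yet any two
replicas together hold all three, because two majorities of the same replica
set always share at least one member. The sketch says nothing about ordering
the merged entries, and it does not model how Cassandra's commit log is
actually laid out.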

Henrik



>
> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Sunday, 5 December 2021 at 11:45
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: Paxos repairs in CEP-14
> On Sun, 5 Dec 2021, 1.45 bened...@apache.org, <bened...@apache.org> wrote:
>
> > > As the repair is only guaranteed for a majority of replicas, I assume I
> > > can discover somewhere which replicas are up to date like this?
> >
> > I’m not quite sure what you mean. Do you mean which nodes have
> > participated in a paxos repair? This information isn’t maintained, but
> > anyway would not imply the node is up to date. A node participating in a
> > paxos repair ensures _a majority of other nodes_ are up-to-date with _its_
> > knowledge, give or take.
>
>
> Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
> the opposite way.
>
>
> > By performing this on a majority of nodes, we ensure a majority of replicas
> > has a lower bound on the knowledge of a majority, and we effectively
> > invalidate any in-progress operations on any minority that did not
> > participate.
>
>
> And at the end of the repair, this lower bound is known and stored
> somewhere?
>
>
> > > Do I understand correctly, that if I take a backup from such a replica,
> > > it is guaranteed to contain the full state up to a certain timestamp t?
> >
> > No, you would need to also perform regular repair afterwards. If you
> > perform a regular repair, by default it will now be preceded by a paxos
> > repair (which is typically very quick), so this will in fact hold, but
> > paxos repair won’t enforce it.
>
>
> Ok, so I'm trying to understand this...
>
> At the end of a Paxos repair, it is guaranteed that each LWT transaction
> has arrived at a majority of replicas. However, it's still not guaranteed
> that any single node would contain all transactions, because it could have
> been in a minority partition for some transactions. Correct so far?
>
> Under good conditions, I assume the result of a paxos repair is that all
> nodes received all LWT transactions from all other replicas? If some node
> is unavailable, that same node will be missing a bunch of transactions that
> it didn't receive repairs for?
>
>
> I'm thinking through this as I type, but I guess where I'm going is: in the
> universe of possible future work, does there exist a not-too-complex
> modification to CEP-14 where:
>
> 1. Node 1 concludes that a majority of its replicas appear to be available,
> and does its best to send all of its repairs to all of the replicas in that
> majority set.
>
> 2. Node 2 is able to learn that Node 1 successfully sent all of its repair
> writes to this set, and makes an attempt to do the same. If there are
> replicas in the set that it can't reach, they can be subtracted from the
> set, but the set still needs to contain a majority of replicas in the end.
>
> 3. At the end of all nodes doing the above, we would be left with a
> majority set of nodes that are known to - each individually - contain all
> LWT transactions up to the timestamp t.
>
> 4. A benefit of 3: A node N is not in the above majority set. It can now
> repair itself by communicating with a single node from the majority set,
> and copy its transaction log up to timestamp t. After doing so, it can join
> the majority set, as it now contains all transactions up to t.
>
> 5. For a longer outage it may not be possible for node N to ever catch up
> by replaying a serial transaction log. (Including for the reason an old
> enough log may no longer be available.) In this case traditional streaming
> repair would still be used.
>
> Based on your first reply, I guess none of the above is strictly needed to
> achieve the use case I outlined (backup, point in time restore,
> streaming...). It seems I'm attracted by the potential for simplicity of a
> setup where traditional repair is only needed as a fallback option.
> (Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
> away.)
>
>
>
>
>
> > > Does the replica also end up with a complete and continuous log of all
> > > writes until t? If not, does a merge of all logs in the majority contain a
> > > complete log?
> >
> > A majority. There is also no log that gets replicated for LWTs in
> > Cassandra. There is only ever at most one transaction that is in flight
> > (and that may complete) and whose result has not been persisted to some
> > majority, for any key. Paxos repair + repair means the results of the
> > implied log are replicated to all participants.
>
>
> I understand that Cassandra's LWT replication isn't based on replicating a
> single log. However I'm interested to understand whether it would be
> possible to end up with such a log as an outcome of the Paxos
> replication/repair process, since such a log can have other uses.
>
> Even with all of the above, I'm still left wondering: does the repair
> process (with the above modification, say) result in a node having all
> writes that happened before t, or is it only guaranteed to have the most
> recent value for each primary key?
>
>
> Henrik
>
> >
> > From: Henrik Ingo <henrik.i...@datastax.com>
> > Date: Saturday, 4 December 2021 at 23:12
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Paxos repairs in CEP-14
> > Could someone elaborate on this section
> >
> > ****
> >
> > *Paxos Repair*
> > We will introduce a new repair mechanism, that can be run with or without
> > regular repair. This mechanism will:
> >
> >    - Track, per-replica, transactions that have been witnessed as initiated
> >    but have not been seen to complete
> >    - For a majority of replicas complete (either by invalidating,
> >    completing, or witnessing something newer) all operations they have
> >    witnessed as incomplete prior to the initiation of repair
> >    - Globally invalidate all promises issued prior to the most recent paxos
> >    repair
> >
> > ****
> >
> > Specific questions:
> >
> > Assuming a table only using these LWTs:
> >
> > * As the repair is only guaranteed for a majority of replicas, I assume I
> > can discover somewhere which replicas are up to date like this?
> >
> > * Do I understand correctly, that if I take a backup from such a replica,
> > it is guaranteed to contain the full state up to a certain timestamp t?
> > (And in addition may or may not contain mutations higher than t, which of
> > course could overwrite the value the same key had at t.)
> >
> > * Does the replica also end up with a complete and continuous log of all
> > writes until t? If not, does a merge of all logs in the majority contain a
> > complete log? In particular, I'm trying to parse the significance of "or
> > witnessing something newer"? (Use case for this last question could be
> > point in time restore, aka continuous backup, or also streaming writes to a
> > downstream system.)
> >
> > henrik
> > --
> >
> > Henrik Ingo
> >
> > +358 40 569 7354
