On Sun, 5 Dec 2021, 18.40 bened...@apache.org, <bened...@apache.org> wrote:
> > And at the end of the repair, this lower bound is known and stored
> > somewhere?
>
> Yes, there is a new system.paxos_repair_history table
>
> > Under good conditions, I assume the result of a paxos repair is that all
> > nodes received all LWT transactions from all other replicas?
>
> All in progress LWTs are flushed, essentially. They are either completed
> or invalidated. So there is a synchronisation point for the range being
> repaired, but there is no impact on any completed transactions. So even if
> paxos repair successfully sync’d all in progress transactions to every
> node, there could still be some past transactions that were persisted only
> to a majority of nodes, and these will be invisible to the paxos repair
> mechanism.

Cool. This clarifies.

> There is no transaction log today in Cassandra to sync, so repair of the
> underlying data table is still the only way to guarantee data is
> synchronised to every node.

It's not the transaction log as such that I'm missing. (Or it is, but I
understand there isn't one.) What is hard to wrap my head around is how a
given partition can participate in a successful Paxos transaction even if it
might be completely unaware of the previous transaction to the same
partition. At least this is how I've understood this conversation?

> CEP-15 will change this, so that nodes will be fully consistent up to some
> logical timestamp, but CEP-14 does not change the underlying semantics of
> LWTs and Paxos in Cassandra.

Yes, looking forward to that. I just wanted to check whether CEP-14 would
possibly contain some per-partition version of the same ideas.

But even with everything you've explained, did I understand correctly that
(focusing on a single partition and only LWT writes...) I can in any event
stream commit logs from a majority of replicas, merge them, and such a merged
log must contain all committed transactions to that partition? (And this
should have nothing to do with the repair, then?)
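To make that last question concrete, here is a minimal sketch of the
quorum-intersection argument behind it. It is a toy model, not Cassandra
code: the replica names, transaction ids and per-replica "log" sets are all
made up, and it simply assumes that every committed transaction was persisted
to some majority of replicas. It says nothing about what the commit log
actually contains.

```python
from itertools import combinations

# Toy model only: replica names, transaction ids and the "log" sets are
# hypothetical and do not correspond to any Cassandra data structure.
REPLICAS = ["r1", "r2", "r3", "r4", "r5"]


def majorities(replicas):
    """Yield every subset of replicas that forms a majority."""
    quorum_size = len(replicas) // 2 + 1
    for size in range(quorum_size, len(replicas) + 1):
        yield from combinations(replicas, size)


def merged_log(logs, quorum):
    """Union of the transaction ids recorded by the replicas in `quorum`."""
    merged = set()
    for replica in quorum:
        merged |= logs[replica]
    return merged


# Assumption: each committed transaction reached some majority of replicas,
# possibly a different majority each time.
commits = {
    "tx1": {"r1", "r2", "r3"},
    "tx2": {"r3", "r4", "r5"},
    "tx3": {"r1", "r4", "r5"},
}

# What each replica has seen, derived from the commits above.
logs = {r: {tx for tx, holders in commits.items() if r in holders}
        for r in REPLICAS}

# Replicas that individually hold every committed transaction.
complete_replicas = [r for r in REPLICAS if logs[r] == set(commits)]

# Whether merging the logs of any majority recovers every transaction.
all_majorities_complete = all(
    merged_log(logs, quorum) == set(commits)
    for quorum in majorities(REPLICAS)
)
```

With five replicas any two majorities share at least one node, which is why
the merged view is complete in this model even though no single replica is.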

Henrik

> From: Henrik Ingo <henrik.i...@datastax.com>
> Date: Sunday, 5 December 2021 at 11:45
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: Paxos repairs in CEP-14
>
> On Sun, 5 Dec 2021, 1.45 bened...@apache.org, <bened...@apache.org> wrote:
>
> > > As the repair is only guaranteed for a majority of replicas, I assume I
> > > can discover somewhere which replicas are up to date like this?
> >
> > I’m not quite sure what you mean. Do you mean which nodes have
> > participated in a paxos repair? This information isn’t maintained, but
> > anyway would not imply the node is up to date. A node participating in a
> > paxos repair ensures _a majority of other nodes_ are up-to-date with
> > _its_ knowledge, give or take.
>
> Ah, thanks for clarifying. Indeed I was assuming the paxos repair happens
> the opposite way.
>
> > By performing this on a majority of nodes, we ensure a majority of
> > replicas has a lower bound on the knowledge of a majority, and we
> > effectively invalidate any in-progress operations on any minority that
> > did not participate.
>
> And at the end of the repair, this lower bound is known and stored
> somewhere?
>
> > > Do I understand correctly, that if I take a backup from such a replica,
> > > it is guaranteed to contain the full state up to a certain timestamp t?
> >
> > No, you would need to also perform regular repair afterwards. If you
> > perform a regular repair, by default it will now be preceded by a paxos
> > repair (which is typically very quick), so this will in fact hold, but
> > paxos repair won’t enforce it.
>
> Ok, so I'm trying to understand this...
>
> At the end of a Paxos repair, it is guaranteed that each LWT transaction
> has arrived at a majority of replicas. However, it's still not guaranteed
> that any single node would contain all transactions, because it could have
> been in a minority partition for some transactions. Correct so far?
>
> Under good conditions, I assume the result of a paxos repair is that all
> nodes received all LWT transactions from all other replicas? If some node
> is unavailable, that same node will be missing a bunch of transactions that
> it didn't receive repairs for?
>
> I'm thinking through this as I type, but I guess where I'm going is: in the
> universe of possible future work, does there exist a not-too-complex
> modification to CEP-14 where:
>
> 1. Node 1 concludes that a majority of its replicas appear to be available,
>    and does its best to send all of its repairs to all of the replicas in
>    that majority set.
>
> 2. Node 2 is able to learn that Node 1 successfully sent all of its repair
>    writes to this set, and makes an attempt to do the same. If there are
>    replicas in the set that it can't reach, they can be subtracted from the
>    set, but the set still needs to contain a majority of replicas in the
>    end.
>
> 3. At the end of all nodes doing the above, we would be left with a
>    majority set of nodes that are known to - each individually - contain
>    all LWT transactions up to the timestamp t.
>
> 4. A benefit of 3: A node N is not in the above majority set. It can now
>    repair itself by communicating with a single node from the majority set,
>    and copy its transaction log up to timestamp t. After doing so, it can
>    join the majority set, as it now contains all transactions up to t.
>
> 5. For a longer outage it may not be possible for node N to ever catch up
>    by replaying a serial transaction log. (Including for the reason an old
>    enough log may no longer be available.) In this case traditional
>    streaming repair would still be used.
>
> Based on your first reply, I guess none of the above is strictly needed to
> achieve the use case I outlined (backup, point in time restore,
> streaming...). It seems I'm attracted by the potential for simplicity of a
> setup where traditional repair is only needed as a fallback option.
> (Ultimately it's needed to bootstrap empty nodes anyway, so it wouldn't go
> away.)
>
> > > Does the replica also end up with a complete and continuous log of all
> > > writes until t? If not, does a merge of all logs in the majority
> > > contain a complete log?
> >
> > A majority. There is also no log that gets replicated for LWTs in
> > Cassandra. There is only ever at most one transaction that is in flight
> > (and that may complete) and whose result has not been persisted to some
> > majority, for any key. Paxos repair + repair means the result of the
> > implied log are replicated to all participants.
>
> I understand that Cassandra's LWT replication isn't based on replicating a
> single log. However I'm interested to understand whether it would be
> possible to end up with such a log as an outcome of the Paxos
> replication/repair process, since such a log can have other uses.
>
> Even with all of the above, I'm still left wondering: does the repair
> process (with the above modification, say) result in a node having all
> writes that happened before t, or is it only guaranteed to have the most
> recent value for each primary key?
>
> Henrik
>
> > From: Henrik Ingo <henrik.i...@datastax.com>
> > Date: Saturday, 4 December 2021 at 23:12
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Paxos repairs in CEP-14
> >
> > Could someone elaborate on this section
> >
> > ****
> >
> > *Paxos Repair*
> > We will introduce a new repair mechanism, that can be run with or without
> > regular repair.
> > This mechanism will:
> >
> >    - Track, per-replica, transactions that have been witnessed as
> >      initiated but have not been seen to complete
> >    - For a majority of replicas complete (either by invalidating,
> >      completing, or witnessing something newer) all operations they have
> >      witnessed as incomplete prior to the initiation of repair
> >    - Globally invalidate all promises issued prior to the most recent
> >      paxos repair
> >
> > ****
> >
> > Specific questions:
> >
> > Assuming a table only using these LWTs
> >
> > * As the repair is only guaranteed for a majority of replicas, I assume I
> >   can discover somewhere which replicas are up to date like this?
> >
> > * Do I understand correctly, that if I take a backup from such a replica,
> >   it is guaranteed to contain the full state up to a certain timestamp t?
> >   (And in addition may or may not contain mutations higher than t, which
> >   of course could overwrite the value the same key had at t.)
> >
> > * Does the replica also end up with a complete and continuous log of all
> >   writes until t? If not, does a merge of all logs in the majority
> >   contain a complete log? In particular, I'm trying to parse the
> >   significance of "or witnessing something newer"? (Use case for this
> >   last question could be point in time restore, aka continuous backup, or
> >   also streaming writes to a downstream system.)
> >
> > henrik
> > --
> > Henrik Ingo
> > +358 40 569 7354