Hi Henrik,

> While I understand they are out of scope, do you happen to have already some
> idea what it would require to support secondary indexes?
Yes, it is likely that the approach will be the same as that taken by Calvin-like systems, where a "reconnaissance" round is performed within the local DC to construct a transaction involving the secondary index. The reverse applies when reading from a secondary index: the primary keys would be determined via a reconnaissance round and the transaction updated to include them. If we choose to implement one of the more sophisticated interactive transaction proposals, then it would of course be possible to implement secondary indexes on top of these.

Note that all of this is entirely independent of SAI. Since these indexes are built per-partition, they will be easily transactional within a partition key, and probably never transactional if you perform a scatter-gather across the whole cluster. I'm not sufficiently well versed in SAI to really consider this properly as yet, and I will update the CEP to note that they are out of scope.

> Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
> discussions = 7 ms

I think SkewMax is likely to be much lower than this, even on commodity hardware. Bear in mind that, unlike Cockroach and Spanner, correctness does not depend on this value, only performance. So we can pick the real number, not some p100 outlier value.

Also bear in mind that this is an optimisation. In clusters where it makes no sense we can simply use the raw protocol and accept that transactions will very infrequently take two round-trips (which is fine, because in this scenario round-trips are cheap).

> A known optimization for the hot rows problem is to "hint" or manually force
> clients to direct all updates to the hot row to the same node

So, with a leaderless protocol like Accord the ordering decisions are never really bottlenecked: no matter how many are in-flight, a new transaction will experience no additional latency determining its execution order. The only bottleneck will be execution.
For this it is absolutely possible to funnel everything to a single coordinator, but I don't know that this would in practice achieve much. The important bottleneck would be that the coordinators are all within the same DC, so that the _replicas_ may all respond to them with their data dependencies with minimal delay.

This is something we discussed in the ApacheCon call, as it happens. If a significant number of transactions are pending, and they are in different DCs, it would be quite straightforward to nominate a coordinator within the DC serving the majority of operations to serve the remainder, and to forward the results to the original coordinators. I don't anticipate this optimisation being a high priority until we have user reports of this bottleneck in the wild, however. Since clients for many workloads will naturally be geo-partitioned so that related state is being updated from the same region, it might simply not be needed, at least any time soon.

From: Henrik Ingo <henrik.i...@datastax.com>
Date: Friday, 1 October 2021 at 14:38
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions

Hi Benedict

Since you asked, I reviewed the thread a bit and found this...

*secondary indexes*

>> What I would like to understand better and without guessing is, what do
>> these transactions look like from a client/user point of view?

> This is a fair question, and perhaps something I should pinpoint more
> directly for the reader. The CEP does stipulate non-interactive
> transactions, i.e. those that are one-shot. The only other limitation is
> that the partition keys must be known upfront. However, I expect we will
> follow up soon after with some weaker semantics that build on top (probably
> using optimistic concurrency control) to support transactions where only
> some partition keys are known upfront, so that we may support global
> secondary indexes with proper isolation and consistency.
The CEP doesn't actually mention lack of support for secondary index queries. Probably good to add as a limitation. (I realize using secondary indexes currently isn't mainstream in Cassandra anyway, but with SASI in 4.0 and SAI being a separate CEP in discussion, it's good to call out that Accord wouldn't automatically support them.)

While I understand they are out of scope, do you happen to have already some idea what it would require to support secondary indexes? Is it sufficient to just include the secondary index keys (or a range of such) in the "deps" of the transaction? Of course, still needing to also include the partitions or rows actually read as a result of scanning the secondary index. Similarly then for mutations, deps would have to include changes to index keys in the transaction?

*commit latency*

A topic in some off-list discussions has been to understand the implications of using a Spanner-inspired approach, where the clock skew between cluster nodes is a necessary part of the commit latency:

Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C′, P) | C′ ∈ C) − Latency(C, P)

In the white paper you even explicitly mention the trade-off you have chosen: *"This technique trades wide area round-trips for an additional latency penalty equal to the bounds on clock synchrony."*

If we try to quantify what this trade-off means in practice, I get:

Typical value for SkewMax in e.g. the Spanner paper and some CockroachDB discussions = 7 ms. Maybe 10-20 ms if you don't have Google-level hardware.

Common network latencies in a globally distributed cluster:
US West - East = 60 ms
US East - EU Central = 100 ms
US/EU to APAC, Africa, LATAM = 100-200 ms
Source: https://www.cloudping.co/grid

The conclusion is that this trade-off definitely makes sense for globally distributed transactions. This resembles QUORUM writes in current Cassandra. However, users commonly prefer LOCAL_QUORUM in current Cassandra.
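To make the arithmetic concrete, here is a small script plugging the figures quoted above into the deadline formula. The numbers are illustrative only, and treating Latency(C, P) as a fixed one-way delay in milliseconds is an assumption of this sketch, not something the CEP specifies:

```python
# Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C', P) for C' in C) - Latency(C, P)

def deadline_delay_ms(skew_max_ms, latencies_ms, coordinator):
    """Extra wait beyond t0 that a proposal from `coordinator` incurs."""
    return skew_max_ms + max(latencies_ms.values()) - latencies_ms[coordinator]

# Globally distributed cluster (approximate one-way delays, per cloudping.co)
global_lat = {"us-east": 0, "us-west": 60, "eu-central": 100}
# Single-region cluster: sub-millisecond latencies between AZs
local_lat = {"az-1": 0, "az-2": 1, "az-3": 1}

# With SkewMax = 7 ms, the skew term is small next to WAN latency...
print(deadline_delay_ms(7, global_lat, "us-east"))  # 107
# ...but dominates in a single region, where max latency is ~1 ms.
print(deadline_delay_ms(7, local_lat, "az-1"))      # 8
```

In the global case the 7 ms skew penalty is minor compared to the 100 ms wide-area round-trip it saves; in the single-region case it is the dominant term, which is exactly the concern raised below.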
I read that this was discussed in the phone call, but haven't read about a specific proposal. Just for the sake of completing my math, let's assume that some LOCAL_QUORUM style Accord commit is invented. A naive example could be to simply deploy a Cassandra cluster *with Accord transactions* in a single geographical region, while other geographical regions would be served by some external replication mechanism and would have to be read-only.

Whatever the (hypothetical) solution, for LOCAL_QUORUM style or just single-region commits we end up with:

Typical SkewMax = 7 - 20 ms
Network latency < 1 ms

It seems the SkewMax is quite high for a cluster deployed in a single region, and what's worse, there's no way to avoid it or make it much smaller than 7 ms? The only solution that comes to mind while writing this is to design Accord to be pluggable, such that the consensus part could be switched to something that uses a logical clock for the transaction id. The user would choose one or the other depending on what they optimize for.

I'll finish with a few notes:

Commit latency in itself isn't categorically bad for performance. I've worked with several implementations of distributed databases that provide good throughput even when a single write has high latency due to geography/speed of light. However, the duration of a commit is the window during which other transactions may conflict with the committing transaction. Thus commit latency will either increase the likelihood of aborted transactions or, in other concurrency mechanisms, block and impose a max throughput for hot rows.

A known optimization for the hot rows problem is to "hint" or manually force clients to direct all updates to the hot row to the same node, essentially making the system leader-based. This allows the database to start processing new updates even while the first one is still committing.
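A minimal sketch of that hinting technique: the client deterministically hashes the hot partition key to one coordinator, so successive updates to the same row queue behind a single node and can pipeline instead of conflicting. All names here are hypothetical; this is only an illustration of the client-side routing idea, not anything in the CEP:

```python
import hashlib

def coordinator_for(partition_key: str, coordinators: list) -> str:
    """Deterministically route a partition key to one coordinator.

    Every client computes the same hash, so all updates to a given hot
    row funnel through the same node without any central agreement.
    """
    digest = hashlib.sha256(partition_key.encode()).digest()
    return coordinators[int.from_bytes(digest[:8], "big") % len(coordinators)]

nodes = ["node-a", "node-b", "node-c"]
# Every update to the same hot row lands on the same coordinator...
assert coordinator_for("hot-row-42", nodes) == coordinator_for("hot-row-42", nodes)
# ...while other keys still spread across the cluster.
print(coordinator_for("hot-row-42", nodes))
```

Note this only makes the system leader-based per key from the client's perspective; the underlying protocol is unchanged.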
(See Galera for an example implementing this
<https://galeracluster.com/library/documentation/using-sr.html#usr-hot-records>.)

This makes me wonder whether there is a similar optimization for Accord, where transactions from the same coordinator can be allowed to commit within the SkewMax window, because we can assume that transaction timestamps originating at the same coordinator cannot arrive out of order when using TPC?

henrik

On Mon, Sep 27, 2021 at 11:59 PM bened...@apache.org <bened...@apache.org> wrote:

> Ok, it’s time for the weekly poking of the hornet’s nest.
>
> Any more thoughts, questions or criticisms, anyone?
>
> From: bened...@apache.org <bened...@apache.org>
> Date: Friday, 24 September 2021 at 22:41
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>
> I’m not aware of anybody having taken any notes, but somebody please chime
> in if I’m wrong.
>
> From my recollection, re Accord:
>
> * Q: Will batches now support rollbacks?
>   * Batches would apply atomically or not, but are unlikely to have a
>     concept of rollback. Timeouts remain unknown, but we hope to have some
>     mechanism to provide clients a definitive answer about such
>     transactions after the fact.
> * Q: Can stale replicas participate in transactions?
>   * Accord applies conflicting transactions in order at every replica, so
>     only nodes that are up to date may participate in the execution of a
>     transaction, but any replica may participate in agreeing a transaction.
>     To ensure replicas remain up to date I anticipate introducing a
>     real-time repair facility at the transactional message level, with
>     peers reconciling recently processed messages and cross-delivering any
>     that are missing.
> * Possible UX directions in very vague terms: CQL atomic and conditional
>   batches initially; going forwards interactive transactions? Complex user
>   defined functions? SQL?
> * Discussed possibility of LOCAL_QUORUM reads for globally replicated
>   transactional tables, as this is an important use case
> * Simple stale reads to transactional tables
> * Brainstormed a bit about serializable reads to a single DC without
>   (normally) crossing WAN
> * Discussed possibility of multiple ACKs providing separate LAN and WAN
>   persistence notifications to clients
> * Discussed size of fast path quorums in Accord, and how this might affect
>   global latency in high RF clusters (i.e. not optimal, and in some cases
>   may need every DC to participate), and how this can be modified by
>   biasing the fast path electorate so that 2 of the 3 DCs may reach
>   fast-path decisions with each other (the remaining DC having to reach
>   both those DCs to reach the fast path). Also discussed Calvin-like modes
>   of operation that would offer optimal global latency for sufficiently
>   small clusters at RF=3 or RF=5.
>
> I’m sure there were other discussions I can’t remember; perhaps others can
> fill in the blanks.
>
> From: Jonathan Ellis <jbel...@gmail.com>
> Date: Friday, 24 September 2021 at 20:28
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>
> Does anyone have notes for those of us who couldn't make the call?
>
> On Wed, Sep 22, 2021 at 1:35 PM bened...@apache.org <bened...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> > 4pm BST to discuss Accord and other things in the community. There are
> > no plans to make any kind of project decisions. Everyone is welcome to
> > drop in to discuss Accord or whatever else might be on your mind.
> > https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
> >
> > From: bened...@apache.org <bened...@apache.org>
> > Date: Wednesday, 22 September 2021 at 16:22
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >
> > No, I would expect to deliver strict serializable interactive
> > transactions using Accord. These would simply corroborate that the
> > participating keys had not modified their write timestamps during the
> > final transaction. These could even be undertaken with still only a
> > single wide-area round-trip, using local copies of the data to assemble
> > the transaction (though this would marginally increase the chance of
> > aborts).
> >
> > My goal for MVCC is parallelism, not additional isolation levels (though
> > snapshot isolation is useful and we’ll probably also want to offer that
> > eventually).
> >
> > From: Henrik Ingo <henrik.i...@datastax.com>
> > Date: Wednesday, 22 September 2021 at 15:15
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >
> > On Wed, Sep 22, 2021 at 7:56 AM bened...@apache.org
> > <bened...@apache.org> wrote:
> >
> > > Could you explain why you believe this trade-off is necessary? We can
> > > support full SQL just fine with Accord, and I hope that we eventually
> > > do so.
> >
> > I assume this is really referring to interactive transactions = multiple
> > round trips to the client within a transaction.
> >
> > You mentioned previously we could later build a more MVCC-like
> > transaction semantic on top of Accord. (Independent reads from a single
> > snapshot, followed by a commit using Accord.)
> > In this case I think the relevant discussion is whether Accord is still
> > the optimal building block performance-wise to do so, or whether users
> > would then have a lower consistency level but still pay the performance
> > cost of a stricter consistency level.
> >
> > henrik
> >
> > --
> > Henrik Ingo
> > +358 40 569 7354
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

--
Henrik Ingo
+358 40 569 7354