Hi Henrik,

> While I understand they are out of scope, do you happen to have already some
> idea what it would require to support secondary indexes?
Yes, it is likely that the approach will be the same as that taken by Calvin-like systems, where a "reconnaissance" round is performed within the local DC to construct a transaction involving the secondary index. The reverse applies when reading from a secondary index: the primary keys would be determined via a reconnaissance round and the transaction updated to include them. If we choose to implement one of the more sophisticated interactive transaction proposals, then it would of course be possible to implement secondary indexes on top of these.

Note that all of this is entirely independent of SAI. Since these indexes are built per-partition, they will be easily transactional within a partition key, and probably never transactional if you perform a scatter-gather across the whole cluster. I'm not sufficiently well versed in SAI to really consider this properly as yet, and I will update the CEP to note that they are out of scope.

> Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
> discussions = 7 ms

I think SkewMax is likely to be much lower than this, even on commodity hardware. Bear in mind that, unlike Cockroach and Spanner, correctness does not depend on this value, only performance. So we can pick the real number, not some p100 outlier value.

Also bear in mind that this is an optimisation. In clusters where it makes no sense we can simply use the raw protocol and accept that transactions will very infrequently take two round-trips (which is fine, because in this scenario round-trips are cheap).

> A known optimization for the hot rows problem is to "hint" or manually force
> clients to direct all updates to the hot row to the same node

So, with a leaderless protocol like Accord the ordering decisions are never really bottlenecked: no matter how many are in-flight, a new transaction will experience no additional latency determining its execution order. The only bottleneck will be execution.
For this it is absolutely possible to funnel everything to a single coordinator, but I don't know that this would in practice achieve much. The important bottleneck would be that the coordinators are all within the same DC, so that the _replicas_ may all respond to them with their data dependencies with minimal delay.

This is something we discussed in the ApacheCon call, as it happens. If a significant number of transactions are pending, and they are in different DCs, it would be quite straightforward to nominate a coordinator within the DC serving the majority of operations to serve the remainder, and to forward the results to the original coordinators. I don't anticipate this optimisation being a high priority until we have user reports of this bottleneck in the wild, however. Since clients for many workloads will naturally be geo-partitioned so that related state is being updated from the same region, it might simply not be needed, at least any time soon.

From: Henrik Ingo <henrik.i...@datastax.com>
Date: Friday, 1 October 2021 at 14:38
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions

Hi Benedict

Since you asked, I reviewed the thread a bit and found this...

*secondary indexes*

>> What I would like to understand better and without guessing is, what do
>> these transactions look like from a client/user point of view?

> This is a fair question, and perhaps something I should pinpoint more
> directly for the reader. The CEP does stipulate non-interactive
> transactions, i.e. those that are one-shot. The only other limitation is
> that the partition keys must be known upfront. However, I expect we will
> follow up soon after with some weaker semantics that build on top (probably
> using optimistic concurrency control) to support transactions where only
> some partition keys are known upfront, so that we may support global
> secondary indexes with proper isolation and consistency.
The CEP doesn't actually mention lack of support for secondary index queries. Probably good to add as a limitation. (I realize using secondary indexes currently isn't mainstream in Cassandra anyway, but with SASI in 4.0 and SAI being a separate CEP in discussion, it's good to call out that Accord wouldn't automatically support them.)

While I understand they are out of scope, do you happen to have already some idea what it would require to support secondary indexes? Is it sufficient to just include the secondary index keys (or a range of such) in the "deps" of the transaction? Of course, still needing to also include the partitions or rows actually read as a result of scanning the secondary index. Similarly then for mutations, deps would have to include changes to index keys in the transaction?

*commit latency*

A topic in some off-list discussions has been to understand the implications of using a Spanner-inspired approach, where the clock skew between cluster nodes is a necessary part of the commit latency:

Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C′, P) | C′ ∈ C) − Latency(C, P)

In the white paper you even explicitly mention the trade-off you have chosen: *"This technique trades wide area round-trips for an additional latency penalty equal to the bounds on clock synchrony."*

If we try to quantify what this trade-off means in practice, I get:

Typical value for SkewMax in e.g. the Spanner paper and some CockroachDB discussions = 7 ms. Maybe 10-20 ms if you don't have Google-level hardware.

Common network latencies in a globally distributed cluster:
US West - East = 60 ms
US East - EU Central = 100 ms
US/EU to APAC, Africa, LATAM = 100-200 ms
Source: https://www.cloudping.co/grid

The conclusion is that this trade-off definitely makes sense for globally distributed transactions. This resembles QUORUM writes in current Cassandra. However, users commonly prefer LOCAL_QUORUM in current Cassandra.
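To make the arithmetic concrete, here is a small script plugging the figures quoted above into the deadline formula. The numbers are illustrative only, and treating Latency(C, P) as a fixed one-way delay in milliseconds is an assumption of this sketch, not something the CEP specifies:

```python
# Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C', P) for C' in C) - Latency(C, P)

def deadline_delay_ms(skew_max_ms, latencies_ms, coordinator):
    """Extra wait beyond t0 that a proposal from `coordinator` incurs."""
    return skew_max_ms + max(latencies_ms.values()) - latencies_ms[coordinator]

# Globally distributed cluster (approximate one-way delays, per cloudping.co)
global_lat = {"us-east": 0, "us-west": 60, "eu-central": 100}
# Single-region cluster: sub-millisecond latencies between AZs
local_lat = {"az-1": 0, "az-2": 1, "az-3": 1}

# With SkewMax = 7 ms, the skew term is small next to WAN latency...
print(deadline_delay_ms(7, global_lat, "us-east"))  # 107
# ...but dominates in a single region, where max latency is ~1 ms.
print(deadline_delay_ms(7, local_lat, "az-1"))      # 8
```

In the global case the 7 ms skew penalty is minor compared to the 100 ms wide-area round-trip it saves; in the single-region case it is the dominant term, which is exactly the concern raised below.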
I read that this was discussed in the phone call, but haven't read about a specific proposal. Just for the sake of completing my math, let's assume that some LOCAL_QUORUM style Accord commit is invented. A naive example could be to simply deploy a Cassandra cluster *with Accord transactions* in a single geographical region, while other geographical regions would be served by some external replication mechanism and would have to be read-only.

Whatever the (hypothetical) solution, for LOCAL_QUORUM style or just single-region commits we end up with:

Typical SkewMax = 7 - 20 ms
Network latency < 1 ms

It seems the SkewMax is quite high for a cluster deployed in a single region, and what's worse, there's no way to avoid it or make it much smaller than 7 ms? The only solution that comes to mind while writing this is to design Accord to be pluggable, such that the consensus part could be switched to something that uses a logical clock for the transaction id. The user would choose one or the other depending on what they optimize for.

I'll finish with a few notes:

Commit latency in itself isn't categorically bad for performance. I've worked with several implementations of distributed databases that provide good throughput even when a single write has high latency due to geography/speed of light. However, the duration of a commit is the window during which other transactions may conflict with the committing transaction. Thus commit latency will either increase the likelihood of aborted transactions or, in other concurrency mechanisms, block and impose a max throughput for hot rows.

A known optimization for the hot rows problem is to "hint" or manually force clients to direct all updates to the hot row to the same node, essentially making the system leader-based. This allows the database to start processing new updates even while the first one is still committing.
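A minimal sketch of that hinting technique: the client deterministically hashes the hot partition key to one coordinator, so successive updates to the same row queue behind a single node and can pipeline instead of conflicting. All names here are hypothetical; this is only an illustration of the client-side routing idea, not anything in the CEP:

```python
import hashlib

def coordinator_for(partition_key: str, coordinators: list) -> str:
    """Deterministically route a partition key to one coordinator.

    Every client computes the same hash, so all updates to a given hot
    row funnel through the same node without any central agreement.
    """
    digest = hashlib.sha256(partition_key.encode()).digest()
    return coordinators[int.from_bytes(digest[:8], "big") % len(coordinators)]

nodes = ["node-a", "node-b", "node-c"]
# Every update to the same hot row lands on the same coordinator...
assert coordinator_for("hot-row-42", nodes) == coordinator_for("hot-row-42", nodes)
# ...while other keys still spread across the cluster.
print(coordinator_for("hot-row-42", nodes))
```

Note this only makes the system leader-based per key from the client's perspective; the underlying protocol is unchanged.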
(See Galera for an example implementing this
<https://galeracluster.com/library/documentation/using-sr.html#usr-hot-records>.)

This makes me wonder whether there is a similar optimization for Accord, where transactions from the same coordinator can be allowed to commit within the SkewMax window, because we can assume that transaction timestamps originating at the same coordinator cannot arrive out of order when using TPC?

henrik

On Mon, Sep 27, 2021 at 11:59 PM bened...@apache.org <bened...@apache.org> wrote:

> Ok, it’s time for the weekly poking of the hornet’s nest.
>
> Any more thoughts, questions or criticisms, anyone?
>
> From: bened...@apache.org <bened...@apache.org>
> Date: Friday, 24 September 2021 at 22:41
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>
> I’m not aware of anybody having taken any notes, but somebody please chime
> in if I’m wrong.
>
> From my recollection, re Accord:
>
> * Q: Will batches now support rollbacks?
>   * Batches would apply atomically or not, but are unlikely to have a
>     concept of rollback. Timeouts remain unknown, but we hope to have some
>     mechanism to provide clients a definitive answer about such
>     transactions after the fact.
> * Q: Can stale replicas participate in transactions?
>   * Accord applies conflicting transactions in order at every replica, so
>     only nodes that are up to date may participate in the execution of a
>     transaction, but any replica may participate in agreeing a transaction.
>     To ensure replicas remain up to date I anticipate introducing a
>     real-time repair facility at the transactional message level, with
>     peers reconciling recently processed messages and cross-delivering any
>     that are missing.
> * Possible UX directions in very vague terms: CQL atomic and conditional
>   batches initially; going forwards interactive transactions? Complex user
>   defined functions? SQL?
> * Discussed possibility of LOCAL_QUORUM reads for globally replicated
>   transactional tables, as this is an important use case
> * Simple stale reads to transactional tables
> * Brainstormed a bit about serializable reads to a single DC without
>   (normally) crossing WAN
> * Discussed possibility of multiple ACKs providing separate LAN and WAN
>   persistence notifications to clients
> * Discussed size of fast path quorums in Accord, and how this might affect
>   global latency in high RF clusters (i.e. not optimal, and in some cases
>   may need every DC to participate), and how this can be modified by
>   biasing the fast path electorate so that 2 of the 3 DCs may reach
>   fast-path decisions with each other (the remaining DC having to reach
>   both those DCs to reach the fast path). Also discussed Calvin-like modes
>   of operation that would offer optimal global latency for sufficiently
>   small clusters at RF=3 or RF=5.
>
> I’m sure there were other discussions I can’t remember; perhaps others can
> fill in the blanks.
>
> From: Jonathan Ellis <jbel...@gmail.com>
> Date: Friday, 24 September 2021 at 20:28
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>
> Does anyone have notes for those of us who couldn't make the call?
>
> On Wed, Sep 22, 2021 at 1:35 PM bened...@apache.org <bened...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> > 4pm BST to discuss Accord and other things in the community. There are
> > no plans to make any kind of project decisions. Everyone is welcome to
> > drop in to discuss Accord or whatever else might be on your mind.
> > https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
> >
> > From: bened...@apache.org <bened...@apache.org>
> > Date: Wednesday, 22 September 2021 at 16:22
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >
> > No, I would expect to deliver strict serializable interactive
> > transactions using Accord. These would simply corroborate that the
> > participating keys had not modified their write timestamps during the
> > final transaction. These could even be undertaken with still only a
> > single wide-area round-trip, using local copies of the data to assemble
> > the transaction (though this would marginally increase the chance of
> > aborts).
> >
> > My goal for MVCC is parallelism, not additional isolation levels (though
> > snapshot isolation is useful and we’ll probably also want to offer that
> > eventually).
> >
> > From: Henrik Ingo <henrik.i...@datastax.com>
> > Date: Wednesday, 22 September 2021 at 15:15
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >
> > On Wed, Sep 22, 2021 at 7:56 AM bened...@apache.org
> > <bened...@apache.org> wrote:
> >
> > > Could you explain why you believe this trade-off is necessary? We can
> > > support full SQL just fine with Accord, and I hope that we eventually
> > > do so.
> >
> > I assume this is really referring to interactive transactions = multiple
> > round trips to the client within a transaction.
> >
> > You mentioned previously we could later build a more MVCC-like
> > transaction semantic on top of Accord. (Independent reads from a single
> > snapshot, followed by a commit using Accord.)
> > In this case I think the relevant discussion is whether Accord is still
> > the optimal building block performance-wise to do so, or whether users
> > would then have a lower consistency level but still pay the performance
> > cost of a stricter consistency level.
> >
> > henrik
> >
> > --
> > Henrik Ingo
> > +358 40 569 7354
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

--
Henrik Ingo
+358 40 569 7354