Re: Tradeoffs for Cassandra transaction management

bened...@apache.org Mon, 11 Oct 2021 15:10:43 -0700

Hi Jonathan,

I would appreciate it if you would respond to all of my email(s), as (at your 
insistence) I spend a great deal of time responding to you. Cherry-picking 
makes these conversations very difficult.


If we want to fully unpack this particular point, as far as I can tell claiming 
ANSI SQL would indeed require interactive transactions in which arbitrary 
conditional work may be performed by a client within a transaction in response 
to other actions within that transaction.

However:

  1.  The ANSI SQL standard permits these transactions to fail and roll back 
(e.g. in the event that your optimistic transaction fails). So if you want to 
be pedantic, you may modify my statement to “SQL does not necessitate support 
for abort-free interactive transactions” and we can leave it there.
  2.  I would personally consider “SQL support” to include the capability of 
defining arbitrary SQL stored procedures that may be executed by clients in an 
interactive session, or interactive sessions where the client must submit 
transactional scripts that may be arbitrarily complex and contingent on prior 
responses, but where each script must be executed within its own transaction. 
For many use cases this would constitute SQL support (and, indeed, I think it 
would cover every SQL use case I have encountered in my career).
  3.  Most importantly, as I pointed out in the previous email, Accord is 
compatible with a YugaByte/Cockroach-like approach, and indeed makes this 
approach easier to accomplish while enabling stronger isolation than the 
equivalent Raft-based approach. These approaches are able to reduce the number 
of conflicts, at a cost of significantly higher transaction management burden.

In summary we have all options on the table. Not only does CEP-15 not close any 
doors, it brings them all a step closer. If you have a strong opinion about 
which (if any) of these approaches we pursue post CEP-15, I would love to have 
this conversation. However, this should not block the adoption of CEP-15, since 
they are not in conflict.
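
To illustrate point 1 above, here is a minimal sketch of an interactive transaction layered on optimistic concurrency control: the client reads freely, buffers its writes, and at commit submits one atomic "validate my reads, then apply my writes" step that may abort. This is a toy single-process model, not Accord's API. `VersionedStore`, `InteractiveTransaction`, `commit_if_unchanged` and `TransactionAborted` are hypothetical names invented for the example, and the atomic commit step merely stands in for whatever single transaction would be handed to the underlying protocol.

```python
from dataclasses import dataclass, field


class TransactionAborted(Exception):
    """A key read by the transaction changed before it could commit."""


@dataclass
class VersionedStore:
    """Toy in-memory store mapping key -> (value, version). Hypothetical."""
    data: dict = field(default_factory=dict)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit_if_unchanged(self, read_versions, writes):
        # Validation and apply happen as one step here. In a real system this
        # step would be the single transaction given to the consensus layer.
        for key, version in read_versions.items():
            if self.read(key)[1] != version:
                raise TransactionAborted(key)
        for key, value in writes.items():
            self.data[key] = (value, self.read(key)[1] + 1)


class InteractiveTransaction:
    """Client-side session: reads go to the store, writes are buffered."""

    def __init__(self, store):
        self.store = store
        self.reads = {}    # key -> (value, version) as first observed
        self.writes = {}   # key -> new value, applied only at commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        if key not in self.reads:
            self.reads[key] = self.store.read(key)
        return self.reads[key][0]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        versions = {k: v[1] for k, v in self.reads.items()}
        self.store.commit_if_unchanged(versions, self.writes)


def decrement_inventory(store, key):
    """The inventory example from this thread, aborting on conflict."""
    txn = InteractiveTransaction(store)
    count = txn.read(key)
    if count > 0:                  # arbitrary client-side logic
        txn.write(key, count - 1)
    txn.commit()                   # may raise TransactionAborted
```

In this sketch the server never needs to know the client's logic: it only checks that everything the client observed is still current. The price is that a conflicting transaction is reported as failed rather than retried transparently, which is the "not abort-free" caveat in point 1.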


From: Jonathan Ellis <jbel...@gmail.com>
Date: Monday, 11 October 2021 at 22:20
To: dev <dev@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Benedict,

Yes, interactive transactions are a necessary part of SQL support (as
opposed to a tiny subset of SQL that matches CQL semantics, I don't know
any other way to make sense of your claim that "SQL does not necessitate
support for interactive transactions").

I still don't understand how you're saying we could implement interactive
transactions on top of a deterministic transaction manager.  In the other
thread you said that "Interactive transactions are possible on top of
Accord, as are transactions with an unknown read/write set. In each case
the only cost is that they would use optimistic concurrency control, which
is no worse than spanner derivatives anyway" but this is not correct,
interactive transactions are substantially more difficult to support than
transactions with unknown read/write set, as I outlined in the email to
kick off this thread.

On Sun, Oct 10, 2021 at 4:05 AM bened...@apache.org <bened...@apache.org>
wrote:

> Hi Jonathan,
>
> I will summarise my position below, that I have outlined at various points
> in the other thread, and then I would be interested to hear how you propose
> we move forwards. I will commit to responding the same day to any email I
> receive before 7pm GMT, and to engaging with each of your points. I would
> appreciate it if you could make similar commitments so that we may conclude
> this discussion in a reasonable time frame and conduct a vote on CEP-15.
>
> I also reiterate my standing invitation to an open video chat, to discuss
> anything you like, for as long as you like. Please nominate a suitable time
> and day.
>
> ==TL;DR==
> CEP-15 does not narrow our future options, it only broadens them. Accord
> is a distributed consensus protocol, so these techniques may build upon it
> without penalty. Alternatively, these approaches may simply live alongside
> Accord.
>
> Since these alternative approaches do not achieve the goals of the CEP,
> and this CEP only enhances your ability to pursue them, it seems hard to
> conclude it should not proceed.
>
> ==Goals==
> Our goals are first order principles: we want strict serializable
> cross-shard isolation that is highly available and can be scaled while
> maintaining optimal and predictable latency. Anything less, and the CEP is
> not achieved.
>
> As outlined already (except SLOG, which I address below), these
> alternative approaches do not achieve these goals.
>
> ==Compatibility with other approaches==
> 0. In general, research systems are not irreducible - they are an assembly
> of ideas that can be mixed together. Accord is a distributed consensus
> protocol. These other protocols may utilise it without penalty for
> consensus, in many cases obtaining improved characteristics. Conversely,
> Accord may itself directly integrate some of these ideas.
>
> 1. Cockroach, YugaByte, Dynamo et al utilize read and write intents, the
> same as outlined as a technique for interactive transactions with Accord.
> They manage these in a distributed state machine with per-shard consensus,
> permitting them to achieve serializable isolation. This same technique can
> be used with Accord, with the advantage that strict serializable isolation
> would be achievable. For simple transactions we would be able to execute
> with “pure” Accord and retain its execution advantage. Accord does not
> disadvantage this approach, it is only enhanced and made easier.
>
> 2. Calvin: Accord is broadly functionally equivalent, only leaderless,
> thereby achieving better global latency properties.
>
> 3. SLOG: This is essentially Calvin. The main modification is that we may
> assign data a home region, so that transactions may be faster if they
> participate in just one region, and slower if they involve multiple
> regions. Note that this protocol does not achieve global serializability
> without either losing consistency or availability under network partition
> or paying a WAN cost.
>
> In its consistent mode SLOG therefore remains slower than Accord for both
> single-home and multi-home transactions. Accord requires one WAN penalty
> for linearizing a transaction (competing transactions pay this cost
> simultaneously, as with SLOG), however this is achieved for global clients,
> whereas SLOG must cross the WAN multiple times for transactions initiated
> from outside their home, and for all multi-home transactions.
>
> As discussed elsewhere, a future optimisation with Accord is to
> temporarily “home” competing transactions for execution only, so that there
> is no additional WAN penalty when executing competing transactions. This
> would confer the same performance advantages as SLOG, without any of its
> penalties for multi-home transactions or heterogeneous latency
> characteristics, nor any of the complexities of re-homing data, thus
> avoiding these unpredictable performance characteristics.
>
> For those use cases that do not require high availability, it would be
> possible to implement a “home” region setup with Accord, as with SLOG. This
> is not an idea that is exclusive to this particular system. We even
> discussed this briefly in the call, as some use cases do indeed prefer this
> trade-off.
>
> SLOG additionally offers a kind of “home group” multi-home optimisation
> for clusters with many regions, that accept availability loss if fewer than
> half of their regions fail (e.g. in the paper 6 regions in pairs of 2 for
> availability). This is also exploitable by Accord, and something we can
> pursue as a future optimisation, as users explore such topologies in the
> real world.
>
> ==Responding to specific points==
>
> > because it was asserted in the CEP-15 thread that Accord could support
> > SQL by applying known techniques on top. This is mistaken. Deterministic
> > systems like Calvin or SLOG or Accord can support queries where the rows
> > affected are not known in advance using a technique that Abadi calls OLLP
>
> Language is hard and it is easy to conflate things. Here you seem to be
> discussing abort-free interactive transactions, not SQL. SQL does not
> necessitate support for interactive transactions, let alone abort-free
> ones. The technique you mention can support SQL scripts, and also
> interactive client transactions that may be aborted by the server. However,
> see [1] which may support all of these properties.
>
>
>
> From: Blake Eggleston <beggles...@apple.com.INVALID>
> Date: Sunday, 10 October 2021 at 05:17
> To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> Subject: Re: Tradeoffs for Cassandra transaction management
> 1. Is it worth giving up local latencies to get full global consistency?
> Most LWT use cases use LOCAL_SERIAL.
>
> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> that prevents performing consensus in one DC and replicating the writes to
> others. That’s not in scope for the initial work, but there’s no reason it
> couldn’t be handled as a follow on if needed. I agree with Jeff that
> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
> implications, but there are some valid use cases. For instance, you can
> enable an OLAP service to operate against another DC without impacting the
> primary, assuming the service can tolerate inconsistency for data written
> since the last repair, and there are some others.
>
> 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> This is a false dilemma. Today, we’re proposing a deterministic
> transaction design that addresses some very common user pain points. SQL
> addresses a different user pain point. If someone wants to add a SQL
> implementation in the future they can a) build it on top of accord b)
> extend or improve accord or c) implement a separate system. The right
> choice will depend on their goals, but accord won’t prevent work on it, the
> same way the original lwt design isn’t preventing work on multi-partition
> transactions. In the worst case, if the goals of a hypothetical sql project
> are different enough to make them incompatible with accord, I don’t see any
> reason why we couldn’t have 2 separate consensus systems, so long as people
> are willing to maintain them and the use cases and available technologies
> justify it.
>
> -Blake
>
> > On Oct 9, 2021, at 9:54 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >
> > Hi all,
> >
> > After calling several times for a broader discussion of goals and
> > tradeoffs around transaction management in the CEP-15 thread, I’ve put
> > together a short analysis to kick that off.
> >
> > Here is a table that summarizes the state of the art for distributed
> > transactions that offer serializability, i.e., a superset of what you can
> > get with LWT.  (The most interesting option that this eliminates is RAMP.)
> >
> > Since I'm not sure how this will render outside gmail, I've also uploaded
> > it here: https://imgur.com/a/SCZ8jex
> >
> > Spanner | Cockroach | Calvin/Fauna | SLOG (see below)
> >
> > Write latency
> >   Spanner: Global Paxos, plus 2pc for multi-partition.  For
> >     intercontinental replication this is 100+ms.  Cloud Spanner does not
> >     allow truly global deployments for this reason.
> >   Cockroach: Single-region Paxos, plus 2pc.  I’m not very clear on how
> >     this works but it results in non-strict serializability.  I didn’t
> >     find actual numbers for CR other than “2ms in a single AZ” which is
> >     not a typical scenario.
> >   Calvin/Fauna: Global Raft.  Fauna posts actual numbers of ~70ms in
> >     production which I assume corresponds to a multi-region deployment
> >     with all regions in the USA.  SLOG paper says true global Calvin is
> >     200+ms.
> >   SLOG: Single-region Paxos (common case) with fallback to multi-region
> >     Paxos.  Under 10ms.
> >
> > Scalability bottlenecks
> >   Spanner: Locks held during cross-region replication
> >   Cockroach: Same as Spanner
> >   Calvin/Fauna: OLLP approach required when PKs are not known in advance
> >     (mostly for indexed queries) -- results in retries under contention
> >   SLOG: Same as Calvin
> >
> > Read latency at serial consistency
> >   Spanner: Timestamp from Paxos leader (may be cross-region), then read
> >     from local replica.
> >   Cockroach: Same as Spanner, I think
> >   Calvin/Fauna: Same as writes
> >   SLOG: Same as writes
> >
> > Maximum serializability flavor
> >   Spanner: Strict
> >   Cockroach: Un-strict
> >   Calvin/Fauna: Strict
> >   SLOG: Strict
> >
> > Support for other isolation levels?
> >   Spanner: Snapshot
> >   Cockroach: No
> >   Calvin/Fauna: Snapshot (in Fauna)
> >   SLOG: Paper mentions dropping from strict-serializable to only
> >     serializable.  Probably could also support Snapshot like Fauna.
> >
> > Interactive transaction support (req’d for SQL)
> >   Spanner: Yes
> >   Cockroach: Yes
> >   Calvin/Fauna: No
> >   SLOG: No
> >
> > Potential for grafting onto C*
> >   Spanner: Nightmare
> >   Cockroach: Nightmare
> >   Calvin/Fauna: Reasonable, Calvin is relatively simple and the storage
> >     assumptions it makes are minimal
> >   SLOG: I haven’t thought about this enough. SLOG may require versioned
> >     storage, e.g. see this comment
> >     <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873>.
> >
> > (I have not included Accord here because it’s not sufficiently clear to me
> > how to create a full transaction manager from the Accord protocol, so I
> > can’t analyze many of the properties such a system would have.  The most
> > obvious solution would be “Calvin but with Accord instead of Raft”, but
> > since Accord already does some Calvin-like things that seems like it would
> > result in some suboptimal redundancy.)
> >
> > After putting the above together it seems to me that the two main areas of
> > tradeoff are,
> >
> > 1. Is it worth giving up local latencies to get full global consistency?
> > Most LWT use cases use LOCAL_SERIAL.  While all of the above have more
> > efficient designs than LWT, it’s still true that global serialization will
> > require 100+ms in the general case due to physical transmission latency.
> > So a design that allows local serialization with EC between regions, or a
> > design (like SLOG) that automatically infers a “home” region that can do
> > local consensus in the common case without giving up global
> > serializability, is desirable.
> >
> > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?  To be clear, these benefits
> > include very significant ones around simplicity of design, higher write
> > throughput, and (in SLOG) lower read and write latencies.
> >
> > I’ll doubleclick on #2 because it was asserted in the CEP-15 thread that
> > Accord could support SQL by applying known techniques on top.  This is
> > mistaken.  Deterministic systems like Calvin or SLOG or Accord can support
> > queries where the rows affected are not known in advance using a technique
> > that Abadi calls OLLP (Optimistic Lock Location Prediction), but this does
> > not help when the transaction logic is not known in advance.
> >
> > Here is Daniel Abadi’s explanation of OLLP from “An Overview of
> > Deterministic Database Systems
> > <https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false>:”
> >
> >   In practice, deterministic database systems that use ordered locking do
> >   not wait until runtime for transactions to determine their access-sets.
> >   Instead, they use a technique called OLLP where if a transaction does
> >   not know its access-sets in advance, it is not inserted into the input
> >   log. Instead, it is run in a trial mode that does not write to the
> >   database state, but determines what it would have read or written to if
> >   it was actually being processed. It is then annotated with the
> >   access-sets determined during the trial run, and submitted to the input
> >   log for actual processing. In the actual run, every replica processes
> >   the transaction deterministically, acquiring locks for the transaction
> >   based on the estimate from the trial run. In some cases, database state
> >   may have changed in a way that the access sets estimates are now
> >   incorrect. Since a transaction cannot read or write data for which it
> >   does not have a lock, it must abort as soon as it realizes that it
> >   acquired the wrong set of locks. But since the transaction is being
> >   processed deterministically at this point, every replica will
> >   independently come to the same conclusion that the wrong set of locks
> >   were acquired, and will all independently decide to abort the
> >   transaction. The transaction then gets resubmitted to the input log with
> >   the new access-set estimates annotated.
> >
> > Clearly this does not work if the server-visible logic changes between
> > runs.  For instance, consider this simple interactive transaction:
> >
> >     cursor.execute("BEGIN TRANSACTION")
> >     count = cursor.execute("SELECT count FROM inventory WHERE id = 1").result[0]
> >     if count > 0:
> >         cursor.execute("UPDATE inventory SET count = count - 1 WHERE id = 1")
> >     cursor.execute("COMMIT TRANSACTION")
> >
> > The first problem is that it’s far from clear how to do a “trial run” of a
> > transaction that the server only knows pieces of at a time.  But even
> > worse, the server only knows that it got either a SELECT, or a SELECT
> > followed by an UPDATE.  It doesn’t know anything about the logic that
> > would drive a change in those statements.  So if the value read changes
> > between trial run and execution, there is no possibility of transparently
> > retrying, you’re just screwed and have to report failure.
> >
> > So Abadi concludes,
> >
> >   [A]ll recent [deterministic database] implementations have limited or no
> >   support for interactive transactions, thereby preventing their use in
> >   many existing deployments. If the advantages of deterministic database
> >   systems will be realized in the coming years, one of two things must
> >   occur: either database users must accept a stored procedure interface to
> >   the system [instead of client-side SQL], or additional research must be
> >   performed in order to enable improved support for interactive
> >   transactions.
> >
> > TLDR:
> >
> > We need to decide if we want to give users local transaction latencies,
> > either with an approach inspired by SLOG or with tuneable serializability
> > like LWT (trading away global consistency).  I think the answer here is
> > clearly Yes, we have abundant evidence from LWT that people care a great
> > deal about latency, and specifically that they are willing to live with
> > cross-datacenter eventual consistency to get low local latencies.
> >
> > We also need to decide if we eventually want to support full SQL.  I think
> > this one is less clear, there are strong arguments both ways.
> >
> > P.S. SLOG deserves more attention. Here are links to the paper
> > <http://www.vldb.org/pvldb/vol12/p1747-ren.pdf>, Abadi’s writeup
> > <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html>,
> > and Murat Demirbas’s reading group compares SLOG to something called Ocean
> > Vista that I’ve never heard of but which reminds me of Accord
> > <http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html>.
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
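
The OLLP procedure quoted above from Abadi's article can be sketched roughly as follows. This is a toy, single-process illustration under stated assumptions, not code from Calvin, SLOG or Accord: the names `RecordingView`, `trial_run`, `execute`, `run_with_ollp` and the `interleave` hook are all hypothetical, a plain dict stands in for database state, and a set of allowed keys stands in for the locks acquired from the trial run's estimate. On a mismatch the sketch simply repeats the trial run to obtain a fresh estimate before resubmitting.

```python
class AccessSetMismatch(Exception):
    """The transaction touched a key outside its predicted access set."""


class RecordingView:
    """View over a dict that records touched keys and buffers writes."""

    def __init__(self, state, allowed=None):
        self.state = state
        self.allowed = allowed   # None means trial mode: any key may be touched
        self.accessed = set()
        self.writes = {}

    def _touch(self, key):
        if self.allowed is not None and key not in self.allowed:
            raise AccessSetMismatch(key)
        self.accessed.add(key)

    def get(self, key):
        self._touch(key)
        return self.writes.get(key, self.state.get(key))

    def put(self, key, value):
        self._touch(key)
        self.writes[key] = value


def trial_run(txn, state):
    """Run txn without writing to state; return the keys it would access."""
    view = RecordingView(state)
    txn(view)
    return frozenset(view.accessed)   # buffered writes are discarded


def execute(txn, state, access_set):
    """Run txn for real, aborting if it strays outside access_set."""
    view = RecordingView(state, allowed=access_set)
    txn(view)                         # raises before any write is applied
    state.update(view.writes)


def run_with_ollp(txn, state, interleave=None, max_attempts=5):
    """Trial run, then actual run; resubmit with a new estimate on mismatch.

    interleave(state, attempt) is a test hook that simulates other
    transactions changing state between the trial run and the actual run.
    """
    for attempt in range(1, max_attempts + 1):
        access_set = trial_run(txn, state)
        if interleave is not None:
            interleave(state, attempt)
        try:
            execute(txn, state, access_set)
            return attempt
        except AccessSetMismatch:
            continue
    raise RuntimeError("gave up after repeated access-set mismatches")
```

The retry works here only because `txn` is a function the server can re-run in full, i.e. the stored-procedure model. In the interactive example quoted above the server sees individual statements and not the client's `if count > 0` logic, so there is nothing it could re-run.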


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
