Re: [DISCUSS] CEP-15: General Purpose Transactions

Jonathan Ellis Wed, 29 Sep 2021 18:18:32 -0700

How are interactive transactions possible with Accord?



On Tue, Sep 21, 2021 at 11:56 PM bened...@apache.org <bened...@apache.org>
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>
> This domain is incredibly complex, so it is easy to reach wrong
> conclusions. I would invite you again to propose a system for discussion
> that you think offers something Accord is unable to, and that you consider
> desirable, and we can work from there.
>
> To pre-empt some possible discussions, I am not aware of anything we
> cannot do with Accord that we could do with either Calvin or Spanner.
> Interactive transactions are possible on top of Accord, as are transactions
> with an unknown read/write set. In each case the only cost is that they
> would use optimistic concurrency control, which is no worse than Spanner
> derivatives anyway (which I have to assume is your benchmark in this
> regard). I do not expect to deliver either functionality initially, but
> Accord takes us most of the way there for both.
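
As a rough illustration of the optimistic concurrency control mentioned above, here is a minimal sketch of the generic technique. It is not Accord's API: `VersionedStore`, `commit_if_unchanged` and `transfer` are hypothetical names, and the store is assumed to run the validate-and-apply step as one atomic, non-interactive transaction. The client reads interactively, remembers the versions it saw, and retries if validation fails.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Hypothetical stand-in for a store offering one-shot atomic transactions."""
    data: dict = field(default_factory=dict)      # key -> value
    versions: dict = field(default_factory=dict)  # key -> version counter

    def read(self, key):
        # Return the value together with the version the client observed.
        return self.data.get(key), self.versions.get(key, 0)

    def commit_if_unchanged(self, observed, writes):
        # Assumed to execute atomically as a single non-interactive transaction:
        # validate every observed version, then apply all writes.
        for key, version in observed.items():
            if self.versions.get(key, 0) != version:
                return False  # someone else wrote in between: conflict
        for key, value in writes.items():
            self.data[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1
        return True


def transfer(store, src, dst, amount, max_retries=5):
    """Interactive read phase, then an optimistic one-shot commit with retry."""
    for _ in range(max_retries):
        src_balance, src_version = store.read(src)
        dst_balance, dst_version = store.read(dst)
        if (src_balance or 0) < amount:
            raise ValueError("insufficient funds")
        committed = store.commit_if_unchanged(
            observed={src: src_version, dst: dst_version},
            writes={src: src_balance - amount, dst: (dst_balance or 0) + amount},
        )
        if committed:
            return True
    return False
```

The cost is the one named in the quoted text: under contention the commit can fail validation and the client must redo its reads.
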
>
>
> From: Jonathan Ellis <jbel...@gmail.com>
> Date: Wednesday, 22 September 2021 at 05:36
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Right, I'm looking for exactly a discussion on the high level goals.
> Instead of saying "here's the goals and we ruled out X because Y" we should
> start with a discussion around, "Approach A allows X and W, approach B
> allows Y and Z" and decide together what the goals should be and what
> we are willing to trade to get those goals, e.g., are we willing to give up
> global strict serializability to get the ability to support full SQL.  Both
> of these are nice to have!
>
> On Tue, Sep 21, 2021 at 9:52 PM bened...@apache.org <bened...@apache.org>
> wrote:
>
> > Hi Jonathan,
> >
> > These other systems are incompatible with the goals of the CEP. I do
> > discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> > summarise that discussion below. A true and accurate comparison of these
> > other systems is essentially intractable, as there are complex subtleties
> > to each flavour, and those who are interested would be better served by
> > performing their own research.
> >
> > I think it is more productive to focus on what we want to achieve as a
> > community. If you believe the goals of this CEP are wrong for the
> > project, let’s focus on that. If you want to compare and contrast
> > specific facets of alternative systems that you consider to be preferable
> > in some dimension, let’s do that here or in a Q&A as proposed by Joey.
> >
> > The relevant goals are that we:
> >
> >
> >   1.  Guarantee strict serializable isolation on commodity hardware
> >   2.  Scale to any cluster size
> >   3.  Achieve optimal latency
> >
> > The approach taken by Spanner derivatives is rejected by (1) because they
> > guarantee only Serializable isolation (they additionally fail (3)). From
> > watching talks by YugaByte, and inferring from Cockroach’s
> > panic-cluster-death under clock skew, this is clearly considered by
> > everyone to be undesirable but necessary to achieve scalability.
> >
> > The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > sequencing layer requires a global leader process for the cluster, which
> > is incompatible with Cassandra’s scalability requirements. It additionally
> > fails (3) for global clients.
> >
> > Two-phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > Spanner clone for its multi-key transaction functionality, not 2PC.
> >
> > Systems such as RAMP with even weaker isolation are not considered for
> > the simple reason that they do not even claim to meet (1).
> >
> > If we want to additionally offer weaker isolation levels than
> > Serializable, such as that provided by the recent RAMP-TAO paper,
> > Cassandra is likely able to support multiple distinct transaction layers
> > that operate independently. I would encourage you to file a CEP to explore
> > how we can meet these distinct use cases, but I consider them to be niche.
> > I expect that a majority of our user base desire strict serializable
> > isolation, and certainly no less than serializable isolation, to augment
> > the existing weaker isolation offered by quorum reads and writes.
> >
> > I would tangentially note that we are not an AP database under normal
> > recommended operation. A minority in any network partition cannot reach
> > QUORUM, so under recommended usage we are a high-availability leaderless
> > CP database.
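
The quorum argument above can be checked with a little arithmetic. This is a minimal sketch, assuming the usual majority definition of QUORUM (floor(RF / 2) + 1 replicas); the function names are illustrative and not Cassandra APIs.

```python
def quorum(replication_factor: int) -> int:
    """Majority quorum size for a given replication factor."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if one side of a partition still sees enough replicas for QUORUM."""
    return reachable_replicas >= quorum(replication_factor)


def sides_with_quorum(replication_factor: int, minority_size: int) -> int:
    """Count how many sides of a two-way partition can still serve QUORUM."""
    majority_size = replication_factor - minority_size
    return sum(
        can_reach_quorum(size, replication_factor)
        for size in (minority_size, majority_size)
    )
```

For example, with RF=5 split 2/3, only the side holding three replicas can serve QUORUM requests. The minority side rejects them instead of diverging, which is the CP behaviour described above.
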
> >
> >
> > From: Jonathan Ellis <jbel...@gmail.com>
> > Date: Tuesday, 21 September 2021 at 23:45
> > To: dev <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Benedict, thanks for taking the lead in putting this together. Since
> > Cassandra is the only relevant database today designed around a
> > leaderless architecture, it's quite likely that we'll be better served
> > with a custom transaction design instead of trying to retrofit one from
> > CP systems.
> >
> > The whitepaper here is a good description of the consensus algorithm
> > itself as well as its robustness and stability characteristics, and its
> > comparison with other state-of-the-art consensus algorithms is very
> > useful.  In the context of Cassandra, where a consensus algorithm is only
> > part of what will be implemented, I'd like to see a more complete
> > evaluation of the transactional side of things as well, including
> > performance characteristics as well as the types of transactions that can
> > be supported and at least a general idea of what it would look like
> > applied to Cassandra. This will allow the PMC to make a more informed
> > decision about what tradeoffs are best for the entire long-term project of
> > first supplementing and ultimately replacing LWT.
> >
> > (Allowing users to mix LWT and AP Cassandra operations against the same
> > rows was probably a mistake, so in contrast with LWT we’re not looking
> > for something fast enough for occasional use but rather something within
> > a reasonable factor of AP operations, appropriate to being the only way
> > to interact with tables declared as such.)
> >
> > Besides Accord, this should cover
> >
> > - Calvin and FaunaDB
> > - A Spanner derivative (no opinion on whether that should be Cockroach or
> > Yugabyte, I don’t think it’s necessary to cover both)
> > - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> > there is more public information about MongoDB)
> > - RAMP
> >
> > Here’s an example of what I mean:
> >
> > =Calvin=
> >
> > Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> > transactions, then replicas execute the transactions independently with
> > no further coordination.  No SPOF.  Transactions are batched by each
> > sequencer to keep this from becoming a bottleneck.
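
To make the "order first, then execute independently" idea concrete, here is a toy sketch. It is not Calvin's or FaunaDB's implementation: `sequence` simply concatenates batches as a stand-in for the consensus step, and the `set`/`add` transaction format is invented for illustration.

```python
def sequence(batches):
    """Stand-in for global consensus: merge sequencer batches into one agreed log."""
    log = []
    for batch in batches:
        log.extend(batch)
    return log


def apply_txn(state, txn):
    """Apply one deterministic transaction of the form (op, key, value)."""
    op, key, value = txn
    if op == "set":
        state[key] = value
    elif op == "add":
        state[key] = state.get(key, 0) + value
    else:
        raise ValueError(f"unknown op: {op}")


def execute(log, initial_state):
    """Run the agreed log on one replica, with no coordination with other replicas."""
    state = dict(initial_state)
    for txn in log:
        apply_txn(state, txn)
    return state


batch_from_sequencer_1 = [("set", "x", 1), ("add", "x", 2)]
batch_from_sequencer_2 = [("set", "x", 10)]
agreed_log = sequence([batch_from_sequencer_1, batch_from_sequencer_2])

replica_a = execute(agreed_log, {})
replica_b = execute(agreed_log, {})
```

Both replicas end in the same state because they apply the same log deterministically. Executing the batches in a different order would give a different result, which is why the order must be agreed before execution.
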
> >
> > Performance: Calvin paper (published 2012) reports linear scaling of
> > TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > machines with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is
> > composed of four reads and four writes, so this is effectively 2M reads/s
> > and 2M writes/s as we normally measure them in C*.
> >
> > Calvin supports mixed read/write transactions, but because the
> > transaction execution logic requires knowing all partition keys in
> > advance to ensure that all replicas can reproduce the same results with no
> > coordination, reads against non-PK predicates must be done ahead of time
> > (transparently, by the server) to determine the set of keys, and this must
> > be retried if the set of rows affected is updated before the actual
> > transaction executes.
> >
> > Batching and global consensus add latency -- 100ms in the Calvin paper
> > and apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > (including multi-partition updates) are equally performant in Calvin
> > since the coordination is handled up front in the sequencing step.  Glass
> > half empty: even single-row reads and writes have to pay the full
> > coordination cost.  Fauna has optimized this away for reads but I am not
> > aware of a description of how they changed the design to allow this.
> >
> > Functionality and limitations: since the entire transaction must be known
> > in advance to allow coordination-less execution at the replicas, Calvin
> > cannot support interactive transactions at all.  FaunaDB mitigates this
> > by allowing server-side logic to be included, but a Calvin approach will
> > never be able to offer SQL compatibility.
> >
> > Guarantees: Calvin transactions are strictly serializable.  There is no
> > additional complexity or performance hit to generalizing to multiple
> > regions, apart from the speed of light.  And since Calvin is already
> > paying a batching latency penalty, this is less painful than for other
> > systems.
> >
> > Application to Cassandra: B-.  Distributed transactions are handled by
> > the sequencing and scheduling layers, which are leaderless, and Calvin’s
> > requirements for the storage layer are easily met by C*.  But Calvin also
> > requires a global consensus protocol and LWT is almost certainly not
> > sufficiently performant, so this would require ZK or etcd (reasonable for
> > a library approach but not for replacing LWT in C* itself), or an
> > implementation of Accord.  I don’t believe Calvin would require
> > additional table-level metadata in Cassandra.
> >
> > On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org>
> > wrote:
> >
> > > Wiki:
> > > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > Whitepaper:
> > > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > Prototype: https://github.com/belliottsmith/accord
> > >
> > > Hi everyone, I’d like to propose this CEP for adoption by the
> > > community.
> > >
> > > Cassandra has benefitted from LWTs for many years, but application
> > > developers that want to ensure consistency for complex operations must
> > > either accept the scalability bottleneck of serializing all related
> > > state through a single partition, or layer a complex state machine on
> > > top of the database. These are sophisticated and costly activities that
> > > our users should not be expected to undertake. Since distributed
> > > databases are beginning to offer distributed transactions with fewer
> > > caveats, it is past time for Cassandra to do so as well.
> > >
> > > This CEP proposes the use of several novel techniques that build upon
> > > research (that followed EPaxos) to deliver (non-interactive) general
> > > purpose distributed transactions. The approach is outlined in the
> > > wikipage and in more detail in the linked whitepaper. Importantly, by
> > > adopting this approach we will be the _only_ distributed database to
> > > offer global, scalable, strict serializable transactions in one wide
> > > area round-trip. This would represent a significant improvement in the
> > > state of the art, both in the academic literature and in commercial or
> > > open source offerings.
> > >
> > > This work has been partially realised in a prototype. This partial
> > > prototype has been verified against Jepsen.io’s Maelstrom library and
> > > dedicated in-tree strict serializability verification tools, but much
> > > work remains for the work to be production capable and integrated into
> > > Cassandra.
> > >
> > > I propose including the prototype in the project as a new source
> > > repository, to be developed as a standalone library for integration
> > > into Cassandra. I hope the community sees the important value
> > > proposition of this proposal, and will adopt the CEP after this
> > > discussion, so that the library and its integration into Cassandra can
> > > be developed in parallel and with the involvement of the wider community.
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
