How are interactive transactions possible with Accord?
On Tue, Sep 21, 2021 at 11:56 PM bened...@apache.org <bened...@apache.org> wrote: > Could you explain why you believe this trade-off is necessary? We can > support full SQL just fine with Accord, and I hope that we eventually do so. > > This domain is incredibly complex, so it is easy to reach wrong > conclusions. I would invite you again to propose a system for discussion > that you think offers something Accord is unable to, and that you consider > desirable, and we can work from there. > > To pre-empt some possible discussions, I am not aware of anything we > cannot do with Accord that we could do with either Calvin or Spanner. > Interactive transactions are possible on top of Accord, as are transactions > with an unknown read/write set. In each case the only cost is that they > would use optimistic concurrency control, which is no worse the spanner > derivatives anyway (which I have to assume is your benchmark in this > regard). I do not expect to deliver either functionality initially, but > Accord takes us most of the way there for both. > > > From: Jonathan Ellis <jbel...@gmail.com> > Date: Wednesday, 22 September 2021 at 05:36 > To: dev <dev@cassandra.apache.org> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > Right, I'm looking for exactly a discussion on the high level goals. > Instead of saying "here's the goals and we ruled out X because Y" we should > start with a discussion around, "Approach A allows X and W, approach B > allows Y and Z" and decide together what the goals should be and and what > we are willing to trade to get those goals, e.g., are we willing to give up > global strict serializability to get the ability to support full SQL. Both > of these are nice to have! > > On Tue, Sep 21, 2021 at 9:52 PM bened...@apache.org <bened...@apache.org> > wrote: > > > Hi Jonathan, > > > > These other systems are incompatible with the goals of the CEP. I do > > discuss them (besides 2PC) in both the whitepaper and the CEP, and will > > summarise that discussion below. A true and accurate comparison of these > > other systems is essentially intractable, as there are complex subtleties > > to each flavour, and those who are interested would be better served by > > performing their own research. > > > > I think it is more productive to focus on what we want to achieve as a > > community. If you believe the goals of this CEP are wrong for the > project, > > let’s focus on that. If you want to compare and contrast specific facets > of > > alternative systems that you consider to be preferable in some dimension, > > let’s do that here or in a Q&A as proposed by Joey. > > > > The relevant goals are that we: > > > > > > 1. Guarantee strict serializable isolation on commodity hardware > > 2. Scale to any cluster size > > 3. Achieve optimal latency > > > > The approach taken by Spanner derivatives is rejected by (1) because they > > guarantee only Serializable isolation (they additionally fail (3)). From > > watching talks by YugaByte, and inferring from Cockroach’s > > panic-cluster-death under clock skew, this is clearly considered by > > everyone to be undesirable but necessary to achieve scalability. > > > > The approach taken by FaunaDB (Calvin) is rejected by (2) because its > > sequencing layer requires a global leader process for the cluster, which > is > > incompatible with Cassandra’s scalability requirements. It additionally > > fails (3) for global clients. > > > > Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a > > Spanner clone for its multi-key transaction functionality, not 2PC. > > > > Systems such as RAMP with even weaker isolation are not considered for > the > > simple reason that they do not even claim to meet (1). > > > > If we want to additionally offer weaker isolation levels than > > Serializable, such as that provided by the recent RAMP-TAO paper, > Cassandra > > is likely able to support multiple distinct transaction layers that > operate > > independently. I would encourage you to file a CEP to explore how we can > > meet these distinct use cases, but I consider them to be niche. I expect > > that a majority of our user base desire strict serializable isolation, > and > > certainly no less than serializable isolation, to augment the existing > > weaker isolation offered by quorum reads and writes. > > > > I would tangentially note that we are not an AP database under normal > > recommended operation. A minority in any network partition cannot reach > > QUORUM, so under recommended usage we are a high-availability leaderless > CP > > database. > > > > > > From: Jonathan Ellis <jbel...@gmail.com> > > Date: Tuesday, 21 September 2021 at 23:45 > > To: dev <dev@cassandra.apache.org> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions > > Benedict, thanks for taking the lead in putting this together. Since > > Cassandra is the only relevant database today designed around a > leaderless > > architecture, it's quite likely that we'll be better served with a custom > > transaction design instead of trying to retrofit one from CP systems. > > > > The whitepaper here is a good description of the consensus algorithm > itself > > as well as its robustness and stability characteristics, and its > comparison > > with other state-of-the-art consensus algorithms is very useful. In the > > context of Cassandra, where a consensus algorithm is only part of what > will > > be implemented, I'd like to see a more complete evaluation of the > > transactional side of things as well, including performance > characteristics > > as well as the types of transactions that can be supported and at least a > > general idea of what it would look like applied to Cassandra. This will > > allow the PMC to make a more informed decision about what tradeoffs are > > best for the entire long-term project of first supplementing and > ultimately > > replacing LWT. > > > > (Allowing users to mix LWT and AP Cassandra operations against the same > > rows was probably a mistake, so in contrast with LWT we’re not looking > for > > something fast enough for occasional use but rather something within a > > reasonable factor of AP operations, appropriate to being the only way to > > interact with tables declared as such.) > > > > Besides Accord, this should cover > > > > - Calvin and FaunaDB > > - A Spanner derivative (no opinion on whether that should be Cockroach or > > Yugabyte, I don’t think it’s necessary to cover both) > > - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect > > there is more public information about MongoDB) > > - RAMP > > > > Here’s an example of what I mean: > > > > =Calvin= > > > > Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order > > transactions, then replicas execute the transactions independently with > no > > further coordination. No SPOF. Transactions are batched by each > sequencer > > to keep this from becoming a bottleneck. > > > > Performance: Calvin paper (published 2012) reports linear scaling of > TPC-C > > New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines > > with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed > > of four reads and four writes, so this is effectively 2M reads and 2M > > writes as we normally measure them in C*. > > > > Calvin supports mixed read/write transactions, but because the > transaction > > execution logic requires knowing all partition keys in advance to ensure > > that all replicas can reproduce the same results with no coordination, > > reads against non-PK predicates must be done ahead of time > (transparently, > > by the server) to determine the set of keys, and this must be retried if > > the set of rows affected is updated before the actual transaction > executes. > > > > Batching and global consensus adds latency -- 100ms in the Calvin paper > and > > apparently about 50ms in FaunaDB. Glass half full: all transactions > > (including multi-partition updates) are equally performant in Calvin > since > > the coordination is handled up front in the sequencing step. Glass half > > empty: even single-row reads and writes have to pay the full coordination > > cost. Fauna has optimized this away for reads but I am not aware of a > > description of how they changed the design to allow this. > > > > Functionality and limitations: since the entire transaction must be known > > in advance to allow coordination-less execution at the replicas, Calvin > > cannot support interactive transactions at all. FaunaDB mitigates this > by > > allowing server-side logic to be included, but a Calvin approach will > never > > be able to offer SQL compatibility. > > > > Guarantees: Calvin transactions are strictly serializable. There is no > > additional complexity or performance hit to generalizing to multiple > > regions, apart from the speed of light. And since Calvin is already > paying > > a batching latency penalty, this is less painful than for other systems. > > > > Application to Cassandra: B-. Distributed transactions are handled by > the > > sequencing and scheduling layers, which are leaderless, and Calvin’s > > requirements for the storage layer are easily met by C*. But Calvin also > > requires a global consensus protocol and LWT is almost certainly not > > sufficiently performant, so this would require ZK or etcd (reasonable > for a > > library approach but not for replacing LWT in C* itself), or an > > implementation of Accord. I don’t believe Calvin would require > additional > > table-level metadata in Cassandra. > > > > On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org> > > wrote: > > > > > Wiki: > > > > > > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions > > > Whitepaper: > > > > > > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf > > > < > > > > > > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2 > > > > > > > Prototype: https://github.com/belliottsmith/accord > > > > > > Hi everyone, I’d like to propose this CEP for adoption by the > community. > > > > > > Cassandra has benefitted from LWTs for many years, but application > > > developers that want to ensure consistency for complex operations must > > > either accept the scalability bottleneck of serializing all related > state > > > through a single partition, or layer a complex state machine on top of > > the > > > database. These are sophisticated and costly activities that our users > > > should not be expected to undertake. Since distributed databases are > > > beginning to offer distributed transactions with fewer caveats, it is > > past > > > time for Cassandra to do so as well. > > > > > > This CEP proposes the use of several novel techniques that build upon > > > research (that followed EPaxos) to deliver (non-interactive) general > > > purpose distributed transactions. The approach is outlined in the > > wikipage > > > and in more detail in the linked whitepaper. Importantly, by adopting > > this > > > approach we will be the _only_ distributed database to offer global, > > > scalable, strict serializable transactions in one wide area round-trip. > > > This would represent a significant improvement in the state of the art, > > > both in the academic literature and in commercial or open source > > offerings. > > > > > > This work has been partially realised in a prototype. This partial > > > prototype has been verified against Jepsen.io’s Maelstrom library and > > > dedicated in-tree strict serializability verification tools, but much > > work > > > remains for the work to be production capable and integrated into > > Cassandra. > > > > > > I propose including the prototype in the project as a new source > > > repository, to be developed as a standalone library for integration > into > > > Cassandra. I hope the community sees the important value proposition of > > > this proposal, and will adopt the CEP after this discussion, so that > the > > > library and its integration into Cassandra can be developed in parallel > > and > > > with the involvement of the wider community. > > > > > > > > > -- > > Jonathan Ellis > > co-founder, http://www.datastax.com > > @spyced > > > > > -- > Jonathan Ellis > co-founder, http://www.datastax.com > @spyced > -- Jonathan Ellis co-founder, http://www.datastax.com @spyced