Scott, thanks for the summary.  Apparently I still haven't been successful
in communicating the kind of discussion around tradeoffs I want to have, or
maybe it comes off like I'm asking you to do my homework for me.

I'll put some more time into this, and I'll start a new thread hopefully
tomorrow.

On Thu, Oct 7, 2021 at 9:52 AM C. Scott Andreas <sc...@paradoxica.net>
wrote:

> Hi Jonathan,
>
> Following up on my message yesterday as it looks like our replies may have
> crossed en route.
>
> Thanks for bumping your message from earlier in our discussion. I believe
> we have addressed most of these questions on the thread, in addition to
> offering a presentation on this and related work at ApacheCon, a discussion
> hosted following that presentation at ApacheCon, and in ASF Slack.
> Contributors have further offered an opportunity to discuss specific
> questions via videoconference if it helps to speak live. I'd be happy to do
> so as well.
>
> Since your original message, discussion has covered a lot of ground on the
> related databases you've mentioned:
> – Henrik has shared expertise related to MongoDB and its implementation.
> – You've shared an overview of Calvin.
> – Alex Miller has helped us review the work relative to other Paxos
> algorithms and identified a few great enhancements to incorporate.
> – The paper discusses related approaches in FoundationDB, CockroachDB, and
> Yugabyte.
> – Subsequent discussion has contrasted the implementation to DynamoDB,
> Google Cloud BigTable, and Google Cloud Spanner (noting specifically that
> the protocol achieves Spanner's 1x round-trip without requiring specialized
> hardware).
>
> In my reply yesterday, I've attempted to crystallize what becomes possible
> via CQL: one-shot multi-partition transactions in the first implementation
> and a 4x latency reduction on writes / 2x latency reduction on reads
> relative to today; along with the ability to build upon this work to enable
> interactive transactions in the future.
>
> I believe we've exercised the questions you've raised and am grateful for
> the ground we've covered. If you have further questions that are difficult
> to exercise via email, please let me know if you'd like to arrange a call
> (open-invite); we'd be happy to discuss live as well.
>
> With the proposal hitting the one-month mark, the contributors are
> interested in gauging the developer community's response to the proposal.
> We warrant our ability to focus durably on the project; execute this
> development on ASF JIRA in collaboration with other contributors; engage
> with members of the developer and user community on feedback, enhancements,
> and bugs; and intend to deliver it to completion at a standard of readiness
> suitable for production transactional systems of record.
>
> Thanks,
>
> – Scott
>
> On Oct 6, 2021, at 8:25 AM, C. Scott Andreas <sc...@paradoxica.net> wrote:
>
>
>
> Hi folks,
>
> Thanks for discussion on this proposal, and also to Benedict who’s been
> fielding questions on the list!
>
> I’d like to restate the goals and problem statement captured by this
> proposal and frame context.
>
> Today, lightweight transactions limit users to transacting over a single
> partition. This unit of atomicity has a very low upper limit in terms of
> the amount of data that can be CAS’d over; and doing so leads many to
> design contorted data models to cram different types of data into one
> partition for the purposes of being able to CAS over it. We propose that
> Cassandra can and should be extended to remove this limit, enabling users
> to issue one-shot transactions that CAS over multiple keys – including CAS
> batches, which may modify multiple keys.
>
> To enable this, the CEP authors have designed a novel, leaderless
> paxos-based protocol unique to Cassandra, offered a proof of its
> correctness, a whitepaper outlining it in detail, along with a prototype
> implementation to incubate development, and integrated it with Maelstrom
> from jepsen.io to validate linearizability as more specific test
> infrastructure is developed. This rigor is remarkable, and I’m thrilled to
> see such a degree of investment in the area.
>
> Even users who do not require the capability to transact across partition
> boundaries will benefit. The protocol reduces message/WAN round-trips by 4x
> on writes (4 → 1) and 2x on reads (2 → 1) in the common case against
> today’s baseline. These latency improvements coupled with the enhanced
> flexibility of what can be transacted over in Cassandra enable new classes
> of applications to use the database.
>
> In particular, 1xRTT read/write transactions across partitions enable
> Cassandra to be thought of not just as a strongly consistent database, but
> even a transactional database - a mode many may even prefer to use by
> default. Given this capability, Apache Cassandra has an opportunity to
> become one of the few databases in the industry – or perhaps the only one – that can
> store multiple petabytes of data in a single database; replicate it across
> many regions; and allow users to transact over any subset of it. These are
> capabilities that can be met by no other system I’m aware of on the market.
> Dynamo’s transactions are single-DC. Google Cloud BigTable does not support
> transactions. Spanner, Aurora, CloudSQL, and RDS have far lower scalability
> limits or require specialized hardware, etc.
>
> This is an incredible opportunity for Apache Cassandra - to surpass the
> scalability and transactional capability of some of the most advanced
> systems in our industry - and to do so in open source, where anyone can
> download and deploy the software to achieve this without cost; and for
> students and researchers to learn from and build upon as well (a team from
> UT-Austin has already reached out to this effect).
>
> As Benedict and Blake noted, the scope of what’s captured in this proposal
> is also not terminal. While the first implementation may extend today’s CAS
> semantics to multiple partitions with lower latency, the foundation is
> suitable to build interactive transactions as well — which would be
> remarkable and is something that I hadn’t considered myself at the onset of
> this project.
>
> To that end, the CEP proposes the protocol, offers a validated
> implementation, and the initial capability of extending today’s
> single-partition transactions to multi-partition; while providing the
> flexibility to build upon this work further.
>
> A simple example of what becomes possible when this work lands and is
> integrated might be:
>
> –––
> BEGIN BATCH
> UPDATE tbl1 SET value1 = newValue1 WHERE partitionKey = k1
> UPDATE tbl2 SET value2 = newValue2 WHERE partitionKey = k2 AND conditionValue = someCondition
> APPLY BATCH
> –––
>
> I understand that this query is present in the CEP and my intent isn’t to
> recommend that folks reread it if they’ve given a careful reading already.
> But I do think it’s important to elaborate upon what becomes possible when
> this query can be issued.
>
> Users of Cassandra who have designed data models that cram many types of
> data into a single partition for the purposes of atomicity no longer need
> to. They can design their applications with appropriate schemas that
> wouldn’t leave Codd holding his nose. They’re no longer pushed into
> antipatterns that result in these partitions becoming huge and potentially
> unreadable. Cassandra doesn’t become fully relational in this CEP - but it
> becomes possible and even easy to design applications that transact across
> tables that mimic a large amount of relational functionality. And for users
> who are content to transact over a single table, they’ll find those
> transactions become up to 4x faster than today due to the protocol’s reduction
> in round-trips. The library’s loose coupling to Apache Cassandra and
> ability to be incubated out-of-tree also enables other applications to take
> advantage of the protocol and is a nice step toward bringing modularity to
> the project. There are a lot of good things happening here.
>
> I know I’m listed as an author - but figured I should go on record to say
> “I support this CEP.” :)
>
> Thanks,
>
> – Scott
>
> On Oct 6, 2021, at 8:05 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>
> The problem that I keep pointing out is that you've created this CEP for
> Accord without first getting consensus that the goals and the tradeoffs it
> makes to achieve those goals (and that it will impose on future work around
> transactions) are the right ones for Cassandra long term.
>
> At this point I'm done repeating myself. For the convenience of anyone
> following this thread intermittently, I'll quote my first reply on this
> thread to illustrate the kind of discussion I'd like to have.
>
> -----
>
> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful. In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
> Here’s an example of what I mean:
>
> =Calvin=
>
> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> transactions, then replicas execute the transactions independently with no
> further coordination. No SPOF. Transactions are batched by each sequencer
> to keep this from becoming a bottleneck.
>
> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> with 7GB ram and 8 virtual cores). Note that TPC-C New Order is composed
> of four reads and four writes, so this is effectively 2M reads and 2M
> writes as we normally measure them in C*.
>
> Calvin supports mixed read/write transactions, but because the transaction
> execution logic requires knowing all partition keys in advance to ensure
> that all replicas can reproduce the same results with no coordination,
> reads against non-PK predicates must be done ahead of time (transparently,
> by the server) to determine the set of keys, and this must be retried if
> the set of rows affected is updated before the actual transaction executes.
>
> Batching and global consensus add latency -- 100ms in the Calvin paper and
> apparently about 50ms in FaunaDB. Glass half full: all transactions
> (including multi-partition updates) are equally performant in Calvin since
> the coordination is handled up front in the sequencing step. Glass half
> empty: even single-row reads and writes have to pay the full coordination
> cost. Fauna has optimized this away for reads but I am not aware of a
> description of how they changed the design to allow this.
>
> Functionality and limitations: since the entire transaction must be known
> in advance to allow coordination-less execution at the replicas, Calvin
> cannot support interactive transactions at all. FaunaDB mitigates this by
> allowing server-side logic to be included, but a Calvin approach will never
> be able to offer SQL compatibility.
>
> Guarantees: Calvin transactions are strictly serializable. There is no
> additional complexity or performance hit to generalizing to multiple
> regions, apart from the speed of light. And since Calvin is already paying
> a batching latency penalty, this is less painful than for other systems.
>
> Application to Cassandra: B-. Distributed transactions are handled by the
> sequencing and scheduling layers, which are leaderless, and Calvin’s
> requirements for the storage layer are easily met by C*. But Calvin also
> requires a global consensus protocol and LWT is almost certainly not
> sufficiently performant, so this would require ZK or etcd (reasonable for a
> library approach but not for replacing LWT in C* itself), or an
> implementation of Accord. I don’t believe Calvin would require additional
> table-level metadata in Cassandra.
>
> On Wed, Oct 6, 2021 at 9:53 AM bened...@apache.org <bened...@apache.org>
> wrote:
>
> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jbel...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM bened...@apache.org <bened...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before
> > proposing this CEP. These goals are necessarily value judgements, so we
> > cannot expect that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
