On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <henrik.i...@datastax.com> wrote:

> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed per design and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
>
> [snip]
>
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.
These two pieces together seem to imply that, in your proposal, Read Committed may read whatever data was most recently committed during the execution of the statement, and thus does not require MVCC. Though I agree that the standard[1] is very unclear as to what a "read" means when defining a non-repeatable read:

> 2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-
> transaction T2 then modifies or deletes that row and performs
> a COMMIT. If T1 then attempts to reread the row, it may receive
> the modified value or discover that the row has been deleted.

The common implementation is that each Read Committed statement reads from a snapshot of the database state. The documentation of various database implementations is much clearer about this; see, for example, Oracle[2] or MySQL[3] on the subject. So I believe Read Committed would also require MVCC support in the storage engine, the same way that Snapshot Isolation would.

[1]: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
[2]: https://docs.oracle.com/cd/E25054_01/server.1111/e25789/consist.htm (see section "Read Consistency in the Read Committed Isolation Level")
[3]: https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html

On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <henrik.i...@datastax.com> wrote:

> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations are however executed as
> read-queries toward the database nodes. Database state isn't modified
> during the conversational phase, rather the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated are the
> write set of the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.

As I've pondered this over time, I've come to specifically fault read-your-uncommitted-writes as the reason why NewSQL databases are essentially a design monoculture. Every such database persists uncommitted writes in the database itself during execution. Doing so encourages those writes to be re-used for concurrency control (i.e. write intents), and that in turn places you in the exact client-driven 3PC protocol which Cockroach, TiDB, and YugaByte all implement. Even if you find some radically different database for SQL, like leanXcale, it _still_ persists uncommitted writes into the database. And every time I've thought through this, I tend to agree with that choice. It's exceedingly easy to write a SQL query which will exceed any local limit imposed by memory, and it's too easy to write a query which runs fine in production for a while, until it hits a memory limit and begins to immediately fail. There's a tremendous implementation difference between `DELETE FROM Table` and `TRUNCATE Table`, and it's relatively hard to explain that to naive users.

Memory constraints aside, merging a local write cache into remote data for execution seems like it'd be quite a task. Any desire for efficient distributed query execution would push for a design where query fragments can be pushed down to the nodes holding the data. I imagine that one would then need to distribute all writes out to each partition along with the query fragment for them to execute, so that they can merge the pending writes in with the existing data, but such a solution also places a significant overhead burden on the database.
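To make that burden concrete, here is a minimal sketch of the overlay merge each data node would have to perform per statement. Everything here is hypothetical; none of these names come from Cassandra or any NewSQL codebase:

```java
import java.util.*;

// Hypothetical sketch: a data node merging a client's shipped write cache
// into a scan over its committed rows. Every statement must carry the full
// cache over the network, and the node must materialize the overlay in memory.
final class MergingScan {

    // A pending write shipped from the client: an upserted row or a delete.
    record PendingWrite(byte[] row, boolean isDelete) {}

    static List<byte[]> scan(SortedMap<String, byte[]> committedRows,
                             SortedMap<String, PendingWrite> clientWrites) {
        // Overlay the client's uncommitted writes on committed state:
        // its upserts shadow committed rows, its deletes hide them.
        TreeMap<String, byte[]> merged = new TreeMap<>(committedRows);
        clientWrites.forEach((key, w) -> {
            if (w.isDelete()) merged.remove(key);
            else merged.put(key, w.row());
        });
        return new ArrayList<>(merged.values());
    }
}
```

Even this toy version has to hold the full write set in memory and rebuild the overlay on every statement, on every node touched by the query; a real distributed engine would pay strictly more.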
That is, clients need to resend all potentially relevant writes to servers on each statement, and servers need to hold all of a client's writes in memory as they execute. The alternative, trying to carefully calculate the subset of the result affected by the local writes and union it into a query executed without the local writes, feels prohibitively complex. Both directions seem fraught with peril. But the write intent approach makes this conveniently easy: any server that has a row of data also has all the uncommitted writes to that row from currently in-progress transactions, and thus can easily filter to the correct row version as part of its MVCC implementation. Faced with these two options, I understand why the world has settled on write intents as a great solution, but the monoculture does make me a bit sad.

On Tue, Oct 12, 2021 at 5:20 PM Jonathan Ellis <jbel...@gmail.com> wrote:

> without having the entire logic of the transaction available to it, the server
> cannot retry the transaction when concurrent changes are found to have
> been applied after the reconnaissance reads (what you call the
> conversational phase).

This is not true at Read Committed! The major advantage of Read Committed over Repeatable Read is that by giving up a consistent read timestamp from statement to statement, you give the server the ability to retry a statement's execution multiple times, waiting out conflicting transactions between each attempt. Entire transactions don't need to be retried when a conflict is encountered, only the statement which encountered the conflict. This is largely pedantic nitpicking, since your question applies at Repeatable Read and above, but server-side retry is an important (and expected) Read Committed optimization!

More related to your point: suppose an (Accord) transaction attempts to re-validate its reconnaissance reads, fails, and aborts. The client that submitted the transaction notices that its transaction failed, re-runs reconnaissance, and re-submits the transaction. It does so in a loop until one of its transactions reports successfully executing/applying. Do you consider that an interactive transaction? If so, then I would wish to clarify that this isn't a question of _if_ a deterministic database can support interactive transactions, it's a question of how efficiently they can be supported.
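For concreteness, the loop I have in mind looks roughly like the sketch below. The Store interface and its methods are hypothetical stand-ins for reconnaissance reads and Accord transaction submission, not the actual Accord API:

```java
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// A hedged sketch of the client-driven retry loop described above.
final class ReconRetryLoop {

    record Versioned(byte[] value, long version) {}

    interface Store {
        // Reconnaissance phase: read current values (with versions) for keys.
        Map<String, Versioned> read(Set<String> keys);

        // Submit a write set that commits only if the observed versions are
        // still current; returns false if validation failed and it aborted.
        boolean submitIfUnchanged(Map<String, Versioned> observed,
                                  Map<String, byte[]> writes);
    }

    // Re-run reconnaissance and re-submit until one submission validates
    // and applies. Each iteration is one complete, deterministic transaction.
    static void run(Store store, Set<String> keys,
                    Function<Map<String, Versioned>, Map<String, byte[]>> logic) {
        while (true) {
            Map<String, Versioned> observed = store.read(keys);    // recon reads
            Map<String, byte[]> writes = logic.apply(observed);    // client logic
            if (store.submitIfUnchanged(observed, writes)) return; // commit phase
            // Validation failed: a conflicting commit landed in between; retry.
        }
    }
}
```

Each iteration is a complete, deterministic transaction from the database's perspective; the interactivity has simply been pushed into the client, which is why I'd frame this as a question of efficiency rather than possibility.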