Hey Marek, you’ve put a lot of effort into your proposal and first of all: Thanks for that!
If I understood you right, your proposal is about full ACID support, i.e. being able to handle multiple statements in a single transaction.

Cassandra is a database used in distributed environments - anything from a few servers sitting next to each other up to 1000+ servers on the other side of the globe, connected via a more or (probably) less reliable network. Cassandra is also a database that can handle many millions of operations per second. Why do I say that? If you now imagine that _the one_ transaction manager is no longer reachable by the client/coordinator or the involved replica nodes - what do you do? I mean, you might be able to implement the A and D - but what about I and C? IMO there is no way around a global lock/transaction manager for ACID. You could optimize some things and take some load off “the central” instance - but you still end up coordinating everything at that transaction manager. Implementing distributed locks is a nightmare - especially if the lock spans the whole world.

Proposing transactions for C* (full ACID or not) would need a thorough concept that covers all these “nice little” edge cases that can happen in a huge distributed system and that also deals with the problems arising from many tens of thousands of clients. Just think of: clients fail/get stuck, nodes fail/get stuck, the network breaks, gets partitioned, becomes flappy, etc.

Robert

> On 07 Aug 2015, at 09:18, Marek Lewandowski <marekmlewandow...@gmail.com> wrote:
>
> Hello everyone,
>
> *TL;DR;* I want to develop transactions (similar to relational ones) for Cassandra. I have some ideas and I'd like to hear your feedback.
>
> *Long story short:* I want to develop a prototype of a solution that features transactions spanning multiple Cassandra partitions, resembling those in relational databases. I understand that in Cassandra's world such transactions will be quite a bit different than in the relational db world. That prototype, or research if you will, is the subject of my master's thesis.
>
> It's been some time since I've been dancing around that subject and postponing it, but what I understood in the process is that I cannot do anything useful while hidden from the community's feedback. I want you to provide feedback on my ideas. You also have the power to change them completely, because first and foremost _I want to develop something actually useful, not only get my master's degree._
>
> To not scare anyone with a huge wall of text, for now I'll just post a very brief description of my ideas and a longer block of text describing how I see committing and rolling back data. I have more detailed descriptions prepared already, but I don't want to share 4 pages of text in one go.
>
> The scope of the project (assuming I'll do it alone) is an experiment, not a production-ready solution. Such an experiment can use whatever tools are needed to actually perform it.
>
> The baseline for any idea is:
> - asynchronous execution - think Akka and actors with non-blocking execution and message passing
> - it doesn't have to be completely transparent for the end user - the solution may enforce a certain workflow of interacting with it and/or introduce new concepts (like Akka messages instead of CQL and the binary protocol).
>
> So far, after reading a lot and thinking even more, I have two ideas that I'd like to share.
>
> ### Ideas are (with brief descriptions, details will come later): ###
>
> #### Idea 1: Event streaming ####
> - Imagine that every modification is represented by an _Event_.
> - Imagine you can group these events into _Event Groups_.
> - Imagine that such groups are time series data.
> - Imagine you can read such groups as a stream of events (think reactive stream).
>
> The idea is: you don't need to lock data when you are sure there is no one else to compete with.
>
> There is one guy called _Cursor_ that reads the Event Stream and executes Event Groups one by one, advancing its position on the stream when an Event Group has been executed.
>
> This looks like a system where you have only one transaction at any given time, but there are many ways to optimize that and to allow more than that. However, I'll stop here.
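>
> A minimal sketch of how such Event Groups might be laid out as time series data in CQL - all table and column names here are illustrative, and how the events themselves are encoded is left open:
>
> CREATE TABLE event_groups (
>     stream_id   uuid,        -- the event stream this group belongs to
>     group_time  timeuuid,    -- time-based clustering: groups form a time series
>     events      list<blob>,  -- serialized events of the group (placeholder encoding)
>     executed    boolean,     -- set once the Cursor has executed the group
>     PRIMARY KEY (stream_id, group_time)
> ) WITH CLUSTERING ORDER BY (group_time ASC);
>
> -- one row per stream remembering how far the Cursor has advanced
> CREATE TABLE cursor_position (
>     stream_id  uuid PRIMARY KEY,
>     position   timeuuid
> );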
>
> #### Idea 2: Locking data ####
> - uses additional tables to acquire read/write locks
> - uses separate tables to append modifications - as in "Rollback: Appending to a separate table" below
> - supports different isolation levels
> - a more traditional approach, kind of a translation of traditional locking into Cassandra reality.
>
> -------------------------------------------------------------------------------------------
> The common part of the two ideas is the approach to doing commit and rollback.
>
> ### Doing Rollback and commit ###
> I have two ideas for rollback. I like the 2nd one more, because it is simpler and potentially faster.
>
> #### Rollback: Query rewriting ####
> It modifies the original data, but before that it copies the original data so that the state can be restored. Then, when a failure is detected, the modification query can be rewritten so that the original data can be restored.
>
> Query rewriting seems like complex functionality. I tried a few simple and a few more complex statements, and in general for basic stuff the algorithm is not that complicated, but supporting everything CQL has to offer might be hard.
>
> Still, such a transactional system might have some restrictions on the CQL statements used, because when someone wants to have these transactions they already want something non-standard anyway.
>
> I will skip the details of that approach for now.
>
> #### Rollback: Appending to a separate table ####
> Imagine we have a table A that we want to have transactions on. This requires another table A_tx which has the same schema as A, but has *1 more clustering column* and a few new columns. A_tx will be additionally clustered by transaction id.
> The new columns are:
> - is_committed boolean
> - is_rolledback boolean
> - is_applied boolean
>
> The general idea is:
> 1. During the transaction, append changes to the XXX_tx tables.
> 2. For rollback: nothing needs to be done (besides cleaning the XXX_tx tables of useless data scheduled for rollback).
> 3. For commit: the rows in each XXX_tx are marked as committed. This can be done using a BATCH update so that all rows affected by the transaction are committed. These changes will eventually be merged back into the original row.
>
> Committed changes are visible during a query because the query has to select from 2 tables. If you query the XXX table then you have to query that table, but also XXX_TX, get all committed data, merge the results and return them to the client.
>
> Here the committed data eventually lands in the proper row - during a read or as a background process, for example (this is what the is_applied column is for): results are merged and inserted into the original row, plus the modifications can additionally be marked as _applied_.
> Uncommitted data can also eventually be cleaned up.
>
> *Important note:* since the partitioning key stays the same for the {{XXX}} table and the {{XXX_TX}} table, the data will reside on the same nodes, so queries and application of data can be done locally.
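>
> To make the commit path concrete, here is a rough CQL sketch of that layout, assuming an illustrative base table A with a text partition key, an int clustering column and one value column (the flag columns are from the description above; the timeuuid transaction id is an assumption):
>
> CREATE TABLE a (
>     pk   text,
>     ck   int,
>     val  text,
>     PRIMARY KEY (pk, ck)
> );
>
> -- companion table: same schema plus the transaction id as an extra
> -- clustering column and the three new flag columns
> CREATE TABLE a_tx (
>     pk            text,
>     ck            int,
>     tx_id         timeuuid,  -- transaction id (exact type is an assumption)
>     val           text,
>     is_committed  boolean,
>     is_rolledback boolean,
>     is_applied    boolean,   -- set once the change has been merged back into a
>     PRIMARY KEY (pk, ck, tx_id)
> );
>
> -- commit: mark every row touched by the transaction as committed in one batch
> -- (prepared-statement style; the client binds the keys and the transaction id)
> BEGIN BATCH
>     UPDATE a_tx SET is_committed = true WHERE pk = ? AND ck = ? AND tx_id = ?;
>     UPDATE a_tx SET is_committed = true WHERE pk = ? AND ck = ? AND tx_id = ?;
> APPLY BATCH;
>
> -- a read then selects from both tables and merges the committed rows
> SELECT * FROM a    WHERE pk = ? AND ck = ?;
> SELECT * FROM a_tx WHERE pk = ? AND ck = ?;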
>
> ### What happens next ###
> Assuming I get any feedback, I'll post more detailed descriptions of the two approaches.
>
> I would love to hear your feedback on the whole subject - just to begin the discussion and pique your interest.
>
> What do you think about having more heavyweight transactions?
> Does this experiment make sense at all?
>
> regards
> --
> Marek Lewandowski

—
Robert Stupp
@snazy