He has a product to sell, so you can expect some advertising. But in general, Stonebraker's articles are very deep (another one that challenges common conceptions is http://voltdb.com/voltdb-webinar-sql-urban-myths ). He is the creator of Postgres and is considered a database guru by many. And if you cannot let go of ACID and are not satisfied with traditional DBMS solutions, VoltDB is worth considering. Of course, it solves a different problem (OLTP) than Cassandra does.
On Thu, Feb 24, 2011 at 5:20 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Thu, Feb 24, 2011 at 3:56 PM, A J <s5a...@gmail.com> wrote:
>> While we are at it, there's more to consider than just CAP in distributed :)
>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>
>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5a...@gmail.com> wrote:
>>>> yes, that is difficult to digest and one has to be sure if the use
>>>> case can afford it.
>>>>
>>>> Some other NOSQL databases deals with it differently (though I don't
>>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>>> ask you to read from the node you wrote first (primary node) unless
>>>> you are ok with eventual consistency. If the write did not make to
>>>> majority of other nodes, it will be rolled-back from the original
>>>> primary when it comes up again as a secondary.
>>>> In some cases, you still could server either new value (that was
>>>> returned as failed) or the old one. But it is different from Cassandra
>>>> in the sense that Cassandra will never rollback.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayit...@gmail.com>
>>>> wrote:
>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>> prior state - as we are used to with databases. It means that the operation
>>>>> in error could have gone through partially
>>>>>
>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt
>>>>> with.
>>>>> -JA
>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5a...@gmail.com> wrote:
>>>>>>
>>>>>> >>but could be broken in case of a failed write<<
>>>>>> You can think of a scenario where R + W >N still leads to
>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N .
>>>>>> Lets say the one node where a write happened with success goes down
>>>>>> before it made to the other N-1 nodes. Lets say it goes down for good
>>>>>> and is unrecoverable. The only option is to build a new node from
>>>>>> scratch from other active nodes. This will lead to a write that was
>>>>>> lost and you will end up serving stale copy of it.
>>>>>>
>>>>>> It is better to talk in terms of use cases and if cassandra will be a
>>>>>> fit for it. Otherwise unless you have W=R=N and fsync before each
>>>>>> write commit, there will be scope for inconsistency.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayit...@gmail.com>
>>>>>> wrote:
>>>>>> > I see the point - apologies for putting everyone through this!
>>>>>> > It was just militating against my mental model.
>>>>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>>>>> > conclude this thread (I hope):-
>>>>>> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>>>>>> > should be immediately followed by the same write going to a connection on to
>>>>>> > another node ( potentially using connection caches of client implementations
>>>>>> > ) or a Read at CL of All. Because a write could have partially gone through.
>>>>>> > 2. Timestamps are used in determining the latest version ( correcting the
>>>>>> > false impression I was propagating)
>>>>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
>>>>>> > case of a failed write as it is unsure whether the new value got written on
>>>>>> > any server or not. Is that a fair characterization ?
>>>>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>>> > cleanup and revert back, app code has to follow up if immediate - and not
>>>>>> > eventual - consistency is desired. I made that leap in almost all cases - I
>>>>>> > think - but the case of a failed write.
>>>>>> > My bad and I can live with this!
>>>>>> > Regards,
>>>>>> > -JA
>>>>>> >
>>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>> > <sylv...@datastax.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayit...@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Completely understand!
>>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>> >>> consistency or not. That is what the documentation says - right. IF for a CL
>>>>>> >>> of Q read - it depends on which node returns read first to determine the
>>>>>> >>> actual returned result or other more convoluted conditions , then a Quorum
>>>>>> >>> read/write is not consistent, by any definition.
>>>>>> >>
>>>>>> >> But that's the point. The definition of consistency we are talking about
>>>>>> >> has no meaning if you consider only a quorum read. The definition (which is
>>>>>> >> the de facto definition of consistency in 'eventually consistent') make
>>>>>> >> sense if we talk about a write followed by a read. And it is
>>>>>> >> considering succeeding write followed by succeeding read.
>>>>>> >> And that is the statement the wiki is making.
>>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>>> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>>>>>> >> replica and then a (succeeding) read on R replica and if R+W>N, then it is
>>>>>> >> guaranteed that the read will see the preceding write. And this is what is
>>>>>> >> called consistency in the context of eventual consistency (which is not the
>>>>>> >> context of ACID).
>>>>>> >> If this is not the definition of consistency you had in mind then by all
>>>>>> >> mean, Cassandra probably don't guarantee this definition. But given that the
>>>>>> >> paragraph preceding what you pasted state clearly we are not talking about
>>>>>> >> ACID consistency, but eventual consistency, I don't think the wiki is making
>>>>>> >> any unfair statement.
>>>>>> >> That being said, the wiki may not be always as clear as it could. But it's
>>>>>> >> an editable wiki :)
>>>>>> >> --
>>>>>> >> Sylvain
>>>>>> >>
>>>>>> >>>
>>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>>>>> >>> this statement on the Wiki architecture section:-
>>>>>> >>> -------------------------------------------------------------
>>>>>> >>>
>>>>>> >>> More specifically: R=read replica count W=write replica
>>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>> >>>
>>>>>> >>> If W + R > N, you will have consistency
>>>>>> >>>
>>>>>> >>> W=1, R=N
>>>>>> >>> W=N, R=1
>>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>>> >>>
>>>>>> >>> Cassandra provides consistency when R + W > N (read replica count + write
>>>>>> >>> replica count > replication factor).
>>>>>> >>>
>>>>>> >>> ----------------------------------------------------
>>>>>> >>>
>>>>>> >>> .
>>>>>> >>>
>>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>> >>> <sylv...@datastax.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John
>>>>>> >>>> <chirayit...@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> If you are correct and you are probably closer to the code - then CL of
>>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>>> >>>>
>>>>>> >>>> If the operation succeed, it does (for some definition of consistency
>>>>>> >>>> which is, following reads at Quorum will be guaranteed to see the new value
>>>>>> >>>> of a update at quorum). If it fails, then no, it does not guarantee
>>>>>> >>>> consistency.
>>>>>> >>>> It is important to note that the word consistency has multiple meaning.
>>>>>> >>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>>>> >>>> talking of the same definition as the C in ACID
>>>>>> >>>>
>>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>> >>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>> >>>>>> <chirayit...@gmail.com>
>>>>>> >>>>>> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless is is
>>>>>> >>>>>>>> >> part of the application logic!!!
>>>>>> >>>>>>>
>>>>>> >>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>>> >>>>>>> >> update twice the same column (which
>>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>>> >>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>> >>>>>>> here. And yes we are responding to the immediate questions without covering
>>>>>> >>>>>>> all questions in the thread.
>>>>>> >>>>>>> The point being made here is that the timestamp of the column is not
>>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>>> >>>>>>
>>>>>> >>>>>> Not quite true.
>>>>>> >>>>>>>
>>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> >>>>>>> A Quorum Write comes and add/updates the time stamp (TS2) of a
>>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>>> >>>>>>> returned as failed - right ?
>>>>>> >>>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>>> >>>>>>> write failed for.
>>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>>> >>>>>>
>>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> >>>>>> RF=3, that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>>>>> >>>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>>>> >>>>>> timestamp and decide what to return based on this. If N2/N3 responds
>>>>>> >>>>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>>>>> >>>>>> will stil be TS1 that will be returned.
>>>>>> >>>>>> So yes timestamp is used for conflict resolution.
>>>>>> >>>>>> In your example, you could get TS1 back because a failed write can let
>>>>>> >>>>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>>>>> >>>>>> only when it succeeds can you be guaranteed that quorum read will always
>>>>>> >>>>>> return TS2.
>>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> >>>>>> the write did not made it in (there is no revert).
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> -JA
>>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>> >>>>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>> >>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Sylvan,
>>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>> >>>>>>>>> part of the application logic!!!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>> >>>>>>>> update twice the same column (which
>>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>> >>>>>>>> provides fairly strong durability guarantee, so for some definition you
>>>>>> >>>>>>>> don't "lose updates".
>>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>> >>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>>> >>>>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>>> >>>>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>>> >>>>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>>> >>>>>>>> Cassandra.
>>>>>> >>>>>>>> --
>>>>>> >>>>>>>> Sylvain
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> -JA
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>> >>>>>>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>> >>>>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>> >>>>>>>>>>> > evaluated at all update
>>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >>>>>>>>>>> >
>>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>> >>>>>>>>>>> Questions:-
>>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> No locking, no.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>> >>>>>>>>>> the resolution rules are the following:
>>>>>> >>>>>>>>>> - If the updates have a different timestamp, keep the one with
>>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates win.
>>>>>> >>>>>>>>>> - It the timestamps are the same, then it compares the values
>>>>>> >>>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>> >>>>>>>>>> a consistent manner.
>>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is from two place
>>>>>> >>>>>>>>>> at the same instant), then you'll end with one of the update. This is the
>>>>>> >>>>>>>>>> column level.
>>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>> >>>>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>> >>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>> >>>>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>> >>>>>>>>>> locking or anything needed.
>>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>> >>>>>>>>>> enough. If the same user update twice it's profile picture on you web site
>>>>>> >>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>> >>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>> >>>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>> >>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>> >>>>>>>>>> much example where it is no the case.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> --
>>>>>> >>>>>>>>>> Sylvain
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Just to make a note the "EVENTUAL" in eventual consistency could be a
>>> time that is less then 1ms.
>>>
>>> I have a program that demonstrates that "eventual" means if i write
>>> data at the weakest level, and read it back from a random another node
>>> as soon as possible. 99% I see the update. I can share the code if you
>>> would like.
>>>
>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>> ...but there is no reference frame in which the two events can occur
>>> at the same time...
>>>
>>> As to MongoDB references ....Yes! most of the noSQL work differently.
>>> They each approach CAP
>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>> different way.
>>>
>>> Cassandra does not lock (it is no secret). But remember, you can not
>>> have it all pick 2/3 from CAP.
>>>
>>
>
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
> I was reading that and many of the points were well taken...up until...
>
> Next generation DBMS technologies, such as VoltDB, have been shown to
> run around 50X the speed of conventional SQL engines. Thus, if you
> need 200 nodes to support a specific SQL application, then VoltDB can
> probably do the same application on 4 nodes. The probability of a
> failure on 200 nodes is wildly different than the probability of
> failure on four nodes.
>
> Come on? 200 nodes down to 4? I just can not take it seriously any more.
>
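
For anyone skimming the quoted thread, here is a minimal Python sketch of the failed quorum write scenario discussed above (RF=3, quorum of 2, N1 holding TS2 while N2/N3 still hold TS1). It is only a toy model, not Cassandra code: the names `resolve`, `quorum_read` and the `replicas` dict are made up for illustration, and the resolution rule is the one described in the thread (higher timestamp wins, ties broken by comparing the values as bytes).

```python
from itertools import combinations

def resolve(a, b):
    """Pick the winner of two (timestamp, value) versions of one column."""
    # Higher timestamp wins; on a tie the larger value (byte comparison) wins.
    # Python compares tuples element by element, which covers both rules.
    return max(a, b)

def quorum_read(replicas, responders):
    """Reconcile the versions returned by the replicas that answered."""
    result = replicas[responders[0]]
    for name in responders[1:]:
        result = resolve(result, replicas[name])
    return result

# Hypothetical cluster: RF = 3, quorum = 2, every replica holds the old version.
TS1, TS2 = 1, 2
replicas = {"N1": (TS1, b"old"), "N2": (TS1, b"old"), "N3": (TS1, b"old")}

# A quorum write reaches N1 only, so the client sees a failure,
# but N1 keeps the new version because there is no revert.
replicas["N1"] = (TS2, b"new")

# The result of a quorum read now depends on which two replicas answer.
outcomes = {pair: quorum_read(replicas, pair)
            for pair in combinations(sorted(replicas), 2)}
for pair, version in outcomes.items():
    print(pair, "->", version)

# Retrying until a second replica acknowledges makes every quorum read agree.
replicas["N2"] = (TS2, b"new")
after_retry = {quorum_read(replicas, pair)
               for pair in combinations(sorted(replicas), 2)}
print(after_retry)
```

In this model any read that includes N1 returns the new value and a read served by N2 and N3 alone returns the old one, which is the inconsistency window after a failed write; once the retry succeeds on a second replica, every pair of responders overlaps with a replica holding TS2.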