That article is heavily biased by "I am selling a competitor to Cassandra."
First, read Coda's original piece if you haven't:
http://codahale.com/you-cant-sacrifice-partition-tolerance/

Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110

On Thu, Feb 24, 2011 at 2:56 PM, A J <s5a...@gmail.com> wrote:
> While we are at it, there's more to consider than just CAP in distributed :)
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>
> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5a...@gmail.com> wrote:
>>> yes, that is difficult to digest and one has to be sure if the use
>>> case can afford it.
>>>
>>> Some other NOSQL databases deal with it differently (though I don't
>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>> ask you to read from the node you wrote first (primary node) unless
>>> you are ok with eventual consistency. If the write did not make it to
>>> a majority of other nodes, it will be rolled-back from the original
>>> primary when it comes up again as a secondary.
>>> In some cases, you still could serve either new value (that was
>>> returned as failed) or the old one. But it is different from Cassandra
>>> in the sense that Cassandra will never rollback.
>>>
>>>
>>>
>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>> The leap of faith here is that an error does not mean a clean backing out
>>>> to
>>>> prior state - as we are used to with databases. It means that the operation
>>>> in error could have gone through partially
>>>>
>>>> Again, this is not an absolutely unfamiliar territory and can be dealt
>>>> with.
>>>> -JA
>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5a...@gmail.com> wrote:
>>>>>
>>>>> >>but could be broken in case of a failed write<<
>>>>> You can think of a scenario where R + W >N still leads to
>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N .
>>>>> Let's say the one node where a write happened with success goes down
>>>>> before it made it to the other N-1 nodes. Let's say it goes down for good
>>>>> and is unrecoverable. The only option is to build a new node from
>>>>> scratch from other active nodes. This will lead to a write that was
>>>>> lost and you will end up serving a stale copy of it.
>>>>>
>>>>> It is better to talk in terms of use cases and if cassandra will be a
>>>>> fit for it. Otherwise unless you have W=R=N and fsync before each
>>>>> write commit, there will be scope for inconsistency.
>>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayit...@gmail.com>
>>>>> wrote:
>>>>> > I see the point - apologies for putting everyone through this!
>>>>> > It was just militating against my mental model.
>>>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>>>> > conclude this thread (I hope):-
>>>>> > 1. I was splitting hairs over a failed ( partial ) Q Write. Such an event
>>>>> > should be immediately followed by the same write going to a connection
>>>>> > on to
>>>>> > another node ( potentially using connection caches of client
>>>>> > implementations
>>>>> > ) or a Read at CL of All. Because a write could have partially gone
>>>>> > through.
>>>>> > 2. Timestamps are used in determining the latest version ( correcting
>>>>> > the
>>>>> > false impression I was propagating)
>>>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>>>> > in
>>>>> > case of a failed write as it is unsure whether the new value got written
>>>>> > on
>>>>> > any server or not. Is that a fair characterization ?
>>>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>> > cleanup and revert back, app code has to follow up if immediate - and
>>>>> > not
>>>>> > eventual - consistency is desired. I made that leap in almost all cases
>>>>> > - I
>>>>> > think - but the case of a failed write.
>>>>> > My bad and I can live with this!
>>>>> > Regards,
>>>>> > -JA
>>>>> >
>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>> > <sylv...@datastax.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayit...@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Completely understand!
>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>> >>> consistency or not. That is what the documentation says - right. IF
>>>>> >>> for a CL
>>>>> >>> of Q read - it depends on which node returns read first to determine
>>>>> >>> the
>>>>> >>> actual returned result or other more convoluted conditions , then a
>>>>> >>> Quorum
>>>>> >>> read/write is not consistent, by any definition.
>>>>> >>
>>>>> >> But that's the point. The definition of consistency we are talking
>>>>> >> about
>>>>> >> has no meaning if you consider only a quorum read. The definition
>>>>> >> (which is
>>>>> >> the de facto definition of consistency in 'eventually consistent') makes
>>>>> >> sense if we talk about a write followed by a read. And it is
>>>>> >> considering succeeding write followed by succeeding read.
>>>>> >> And that is the statement the wiki is making.
>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>> >> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>>>>> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
>>>>> >> is
>>>>> >> guaranteed that the read will see the preceding write. And this is what
>>>>> >> is
>>>>> >> called consistency in the context of eventual consistency (which is not
>>>>> >> the
>>>>> >> context of ACID).
>>>>> >> If this is not the definition of consistency you had in mind then by
>>>>> >> all
>>>>> >> means, Cassandra probably doesn't guarantee this definition.
>>>>> >> But given that the
>>>>> >> paragraph preceding what you pasted states clearly we are not talking
>>>>> >> about
>>>>> >> ACID consistency, but eventual consistency, I don't think the wiki is
>>>>> >> making
>>>>> >> any unfair statement.
>>>>> >> That being said, the wiki may not be always as clear as it could. But
>>>>> >> it's
>>>>> >> an editable wiki :)
>>>>> >> --
>>>>> >> Sylvain
>>>>> >>
>>>>> >>>
>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>> >>> make
>>>>> >>> this statement on the Wiki architecture section:-
>>>>> >>> -------------------------------------------------------------
>>>>> >>>
>>>>> >>> More specifically: R=read replica count W=write replica
>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>> >>>
>>>>> >>> If W + R > N, you will have consistency
>>>>> >>>
>>>>> >>> W=1, R=N
>>>>> >>> W=N, R=1
>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>> >>>
>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>> >>> + write
>>>>> >>> replica count > replication factor).
>>>>> >>>
>>>>> >>> ----------------------------------------------------
>>>>> >>>
>>>>> >>> .
>>>>> >>>
>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>> >>> <sylv...@datastax.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayit...@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> If you are correct and you are probably closer to the code - then CL
>>>>> >>>>> of
>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>> >>>>
>>>>> >>>> If the operation succeeds, it does (for some definition of consistency
>>>>> >>>> which is, following reads at Quorum will be guaranteed to see the new
>>>>> >>>> value
>>>>> >>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>>> >>>> consistency.
>>>>> >>>> It is important to note that the word consistency has multiple
>>>>> >>>> meanings.
>>>>> >>>> In particular, when we are talking of consistency in Cassandra, we
>>>>> >>>> are not
>>>>> >>>> talking of the same definition as the C in ACID
>>>>> >>>>
>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>> >>>>>
>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> >>>>> <sylv...@datastax.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>> >>>>>> <chirayit...@gmail.com>
>>>>> >>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>> >>>>>>>> >> part of the application logic!!!
>>>>> >>>>>>>
>>>>> >>>>>>> >>What is your definition of conflict resolution ? Because if you
>>>>> >>>>>>> >> update twice the same column (which
>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>> >> which
>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>> >>>>>>> important
>>>>> >>>>>>> here. And yes we are responding to the immediate questions without
>>>>> >>>>>>> covering
>>>>> >>>>>>> all questions in the thread.
>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>> >>>>>>> not
>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>> >>>>>>
>>>>> >>>>>> Not quite true.
>>>>> >>>>>>>
>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>> >>>>>>> A Quorum Write comes and add/updates the time stamp (TS2) of a
>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>> >>>>>>> write is
>>>>> >>>>>>> returned as failed - right ?
>>>>> >>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>> >>>>>>> the
>>>>> >>>>>>> write failed for.
>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>> >>>>>>
>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>> >>>>>> makes the
>>>>> >>>>>> quorum, then TS2 will be returned, because cassandra will compare
>>>>> >>>>>> the
>>>>> >>>>>> timestamp and decide what to return based on this. If N2/N3
>>>>> >>>>>> responds
>>>>> >>>>>> however, both timestamp will be TS1 and so, after timestamp
>>>>> >>>>>> resolution, it
>>>>> >>>>>> will still be TS1 that will be returned.
>>>>> >>>>>> So yes timestamp is used for conflict resolution.
>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>> >>>>>> leave
>>>>> >>>>>> your cluster in an inconsistent state. You'd have to retry the
>>>>> >>>>>> quorum write and
>>>>> >>>>>> only when it succeeds can you be guaranteed that quorum read will
>>>>> >>>>>> always
>>>>> >>>>>> return TS2.
>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>> >>>>>> that
>>>>> >>>>>> the write did not make it in (there is no revert).
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>> >>>>>>> Regards,
>>>>> >>>>>>> -JA
>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>> >>>>>>> <sylv...@datastax.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>> >>>>>>>> <chirayit...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sylvain,
>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>> >>>>>>>>> part of the application logic!!!
>>>>> >>>>>>>>
>>>>> >>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>> >>>>>>>> update twice the same column (which
>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>>> which
>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>> >>>>>>>> Cassandra
>>>>> >>>>>>>> provides fairly strong durability guarantee, so for some
>>>>> >>>>>>>> definition you
>>>>> >>>>>>>> don't "lose updates".
>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>> >>>>>>>> ACID
>>>>> >>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't
>>>>> >>>>>>>> support. If
>>>>> >>>>>>>> we're talking about the guarantees of transaction, then by all
>>>>> >>>>>>>> means,
>>>>> >>>>>>>> cassandra won't provide it. And yes you can use cages or the like
>>>>> >>>>>>>> to get
>>>>> >>>>>>>> transaction. But that was not the point of the thread, was it ?
>>>>> >>>>>>>> The thread
>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with
>>>>> >>>>>>>> transaction (vector
>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>> >>>>>>>> why
>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>> >>>>>>>> Cassandra.
>>>>> >>>>>>>> --
>>>>> >>>>>>>> Sylvain
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> -JA
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>> >>>>>>>>> <sylv...@datastax.com> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>> >>>>>>>>>> <chirayit...@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>> >>>>>>>>>>> > evaluated at all update
>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>> >>>>>>>>>>> >
>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>> >>>>>>>>>>> > used
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>> >>>>>>>>>>> Questions:-
>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> No locking, no.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>> >>>>>>>>>>> data on different
>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>> >>>>>>>>>> that case,
>>>>> >>>>>>>>>> the resolution rules are the following:
>>>>> >>>>>>>>>> - If the updates have a different timestamp, keep the one
>>>>> >>>>>>>>>> with
>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>> >>>>>>>>>> wins.
>>>>> >>>>>>>>>> - If the timestamps are the same, then it compares the values
>>>>> >>>>>>>>>> (byte comparison) and keeps the highest value. This is just to
>>>>> >>>>>>>>>> break ties in
>>>>> >>>>>>>>>> a consistent manner.
>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is from two
>>>>> >>>>>>>>>> places
>>>>> >>>>>>>>>> at the same instant), then you'll end with one of the updates.
>>>>> >>>>>>>>>> This is at the
>>>>> >>>>>>>>>> column level.
>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>> >>>>>>>>>> is
>>>>> >>>>>>>>>> not good enough for some of your use cases and you need to keep
>>>>> >>>>>>>>>> two
>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>> >>>>>>>>>> updates don't
>>>>> >>>>>>>>>> end up in the same column. This is easily achieved by appending
>>>>> >>>>>>>>>> some unique
>>>>> >>>>>>>>>> identifier to the column name for instance. And when reading,
>>>>> >>>>>>>>>> do a slice and
>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic makes sense.
>>>>> >>>>>>>>>> If you do
>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks
>>>>> >>>>>>>>>> would do. Btw, no
>>>>> >>>>>>>>>> locking or anything needed.
>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>> >>>>>>>>>> enough. If the same user updates its profile picture twice on
>>>>> >>>>>>>>>> your web site
>>>>> >>>>>>>>>> at the same microsecond, it's usually fine to end up with one
>>>>> >>>>>>>>>> of the two
>>>>> >>>>>>>>>> pictures. In the rare case where you need something more
>>>>> >>>>>>>>>> specific, using the
>>>>> >>>>>>>>>> cassandra data model usually solves the problem easily. The
>>>>> >>>>>>>>>> reason for not
>>>>> >>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't
>>>>> >>>>>>>>>> really found
>>>>> >>>>>>>>>> many examples where it is not the case.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Sylvain
>>
>> Just to make a note the "EVENTUAL" in eventual consistency could be a
>> time that is less than 1ms.
>>
>> I have a program that demonstrates that "eventual" means if I write
>> data at the weakest level, and read it back from another random node
>> as soon as possible. 99% of the time I see the update.
>> I can share the code if you
>> would like.
>>
>> Remember http://en.wikipedia.org/wiki/Spacetime
>> ...but there is no reference frame in which the two events can occur
>> at the same time...
>>
>> As to MongoDB references ....Yes! most of the noSQL work differently.
>> They each approach CAP
>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>> different way.
>>
>> Cassandra does not lock (it is no secret). But remember, you can not
>> have it all pick 2/3 from CAP.
>>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
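
For anyone reading this thread in the archive, here is a rough sketch of the two rules Sylvain describes in the quoted thread: column-level timestamp resolution (higher timestamp wins, ties broken by byte comparison of the value) and what a QUORUM read returns after the failed write in Anthony's N1/N2/N3 example. It is a toy in-memory model in Python, not Cassandra code; the names `reconcile`, `quorum_read` and the `replicas` dict are made up for illustration, and columns are assumed to be simple (timestamp, value) pairs.

```python
from itertools import combinations


def reconcile(a, b):
    """Pick the winning (timestamp, value) pair for one column.

    Higher timestamp wins; on equal timestamps the larger value
    (byte comparison) wins, which only serves to break ties consistently.
    """
    # Tuples compare element by element: timestamp first, then value bytes.
    return max(a, b)


def quorum_read(replicas, responders):
    """Reconcile the copies held by the replicas that answered the read."""
    result = replicas[responders[0]]
    for name in responders[1:]:
        result = reconcile(result, replicas[name])
    return result


# Anthony's example: RF=3, quorum=2. The TS2 write reached N1 only, so the
# client saw a failure, but nothing was rolled back on N1.
TS1, TS2 = 1, 2
replicas = {"N1": (TS2, b"new"), "N2": (TS1, b"old"), "N3": (TS1, b"old")}

# Try every pair of replicas that could form the read quorum.
results = {
    pair: quorum_read(replicas, pair)
    for pair in combinations(sorted(replicas), 2)
}
for pair, (ts, value) in results.items():
    print(pair, "->", ts, value)
```

Any quorum that includes N1 returns the TS2 value, while N2 plus N3 returns the old TS1 value, which is why the failed write has to be retried until it succeeds before a quorum read is guaranteed to see it.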