And everyone has a bias, and I think most people working with any of these solutions realize that.
I think it's interesting how many organizations use multiple data storage solutions, rather than just one, because each has different capabilities - like the recent Netflix news about using different data stores for different reasons.

On Feb 25, 2011, at 10:21 AM, A J wrote:

> Though you are not really implying that, I am not selling anything. I don't work for VoltDB. I had other issues for my use case with the software when I was evaluating it (their claim of durability is weak in my view. Though it does not matter, I'd rather they call themselves NOSQL; they just give lip-service to SQL).
> I'd rather not drink any sort of kool-aid, get all sides (whatever the motives of the sides may be) and be the judge myself for what I want to do.
>
> The thread was started by someone who seems to be having difficulty wrapping their head around the gives and takes of Cassandra. Maybe something else is better for their use case.
>
> Peace :)
>
> On Fri, Feb 25, 2011 at 10:39 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> That article is heavily biased by "I am selling a competitor to Cassandra."
>>
>> First, read Coda's original piece if you haven't:
>> http://codahale.com/you-cant-sacrifice-partition-tolerance/
>>
>> Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110
>>
>> On Thu, Feb 24, 2011 at 2:56 PM, A J <s5a...@gmail.com> wrote:
>>> While we are at it, there's more to consider than just CAP in distributed :)
>>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>>
>>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5a...@gmail.com> wrote:
>>>>> Yes, that is difficult to digest and one has to be sure if the use case can afford it.
>>>>>
>>>>> Some other NOSQL databases deal with it differently (though I don't think any of them use atomic 2-phase commit).
>>>>> MongoDB, for example, will ask you to read from the node you wrote to first (the primary node) unless you are OK with eventual consistency. If the write did not make it to a majority of the other nodes, it will be rolled back from the original primary when it comes up again as a secondary.
>>>>> In some cases, you could still serve either the new value (that was returned as failed) or the old one. But it is different from Cassandra in the sense that Cassandra will never roll back.
>>>>>
>>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>> The leap of faith here is that an error does not mean a clean backing out to the prior state, as we are used to with databases. It means that the operation in error could have gone through partially.
>>>>>>
>>>>>> Again, this is not absolutely unfamiliar territory and can be dealt with.
>>>>>> -JA
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5a...@gmail.com> wrote:
>>>>>>>>> but could be broken in case of a failed write<<
>>>>>>> You can think of a scenario where R + W > N still leads to inconsistency even for successful writes. Say you keep W=1 and R=N. Let's say the one node where a write succeeded goes down before the write made it to the other N-1 nodes. Let's say it goes down for good and is unrecoverable. The only option is to build a new node from scratch from the other active nodes. This will lead to a write that was lost, and you will end up serving a stale copy of the data.
>>>>>>>
>>>>>>> It is better to talk in terms of use cases and whether Cassandra will be a fit for them. Otherwise, unless you have W=R=N and fsync before each write commit, there will be scope for inconsistency.
>>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>> I see the point - apologies for putting everyone through this!
>>>>>>>> It was just militating against my mental model.
>>>>>>>> In summary, here is my takeaway - simple stuff but, IMO, important to conclude this thread (I hope):
>>>>>>>> 1. I was splitting hairs over a failed (partial) Q write. Such an event should be immediately followed by the same write going to a connection to another node (potentially using the connection caches of client implementations) or a read at CL of ALL, because a write could have partially gone through.
>>>>>>>> 2. Timestamps are used in determining the latest version (correcting the false impression I was propagating).
>>>>>>>> Finally, wrt "W + R > N for Q CL": the statement holds, but could be broken in case of a failed write, as it is unsure whether the new value got written on any server or not. Is that a fair characterization?
>>>>>>>> Bottom line - unlike a traditional DBMS, errors do not ensure automatic cleanup and reversion; app code has to follow up if immediate - and not eventual - consistency is desired. I made that leap in almost all cases, I think, except the case of a failed write.
>>>>>>>> My bad and I can live with this!
>>>>>>>> Regards,
>>>>>>>> -JA
>>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>>>> Completely understand!
>>>>>>>>>> All that I am quibbling over is whether a CL of quorum guarantees consistency or not. That is what the documentation says - right.
>>>>>>>>>> If, for a CL of Q read, the actual returned result depends on which node returns the read first, or on other more convoluted conditions, then a quorum read/write is not consistent, by any definition.
>>>>>>>>>
>>>>>>>>> But that's the point. The definition of consistency we are talking about has no meaning if you consider only a quorum read. The definition (which is the de facto definition of consistency in 'eventually consistent') makes sense if we talk about a write followed by a read. And it considers a succeeding write followed by a succeeding read.
>>>>>>>>> And that is the statement the wiki is making.
>>>>>>>>> Honestly, we could debate forever on the definition of consistency and whatnot. Cassandra guarantees that if you do a (succeeding) write on W replicas and then a (succeeding) read on R replicas, and if R+W>N, then it is guaranteed that the read will see the preceding write. And this is what is called consistency in the context of eventual consistency (which is not the context of ACID).
>>>>>>>>> If this is not the definition of consistency you had in mind, then by all means, Cassandra probably doesn't guarantee that definition. But given that the paragraph preceding what you pasted states clearly that we are not talking about ACID consistency, but eventual consistency, I don't think the wiki is making any unfair statement.
>>>>>>>>> That being said, the wiki may not always be as clear as it could be. But it's an editable wiki :)
>>>>>>>>> --
>>>>>>>>> Sylvain
>>>>>>>>>
>>>>>>>>>> I can still use Cassandra, and will use it, luv it!!!
>>>>>>>>>> But let us not make this statement in the Wiki architecture section:
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> More specifically:
>>>>>>>>>> R=read replica count
>>>>>>>>>> W=write replica count
>>>>>>>>>> N=replication factor
>>>>>>>>>> Q=QUORUM (Q = N / 2 + 1)
>>>>>>>>>>
>>>>>>>>>> If W + R > N, you will have consistency
>>>>>>>>>>
>>>>>>>>>> W=1, R=N
>>>>>>>>>> W=N, R=1
>>>>>>>>>> W=Q, R=Q where Q = N / 2 + 1
>>>>>>>>>>
>>>>>>>>>> Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
>>>>>>>>>>
>>>>>>>>>> ----------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>>>>>> If you are correct, and you are probably closer to the code, then a CL of Quorum does not guarantee consistency.
>>>>>>>>>>>
>>>>>>>>>>> If the operation succeeds, it does (for some definition of consistency, namely that following reads at Quorum will be guaranteed to see the new value of an update at quorum). If it fails, then no, it does not guarantee consistency.
>>>>>>>>>>> It is important to note that the word consistency has multiple meanings.
>>>>>>>>>>> In particular, when we are talking of consistency in Cassandra, we are not talking of the same definition as the C in ACID
>>>>>>>>>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless it is part of the application logic!!!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What is your definition of conflict resolution? Because if you update the same column twice (which I'll call a conflict), then the timestamps are used to decide which update wins (which I'll call a resolution).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I understand what you are saying, and yes, semantics are very important here. And yes, we are responding to the immediate questions without covering all questions in the thread.
>>>>>>>>>>>>>> The point being made here is that the timestamp of the column is not used by Cassandra to figure out what data to return.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not quite true.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>>>>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a particular data element. It succeeds on N1 and fails on N2/3. So the write is returned as failed - right?
>>>>>>>>>>>>>> Now a Quorum read comes in for exactly the same piece of data that the write failed for.
>>>>>>>>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1).
>>>>>>>>>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>>>>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It all depends on which (first 2) nodes respond to the read (since RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that make the quorum, then TS2 will be returned, because Cassandra will compare the timestamps and decide what to return based on this. If N2/N3 respond, however, both timestamps will be TS1 and so, after timestamp resolution, it will still be TS1 that is returned.
>>>>>>>>>>>>> So yes, the timestamp is used for conflict resolution.
>>>>>>>>>>>>> In your example, you could get TS1 back because a failed write can leave your cluster in an inconsistent state. You'd have to retry the quorum write, and only when it succeeds can you be guaranteed that a quorum read will always return TS2.
>>>>>>>>>>>>> This is because when a write fails, Cassandra doesn't guarantee that the write did not make it in (there is no revert).
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are we on the same page with this interpretation?
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>>>>>>>>>> Sylvain,
>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless it is part of the application logic!!!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is your definition of conflict resolution?
>>>>>>>>>>>>>>> Because if you update the same column twice (which I'll call a conflict), then the timestamps are used to decide which update wins (which I'll call a resolution).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use third-party products - cages, for example - to get ACID-type consistency.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then again, you'll have to define what you are calling "lost updates". Provided you use a reasonable consistency level, Cassandra provides a fairly strong durability guarantee, so for some definition you don't "lose updates".
>>>>>>>>>>>>>>> That being said, I never claimed that Cassandra provided any ACID guarantee. ACID relates to transactions, which Cassandra doesn't support. If we're talking about the guarantees of transactions, then by all means, Cassandra won't provide them. And yes, you can use cages or the like to get transactions. But that was not the point of the thread, was it? The thread is about vector clocks, and that has nothing to do with transactions (vector clocks certainly don't give you transactions).
>>>>>>>>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why, so far, I don't think vector clocks would really provide much for Cassandra.
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <chirayit...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> Apologies: for some reason my response on the original mail keeps bouncing back, thus this new one!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On the other hand, the same article says:
>>>>>>>>>>>>>>>>>>> "For conditional writes to work, the condition must be evaluated at all update sites before the write can be allowed to succeed."
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This means that when doing such an update, CL=ALL must be used
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>>>>>>>> Questions:
>>>>>>>>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any granularity, whether it be row/colF/Col?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No locking, no.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts? Concurrent updates on exactly the same piece of data on different nodes can still mess each other up, right?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL, updating the same piece of data means the same column value. In that case, the resolution rules are the following:
>>>>>>>>>>>>>>>>> - If the updates have different timestamps, keep the one with the higher timestamp.
>>>>>>>>>>>>>>>>>   That is, the more recent of two updates wins.
>>>>>>>>>>>>>>>>> - If the timestamps are the same, then it compares the values (byte comparison) and keeps the highest value. This is just to break ties in a consistent manner.
>>>>>>>>>>>>>>>>> So if you do two truly concurrent updates (that is, from two places at the same instant), then you'll end up with one of the updates. This is at the column level.
>>>>>>>>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is not good enough for some of your use cases and you need to keep two concurrent updates, it is easy enough. Just make sure that the updates don't end up in the same column. This is easily achieved by appending some unique identifier to the column name, for instance. And when reading, do a slice and reconcile whatever you get back with whatever logic makes sense. If you do that, congrats, you've roughly emulated what vector clocks would do. Btw, no locking or anything needed.
>>>>>>>>>>>>>>>>> In my experience, for most things the timestamp resolution is enough. If the same user updates their profile picture twice on your web site at the same microsecond, it's usually fine to end up with one of the two pictures. In the rare case where you need something more specific, using the Cassandra data model usually solves the problem easily.
>>>>>>>>>>>>>>>>> The reason for not having vector clocks in Cassandra is that, so far, we haven't really found many examples where that is not the case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Sylvain
>>>>
>>>> Just to make a note, the "EVENTUAL" in eventual consistency could be a time that is less than 1 ms.
>>>>
>>>> I have a program that demonstrates what "eventual" means: if I write data at the weakest level and read it back from another random node as soon as possible, 99% of the time I see the update. I can share the code if you would like.
>>>>
>>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>>> ...but there is no reference frame in which the two events can occur at the same time...
>>>>
>>>> As to the MongoDB references... Yes! Most of the NoSQL databases work differently. They each approach CAP (http://www.julianbrowne.com/article/viewer/brewers-cap-theorem) in a different way.
>>>>
>>>> Cassandra does not lock (it is no secret). But remember, you cannot have it all: pick 2 of 3 from CAP.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
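
For anyone following the N1/N2/N3 example quoted above, here is a rough Python sketch of the two ideas discussed in the thread: timestamp-based resolution (higher timestamp wins, byte comparison of the values breaks ties) and why a quorum read after a failed quorum write may return either TS1 or TS2. This is only a toy model and not Cassandra code; the names `Column`, `reconcile`, `quorum_read` and `possible_reads` are made up for illustration, and replicas are modelled as a plain dict.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for one column value stored on a replica."""
    value: bytes
    timestamp: int


def reconcile(a: Column, b: Column) -> Column:
    """Resolution rule from the thread: the higher timestamp wins;
    if timestamps are equal, the higher value (byte comparison) wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


def quorum_read(replicas: dict, responders: tuple) -> Column:
    """Reconcile the copies held by the replicas that answered the read."""
    result = replicas[responders[0]]
    for name in responders[1:]:
        result = reconcile(result, replicas[name])
    return result


def possible_reads(replicas: dict, r: int) -> set:
    """Every value a read could return, over all choices of r responders."""
    return {
        quorum_read(replicas, combo).value
        for combo in combinations(sorted(replicas), r)
    }


TS1, TS2 = 1, 2
old = Column(b"old", TS1)
new = Column(b"new", TS2)

# Failed quorum write: the new value reached N1 only (no revert happens).
after_failed_write = {"N1": new, "N2": old, "N3": old}
# Successful quorum write (W=2): the new value reached N1 and N2.
after_successful_write = {"N1": new, "N2": new, "N3": old}

# With RF=3 and R=2, the result after the failed write depends on
# which two nodes answer; after the successful write it does not.
print(sorted(possible_reads(after_failed_write, 2)))      # [b'new', b'old']
print(sorted(possible_reads(after_successful_write, 2)))  # [b'new']
```

Under these assumptions, the failed write leaves both outcomes possible, which matches the point made in the thread: only a retried write that succeeds at quorum guarantees that every later quorum read returns the new value.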