>>but could be broken in case of a failed write<<

You can think of a scenario where R + W > N still leads to inconsistency even for successful writes. Say you keep W=1 and R=N. Let's say the one node where the write succeeded goes down before the write made it to the other N-1 nodes. Let's say it goes down for good and is unrecoverable. The only option is to build a new node from scratch from the other active nodes. The write is then lost, and you will end up serving a stale copy of the data.
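A toy Python sketch of that scenario, in case it helps. This is only an illustration, not Cassandra code: the replica dicts and the `write`/`read` helpers are made-up names, and it assumes the write never reaches the other replicas before the node is lost.

```python
# Toy model (not Cassandra code): each replica maps key -> (timestamp, value).
N = 3
replicas = [{"k": (1, "old")} for _ in range(N)]

def write(nodes, key, ts, value):
    """Apply a write to the given replica indexes only (W = len(nodes))."""
    for i in nodes:
        replicas[i][key] = (ts, value)

def read(nodes, key):
    """Read from the given replicas and return the newest (timestamp, value)."""
    return max(replicas[i][key] for i in nodes)

# W=1: the write is acknowledged after landing on replica 0 only.
write([0], "k", 2, "new")
before_failure = read([0, 1, 2], "k")  # R=N still sees the new value

# Replica 0 is lost for good before the write spreads; rebuild it from a survivor.
replicas[0] = dict(replicas[1])
after_rebuild = read([0, 1, 2], "k")   # R=N now returns the old value
```

With W=1 and R=N the read sees the write only as long as the single replica that took it survives; after the rebuild every replica holds the old version.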
It is better to talk in terms of use cases and whether Cassandra is a fit for them. Otherwise, unless you have W=R=N and fsync before each write commit, there will be scope for inconsistency.

On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayit...@gmail.com> wrote:
> I see the point - apologies for putting everyone through this!
> It was just militating against my mental model.
> In summary, here is my take away - simple stuff but - IMO - important to
> conclude this thread (I hope):-
> 1. I was splitting hairs over a failed ( partial ) Q Write. Such an event
> should be immediately followed by the same write going to a connection on to
> another node ( potentially using connection caches of client implementations
> ) or a Read at CL of All. Because a write could have partially gone through.
> 2. Timestamps are used in determining the latest version ( correcting the
> false impression I was propagating)
> Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
> case of a failed write as it is unsure whether the new value got written on
> any server or not. Is that a fair characterization ?
> Bottom line - unlike traditional DBMS, errors do not ensure automatic
> cleanup and revert back, app code has to follow up if immediate - and not
> eventual - consistency is desired. I made that leap in almost all cases - I
> think - but the case of a failed write.
> My bad and I can live with this!
> Regards,
> -JA
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sylv...@datastax.com>
> wrote:
>>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayit...@gmail.com>
>> wrote:
>>>
>>> Completely understand!
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says - right.
>>> IF for a CL of Q read - it depends on which node returns read first to
>>> determine the actual returned result or other more convoluted conditions,
>>> then a Quorum read/write is not consistent, by any definition.
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') makes
>> sense if we talk about a write followed by a read. And it is
>> considering a succeeding write followed by a succeeding read.
>> And that is the statement the wiki is making.
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>> replicas and then a (succeeding) read on R replicas and if R+W>N, then it is
>> guaranteed that the read will see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>> If this is not the definition of consistency you had in mind then by all
>> means, Cassandra probably doesn't guarantee this definition. But given that
>> the paragraph preceding what you pasted states clearly we are not talking
>> about ACID consistency, but eventual consistency, I don't think the wiki is
>> making any unfair statement.
>> That being said, the wiki may not be always as clear as it could. But it's
>> an editable wiki :)
>> --
>> Sylvain
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!!
>>> But let us not make this statement on the Wiki architecture section:-
>>> -------------------------------------------------------------
>>>
>>> More specifically: R=read replica count W=write replica
>>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>
>>> If W + R > N, you will have consistency
>>>
>>> W=1, R=N
>>> W=N, R=1
>>> W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> ----------------------------------------------------
>>>
>>> .
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sylv...@datastax.com>
>>> wrote:
>>>>
>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayit...@gmail.com>
>>>> wrote:
>>>>>
>>>>> If you are correct and you are probably closer to the code - then CL of
>>>>> Quorum does not guarantee a consistency.
>>>>
>>>> If the operation succeeds, it does (for some definition of consistency
>>>> which is, following reads at Quorum will be guaranteed to see the new value
>>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>> consistency.
>>>> It is important to note that the word consistency has multiple meanings.
>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>> talking of the same definition as the C in ACID
>>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>
>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> <sylv...@datastax.com> wrote:
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <chirayit...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>>>>> >> part of the application logic!!!
>>>>>>>
>>>>>>> >>What is your definition of conflict resolution ?
>>>>>>> >> Because if you update twice the same column (which
>>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>>> >> update wins (which I'll call a resolution).
>>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>>> here. And yes we are responding to the immediate questions without
>>>>>>> covering all questions in the thread.
>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>
>>>>>> Not quite true.
>>>>>>>
>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>> A Quorum Write comes and add/updates the time stamp (TS2) of a
>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>>> write is returned as failed - right ?
>>>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>>>> write failed for.
>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>
>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>>> make the quorum, then TS2 will be returned, because cassandra will
>>>>>> compare the timestamps and decide what to return based on this. If N2/N3
>>>>>> respond however, both timestamps will be TS1 and so, after timestamp
>>>>>> resolution, it will still be TS1 that will be returned.
>>>>>> So yes timestamp is used for conflict resolution.
>>>>>> In your example, you could get TS1 back because a failed write can leave
>>>>>> your cluster in an inconsistent state. You'd have to retry the quorum
>>>>>> write and only when it succeeds can you be guaranteed that quorum read
>>>>>> will always return TS2.
>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> the write did not make it in (there is no revert).
>>>>>>
>>>>>>>
>>>>>>> Are we on the same page with this interpretation ?
>>>>>>> Regards,
>>>>>>> -JA
>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>> <sylv...@datastax.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Sylvain,
>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>> part of the application logic!!!
>>>>>>>>
>>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>>>>> update twice the same column (which
>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd party
>>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>>
>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>> provides fairly strong durability guarantees, so for some definition
>>>>>>>> you don't "lose updates".
>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't
>>>>>>>> support. If we're talking about the guarantees of transactions, then
>>>>>>>> by all means, cassandra won't provide it. And yes you can use cages or
>>>>>>>> the like to get transactions. But that was not the point of the
>>>>>>>> thread, was it ? The thread is about vector clocks, and that has
>>>>>>>> nothing to do with transactions (vector clocks certainly don't give
>>>>>>>> you transactions).
>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>>> so far I don't think vector clocks would really provide much for
>>>>>>>> Cassandra.
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>>>
>>>>>>>>> -JA
>>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>> <sylv...@datastax.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>>
>>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>>>>>>> > evaluated at all update
>>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>>>> >
>>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>>
>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>> Questions:-
>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>
>>>>>>>>>> No locking, no.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on
>>>>>>>>>>> different nodes can still mess each other up, right ?
>>>>>>>>>>
>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>>> updating the same piece of data means the same column value. In that
>>>>>>>>>> case, the resolution rules are the following:
>>>>>>>>>> - If the updates have a different timestamp, keep the one with
>>>>>>>>>> the higher timestamp. That is, the more recent of two updates wins.
>>>>>>>>>> - If the timestamps are the same, then it compares the values
>>>>>>>>>> (byte comparison) and keeps the highest value.
>>>>>>>>>> This is just to break ties in a consistent manner.
>>>>>>>>>> So if you do two truly concurrent updates (that is from two places
>>>>>>>>>> at the same instant), then you'll end up with one of the updates.
>>>>>>>>>> This is at the column level.
>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>>>>>>> updates don't end up in the same column. This is easily achieved by
>>>>>>>>>> appending some unique identifier to the column name for instance.
>>>>>>>>>> And when reading, do a slice and reconcile whatever you get back
>>>>>>>>>> with whatever logic makes sense. If you do that, congrats, you've
>>>>>>>>>> roughly emulated what vector clocks would do. Btw, no locking or
>>>>>>>>>> anything needed.
>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>> enough. If the same user updates their profile picture twice on your
>>>>>>>>>> web site at the same microsecond, it's usually fine to end up with
>>>>>>>>>> one of the two pictures. In the rare case where you need something
>>>>>>>>>> more specific, using the cassandra data model usually solves the
>>>>>>>>>> problem easily. The reason for not having vector clocks in Cassandra
>>>>>>>>>> is that so far, we haven't really found many examples where it is
>>>>>>>>>> not the case.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sylvain
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
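
For reference, the resolution rules quoted in the thread (higher timestamp wins, ties broken by byte comparison of the values) and the N1/N2/N3 quorum-read example can be sketched in Python roughly as follows. This is a toy model under stated assumptions, not Cassandra's implementation: `resolve`, the `nodes` map, and the integer timestamps 1 and 2 standing in for TS1 and TS2 are all made up for illustration.

```python
from itertools import combinations

def resolve(a, b):
    """Pick the winner of two (timestamp, value_bytes) versions.

    Higher timestamp wins; on equal timestamps the larger value
    (byte comparison) wins, so every replica picks the same one.
    """
    return max(a, b)  # tuple comparison: timestamp first, then bytes

# State after the failed quorum write: N1 took TS2, N2/N3 still hold TS1.
nodes = {"N1": (2, b"new"), "N2": (1, b"old"), "N3": (1, b"old")}

# A quorum read (2 of 3) returns what the two responding replicas resolve to.
results = {
    pair: resolve(nodes[pair[0]], nodes[pair[1]])
    for pair in combinations(sorted(nodes), 2)
}

for pair, version in results.items():
    print(pair, "->", version)
```

Any pair that includes N1 resolves to the TS2 version, while the N2/N3 pair resolves to TS1, which is why the result of a quorum read after a failed write depends on which replicas answer.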