Re: New Chain for : Does Cassandra use vector clocks

A J Fri, 25 Feb 2011 07:08:39 -0800

He has a product to sell, so you can expect some advertising. But in
general, Stonebraker's articles go very deep (another one that
challenges common conceptions is
http://voltdb.com/voltdb-webinar-sql-urban-myths ). He is the creator
of Postgres and is considered a database guru by many.
And actually, if you cannot let go of ACID and are not satisfied with
traditional DBMS solutions, VoltDB is worth considering. Of course, it
solves a different problem (OLTP) than Cassandra does.



On Thu, Feb 24, 2011 at 5:20 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Thu, Feb 24, 2011 at 3:56 PM, A J <s5a...@gmail.com> wrote:
>> While we are at it, there's more to consider than just CAP in distributed :)
>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>
>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxg...@gmail.com> 
>> wrote:
>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5a...@gmail.com> wrote:
>>>> yes, that is difficult to digest and one has to be sure if the use
>>>> case can afford it.
>>>>
>>>> Some other NoSQL databases deal with it differently (though I don't
>>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>>> ask you to read from the node you wrote to first (the primary node)
>>>> unless you are OK with eventual consistency. If the write did not make
>>>> it to a majority of the other nodes, it will be rolled back from the
>>>> original primary when it comes up again as a secondary.
>>>> In some cases, you could still be served either the new value (that was
>>>> returned as failed) or the old one. But it is different from Cassandra
>>>> in the sense that Cassandra will never roll back.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayit...@gmail.com> 
>>>> wrote:
>>>>> The leap of faith here is that an error does not mean a clean backing
>>>>> out to prior state - as we are used to with databases. It means that
>>>>> the operation in error could have gone through partially
>>>>>
>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt
>>>>> with.
>>>>> -JA
>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5a...@gmail.com> wrote:
>>>>>>
>>>>>> >>but could be broken in case of a failed write<<
>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>> Let's say the one node where a write succeeded goes down before the
>>>>>> write made it to the other N-1 nodes. Let's say it goes down for good
>>>>>> and is unrecoverable. The only option is to build a new node from
>>>>>> scratch from the other active nodes. The write is then lost, and you
>>>>>> will end up serving a stale copy of the data.
>>>>>>
>>>>>> It is better to talk in terms of use cases and whether Cassandra is a
>>>>>> fit for them. Otherwise, unless you have W=R=N and fsync before each
>>>>>> write commit, there will be scope for inconsistency.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayit...@gmail.com>
>>>>>> wrote:
>>>>>> > I see the point - apologies for putting everyone through this!
>>>>>> > It was just militating against my mental model.
>>>>>> > In summary, here is my takeaway - simple stuff but - IMO - important
>>>>>> > to conclude this thread (I hope):
>>>>>> > 1. I was splitting hairs over a failed (partial) Q write. Such an
>>>>>> > event should be immediately followed by the same write going over a
>>>>>> > connection to another node (potentially using connection caches of
>>>>>> > client implementations) or a read at CL of ALL, because a write could
>>>>>> > have partially gone through.
>>>>>> > 2. Timestamps are used in determining the latest version (correcting
>>>>>> > the false impression I was propagating).
>>>>>> > Finally, the "W + R > N for Q CL" statement holds, but could be broken
>>>>>> > in case of a failed write, as it is unsure whether the new value got
>>>>>> > written on any server or not. Is that a fair characterization?
>>>>>> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
>>>>>> > cleanup and revert; app code has to follow up if immediate - and not
>>>>>> > eventual - consistency is desired. I made that leap in almost all
>>>>>> > cases - I think - but the case of a failed write.
>>>>>> > My bad and I can live with this!
>>>>>> > Regards,
>>>>>> > -JA
>>>>>> >
>>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>> > <sylv...@datastax.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayit...@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Completely understand!
>>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>> >>> consistency or not. That is what the documentation says - right? If,
>>>>>> >>> for a CL of Q read, it depends on which node returns the read first
>>>>>> >>> to determine the actual returned result, or on other more convoluted
>>>>>> >>> conditions, then a Quorum read/write is not consistent, by any
>>>>>> >>> definition.
>>>>>> >>
>>>>>> >> But that's the point. The definition of consistency we are talking
>>>>>> >> about has no meaning if you consider only a quorum read. The
>>>>>> >> definition (which is the de facto definition of consistency in
>>>>>> >> 'eventually consistent') makes sense if we talk about a write
>>>>>> >> followed by a read. And it considers a succeeding write followed
>>>>>> >> by a succeeding read.
>>>>>> >> And that is the statement the wiki is making.
>>>>>> >> Honestly, we could debate forever on the definition of consistency
>>>>>> >> and whatnot. Cassandra guarantees that if you do a (succeeding) write
>>>>>> >> on W replicas and then a (succeeding) read on R replicas and if
>>>>>> >> R+W>N, then it is guaranteed that the read will see the preceding
>>>>>> >> write. And this is what is called consistency in the context of
>>>>>> >> eventual consistency (which is not the context of ACID).
>>>>>> >> If this is not the definition of consistency you had in mind then, by
>>>>>> >> all means, Cassandra probably doesn't guarantee this definition. But
>>>>>> >> given that the paragraph preceding what you pasted states clearly
>>>>>> >> that we are not talking about ACID consistency, but eventual
>>>>>> >> consistency, I don't think the wiki is making any unfair statement.
>>>>>> >> That being said, the wiki may not always be as clear as it could. But
>>>>>> >> it's an editable wiki :)
>>>>>> >> --
>>>>>> >> Sylvain
>>>>>> >>
>>>>>> >>>
>>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>> >>> make
>>>>>> >>> this statement on the Wiki architecture section:-
>>>>>> >>> -------------------------------------------------------------
>>>>>> >>>
>>>>>> >>> More specifically: R=read replica count W=write replica
>>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>> >>>
>>>>>> >>> If W + R > N, you will have consistency
>>>>>> >>>
>>>>>> >>> W=1, R=N
>>>>>> >>> W=N, R=1
>>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>>> >>>
>>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>>> >>> + write replica count > replication factor).
>>>>>> >>>
>>>>>> >>> ----------------------------------------------------
>>>>>> >>>
>>>>>> >>> .
>>>>>> >>>
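
A minimal sketch of why the quoted wiki rule holds, assuming replicas are just numbered 0..N-1 and that a successful write or read touches exactly W or R of them. The names `quorum` and `always_overlap` are hypothetical, chosen for illustration; this is a brute-force check of the counting argument, not Cassandra code.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division, as in the quoted wiki text."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares at least one
    replica with every set of r read replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    n = 3
    for w, r in [(1, n), (n, 1), (quorum(n), quorum(n)), (1, 1)]:
        print(f"N={n} W={w} R={r} overlap guaranteed: {always_overlap(n, w, r)}")
```

The overlap is what lets a read see the preceding successful write: at least one replica in the read set holds the new value. It says nothing about a write that was reported as failed.
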
>>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>> >>> <sylv...@datastax.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John 
>>>>>> >>>> <chirayit...@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> If you are correct and you are probably closer to the code - then
>>>>>> >>>>> CL of Quorum does not guarantee a consistency.
>>>>>> >>>>
>>>>>> >>>> If the operation succeeds, it does (for some definition of
>>>>>> >>>> consistency, which is: following reads at Quorum will be guaranteed
>>>>>> >>>> to see the new value of an update at quorum). If it fails, then no,
>>>>>> >>>> it does not guarantee consistency.
>>>>>> >>>> It is important to note that the word consistency has multiple
>>>>>> >>>> meanings. In particular, when we are talking of consistency in
>>>>>> >>>> Cassandra, we are not talking of the same definition as the C in
>>>>>> >>>> ACID
>>>>>> >>>>
>>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>> >>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>> >>>>>> <chirayit...@gmail.com>
>>>>>> >>>>>> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>> >> part of the application logic!!!
>>>>>> >>>>>>>
>>>>>> >>>>>>> >>What is your definition of conflict resolution? Because if you
>>>>>> >>>>>>> >> update the same column twice (which I'll call a conflict), then
>>>>>> >>>>>>> >> the timestamps are used to decide which update wins (which I'll
>>>>>> >>>>>>> >> call a resolution).
>>>>>> >>>>>>> I understand what you are saying, and yes, semantics is very
>>>>>> >>>>>>> important here. And yes, we are responding to the immediate
>>>>>> >>>>>>> questions without covering all questions in the thread.
>>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>>> >>>>>>> not used by Cassandra to figure out what data to return.
>>>>>> >>>>>>
>>>>>> >>>>>> Not quite true.
>>>>>> >>>>>>>
>>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> >>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a
>>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So
>>>>>> >>>>>>> the write is returned as failed - right?
>>>>>> >>>>>>> Now a Quorum read comes in for exactly the same piece of data that
>>>>>> >>>>>>> the write failed for.
>>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1).
>>>>>> >>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>>> >>>>>>
>>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two
>>>>>> >>>>>> that make the quorum, then TS2 will be returned, because Cassandra
>>>>>> >>>>>> will compare the timestamps and decide what to return based on
>>>>>> >>>>>> this. If N2/N3 respond, however, both timestamps will be TS1 and
>>>>>> >>>>>> so, after timestamp resolution, it will still be TS1 that will be
>>>>>> >>>>>> returned.
>>>>>> >>>>>> So yes, timestamps are used for conflict resolution.
>>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry
>>>>>> >>>>>> the quorum write, and only when it succeeds can you be guaranteed
>>>>>> >>>>>> that a quorum read will always return TS2.
>>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>>> >>>>>> that the write did not make it in (there is no revert).
>>>>>> >>>>>>
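
The N1/N2/N3 example above can be sketched as a toy simulation. Everything here is an assumption made for illustration: replicas are a plain dict, TS1/TS2 are the integers 1 and 2, and `quorum_read` / `all_quorum_reads` are hypothetical helper names rather than anything from Cassandra.

```python
from itertools import combinations

# State after the failed quorum write: only N1 took the new value.
# Each replica holds a (value, timestamp) pair; TS1 = 1, TS2 = 2.
REPLICAS = {"N1": ("new", 2), "N2": ("old", 1), "N3": ("old", 1)}


def quorum_read(responders, replicas=REPLICAS):
    """Return the (value, timestamp) with the highest timestamp
    among the replicas that answered the read."""
    return max((replicas[name] for name in responders), key=lambda vt: vt[1])


def all_quorum_reads(replicas=REPLICAS, quorum=2):
    """Result of the read for every possible pair of responding replicas."""
    return {
        pair: quorum_read(pair, replicas)
        for pair in combinations(sorted(replicas), quorum)
    }


if __name__ == "__main__":
    for pair, result in all_quorum_reads().items():
        print(pair, "->", result)
```

Two of the three possible responder pairs include N1 and return the new value; the N2/N3 pair returns the old one, which is the inconsistency a failed write can leave behind until the write is retried successfully.
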
>>>>>> >>>>>>>
>>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> -JA
>>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>> >>>>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>> >>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Sylvain,
>>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>>> part of the application logic!!!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> What is your definition of conflict resolution? Because if you
>>>>>> >>>>>>>> update the same column twice (which I'll call a conflict), then
>>>>>> >>>>>>>> the timestamps are used to decide which update wins (which I'll
>>>>>> >>>>>>>> call a resolution).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>>> >>>>>>>>> products - Cages, e.g. - to get ACID type consistency.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>>> >>>>>>>> Cassandra provides fairly strong durability guarantees, so for
>>>>>> >>>>>>>> some definition you don't "lose updates".
>>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>>> >>>>>>>> ACID guarantee. ACID relates to transactions, which Cassandra
>>>>>> >>>>>>>> doesn't support. If we're talking about the guarantees of
>>>>>> >>>>>>>> transactions, then by all means, Cassandra won't provide them.
>>>>>> >>>>>>>> And yes, you can use Cages or the like to get transactions. But
>>>>>> >>>>>>>> that was not the point of the thread, was it? The thread is about
>>>>>> >>>>>>>> vector clocks, and that has nothing to do with transactions
>>>>>> >>>>>>>> (vector clocks certainly don't give you transactions).
>>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>>> >>>>>>>> why so far I don't think vector clocks would really provide much
>>>>>> >>>>>>>> for Cassandra.
>>>>>> >>>>>>>> --
>>>>>> >>>>>>>> Sylvain
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> -JA
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>> >>>>>>>>> <sylv...@datastax.com> wrote:
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>> >>>>>>>>>> <chirayit...@gmail.com> wrote:
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> > On the other hand, the same article says:
>>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>> >>>>>>>>>>> > evaluated at all update
>>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >>>>>>>>>>> >
>>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>>> >>>>>>>>>>> > used
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>> >>>>>>>>>>> Questions:-
>>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> No locking, no.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> >>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
>>>>>> >>>>>>>>>>> data on different nodes can still mess each other up, right?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But at any CL,
>>>>>> >>>>>>>>>> updating the same piece of data means the same column value.
>>>>>> >>>>>>>>>> In that case, the resolution rules are the following:
>>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>>> >>>>>>>>>> wins.
>>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>>> >>>>>>>>>> (byte comparison) and keeps the highest value. This is just to
>>>>>> >>>>>>>>>> break ties in a consistent manner.
>>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two
>>>>>> >>>>>>>>>> places at the same instant), then you'll end up with one of the
>>>>>> >>>>>>>>>> updates. This is at the column level.
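
A minimal sketch of the two resolution rules quoted above, assuming a column version is just a timestamp plus a bytes value. `Column` and `reconcile` are hypothetical names, and Python's lexicographic `bytes` ordering stands in for the "byte comparison" mentioned; the exact ordering Cassandra uses is not stated in the thread.

```python
from typing import NamedTuple


class Column(NamedTuple):
    timestamp: int  # client-supplied timestamp, e.g. microseconds
    value: bytes


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner of two updates to the same column."""
    # Rule 1: different timestamps -> the higher (more recent) one wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Rule 2: equal timestamps -> the higher value wins, so every
    # replica breaks the tie the same way.
    return a if a.value >= b.value else b


if __name__ == "__main__":
    older = Column(1, b"zebra")
    newer = Column(2, b"apple")
    print(reconcile(older, newer))  # the newer update, regardless of value
    print(reconcile(Column(5, b"a"), Column(5, b"b")))  # tie broken by value
```

Because the rule depends only on the two versions and not on arrival order, any replica applying it to the same pair ends up with the same winner.
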
>>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>>> >>>>>>>>>> is not good enough for some of your use cases and you need to
>>>>>> >>>>>>>>>> keep two concurrent updates, it is easy enough. Just make sure
>>>>>> >>>>>>>>>> that the updates don't end up in the same column. This is easily
>>>>>> >>>>>>>>>> achieved by appending some unique identifier to the column name,
>>>>>> >>>>>>>>>> for instance. And when reading, do a slice and reconcile
>>>>>> >>>>>>>>>> whatever you get back with whatever logic makes sense. If you do
>>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would
>>>>>> >>>>>>>>>> do. Btw, no locking or anything needed.
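
A rough sketch of the "unique identifier in the column name" idea, assuming a row is modeled as a plain dict of column name to value. No Cassandra client is used; `write_concurrent`, `slice_columns`, and the `base:uuid` naming scheme are all hypothetical choices made for illustration, and the merge step (set union) is just one example of application logic.

```python
import uuid


def write_concurrent(row: dict, base_name: str, value) -> str:
    """Store the update under its own column so it cannot overwrite
    (or be overwritten by) a concurrent update."""
    name = f"{base_name}:{uuid.uuid4().hex}"
    row[name] = value
    return name


def slice_columns(row: dict, base_name: str) -> list:
    """Emulate a slice read: every value whose column name starts
    with the shared prefix."""
    prefix = base_name + ":"
    return [value for name, value in sorted(row.items()) if name.startswith(prefix)]


def merge_carts(versions: list) -> set:
    """Example application-level reconciliation: union of all versions."""
    merged = set()
    for items in versions:
        merged |= set(items)
    return merged


if __name__ == "__main__":
    row = {}
    write_concurrent(row, "cart", {"book"})
    write_concurrent(row, "cart", {"pen"})
    print(merge_carts(slice_columns(row, "cart")))
```

Both updates survive the write, and the reader decides how to combine them, which is the part a vector-clock system would otherwise hand back as sibling versions.
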
>>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>> >>>>>>>>>> enough. If the same user updates their profile picture twice on
>>>>>> >>>>>>>>>> your web site at the same microsecond, it's usually fine to end
>>>>>> >>>>>>>>>> up with one of the two pictures. In the rare case where you need
>>>>>> >>>>>>>>>> something more specific, using the Cassandra data model usually
>>>>>> >>>>>>>>>> solves the problem easily. The reason for not having vector
>>>>>> >>>>>>>>>> clocks in Cassandra is that so far, we haven't really found many
>>>>>> >>>>>>>>>> examples where it is not the case.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> --
>>>>>> >>>>>>>>>> Sylvain
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Just to make a note: the "EVENTUAL" in eventual consistency could be a
>>> time that is less than 1ms.
>>>
>>> I have a program that demonstrates what "eventual" means: if I write
>>> data at the weakest level and read it back from another random node
>>> as soon as possible, 99% of the time I see the update. I can share the
>>> code if you would like.
>>>
>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>> ...but there is no reference frame in which the two events can occur
>>> at the same time...
>>>
>>> As to the MongoDB references... Yes! Most of the NoSQL databases work
>>> differently. They each approach CAP
>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>> different way.
>>>
>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>> have it all: pick 2 of 3 from CAP.
>>>
>>
>
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
> I was reading that and many of the points were well taken...up until...
>
> Next generation DBMS technologies, such as VoltDB, have been shown to
> run around 50X the speed of conventional SQL engines.  Thus, if you
> need 200 nodes to support a specific SQL application, then VoltDB can
> probably do the same application on 4 nodes.  The probability of a
> failure on 200 nodes is wildly different than the probability of
> failure on four nodes.
>
> Come on? 200 nodes down to 4? I just can not take it seriously any more.
>
