Re: How does cassandra achieve Linearizability?

Kant Kodali Wed, 22 Feb 2017 17:29:00 -0800

I hope that patch is reviewed as quickly as possible. We use LWTs heavily
and we are getting a throughput of 600 writes/sec, with each write being 1 KB
in our case.






On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <[email protected]>
wrote:

>
>
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <[email protected]> wrote:
>
>> Hi,
>>
>> No it's not going to be in 3.11.x. The earliest release it could make it
>> into is 4.0.
>>
>> Ariel
>>
>> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>>
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-6246 says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce the number of
>> roundtrips and not just address the dueling proposer issue. Those will have
>> downtime because it's there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't believe
>> it's linearizable across multiple partitions. You can use the clustering
>> key and deterministically pick one of the live replicas for that clustering
>> key. Sort the list of replicas by IP, hash the clustering key, use the hash
>> as an index into the list of replicas.
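A minimal sketch of the replica-picking idea described above (sort the live replicas by IP, hash the key, use the hash as an index). The function name `pick_proposer`, the sample addresses, and the choice of SHA-256 are assumptions made for illustration; none of this is Cassandra code.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically pick one live replica for a key (hypothetical helper).

    live_replicas: iterable of IP address strings
    key: the key bytes to hash
    """
    replicas = list(live_replicas)
    if not replicas:
        raise ValueError("no live replicas")
    # Sort by numeric IP so every coordinator derives the same ordering,
    # regardless of the order in which it learned about the replicas.
    ordered = sorted(replicas, key=lambda ip: int(ipaddress.ip_address(ip)))
    # Use a stable hash; Python's built-in hash() is salted per process
    # for bytes and str, so it would not agree across machines.
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


if __name__ == "__main__":
    replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
    print(pick_proposer(replicas, b"some-clustering-key"))
```

Every coordinator that sees the same set of live replicas picks the same node for a given key. When the live set changes, the choice can change too, so this only gives an affinity, not a guarantee of a single proposer.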
>>
>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS. This is
>> something where a distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it currently does.
>>
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>
>>
>>
>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>>
>> Hi,
>>
>> Classic Paxos doesn't have a leader. There are variants on the original
>> Lamport approach that will elect a leader (or some other variation like
>> Mencius) to improve throughput, latency, and performance under contention.
>> Cassandra implements the approach from the beginning of "Paxos Made Simple"
>> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
>> of. There is no distinguished proposer (leader).
>>
>> That paper does go on to discuss electing a distinguished proposer, but
>> that was never done for C*. I believe it's not considered a good fit for C*
>> philosophically.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>
>> @Ariel Weisberg EPaxos looks very interesting since it doesn't seem to
>> need any designated leader for C*. But I am assuming the Paxos that is
>> implemented today for LWTs requires leader election. If so, don't we
>> need an odd number of nodes, racks, or DCs to satisfy the N = 2F + 1
>> constraint to tolerate F failures? I understand it is not needed when not
>> using LWTs since Cassandra is a masterless system.
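A small worked example of the N = 2F + 1 arithmetic in the question above, assuming plain majority quorums over the N replicas of a partition. The helper names are made up for illustration.

```python
def quorum_size(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest F such that a majority is still reachable with F replicas down."""
    return (n - 1) // 2


if __name__ == "__main__":
    for n in range(3, 7):
        print(f"N={n} quorum={quorum_size(n)} F={tolerated_failures(n)}")
```

Under that assumption an even N still works; it just tolerates no more failures than N - 1 does (N = 4 tolerates one failure, the same as N = 3).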
>>
>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <[email protected]> wrote:
>>
>> Thanks Ariel! Yes, I knew there are many variations and optimizations
>> of Paxos. I just wanted to see if we had any plans for improving the
>> existing Paxos implementation, and it is great to see the work is in
>> progress! I am going to follow that ticket and read up on the references
>> it points to.
>>
>>
>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>>
>> Hi,
>>
>> Cassandra's implementation of Paxos doesn't implement many optimizations
>> that would drastically improve throughput and latency. You need consensus,
>> but it doesn't have to be exorbitantly expensive and fall over under any
>> kind of contention.
>>
>> For instance you could implement EPaxos
>> (https://issues.apache.org/jira/browse/CASSANDRA-6246),
>> batch multiple operations into the same Paxos round, have an affinity for a
>> specific proposer for a specific partition, implement asynchronous commit,
>> use a more efficient implementation of the Paxos log, and maybe other
>> things.
>>
>>
>> Ariel
>>
>>
>>
>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>
>> Hi Kant,
>>
>> If you read the published papers about Paxos, you will most probably
>> recognize that there is no way to "do it better". This is a conceptual
>> thing due to the nature of distributed systems + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. C* is made for
>> A+P mostly with tunable C. In ACID databases this is a completely different
>> thing as they are mostly either not partition tolerant, not highly
>> available or not scalable (in a distributed manner, not speaking of
>> "monolithic super servers").
>>
>> There is no free lunch ...
>>
>>
>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <[email protected]>:
>>
>> "That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra."
>>
>> Yes, LWTs are expensive. Are there any plans to make this better?
>>
>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <[email protected]> wrote:
>>
>> Hi Jon,
>>
>> Thanks a lot for your response. I am well aware that LWW != LWT, but I
>> was talking more in terms of LWW with respect to LWTs, which I believe
>> you answered. So thanks much!
>>
>>
>> kant
>>
>>
>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <[email protected]>
>> wrote:
>>
>> LWT != Last Write Wins.  They are totally different.
>>
>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
>> meaning you are able to perform operations atomically and in isolation.
>> That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra.  The lightweight part, btw, may be a little
>> optimistic, especially if a key is under contention.  With regard to the
>> “last write” part you’re asking about - w/ LWT Cassandra provides the
>> timestamp and manages it as part of the ballot, and it always is
>> increasing.  See 
>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>> From the code:
>>
>>  * Returns a timestamp suitable for paxos given the timestamp of the last
>>  * known commit (or in progress update).
>>  * Paxos ensures that the timestamp it uses for commits respects the
>>  * serial order of those commits. It does so by having each replica reject
>>  * any proposal whose timestamp is not strictly greater than the last
>>  * proposal it accepted. So in practice, which timestamp we use for a given
>>  * proposal doesn't affect correctness but it does affect the chance of
>>  * making progress (if we pick a timestamp lower than what has been proposed
>>  * before, our new proposal will just get rejected).
>>
>> Effectively paxos removes the ability to use custom timestamps and
>> addresses clock variance by rejecting ballots with timestamps less than
>> what was last seen.  You can learn more by reading through the other
>> comments and code in that file.
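A toy sketch of the rule quoted above: a replica rejects any proposal whose timestamp is not strictly greater than the last one it accepted, and the proposer picks a timestamp above whatever it last saw. `BallotClock` and `Replica` are hypothetical names for illustration; this is not the code in `ClientState`, and it leaves out the rest of the Paxos protocol (promises, quorums, commits).

```python
import time


class BallotClock:
    """Hands out strictly increasing microsecond timestamps (hypothetical)."""

    def __init__(self, now_micros=None):
        # now_micros is injectable so the behaviour can be checked with a
        # fixed clock; by default it reads the wall clock.
        self._now = now_micros or (lambda: time.time_ns() // 1000)
        self._last = 0

    def next_timestamp(self, last_known=0):
        # Never go below the wall clock, the last commit we know about,
        # or anything this clock already handed out.
        candidate = max(self._now(), last_known + 1, self._last + 1)
        self._last = candidate
        return candidate


class Replica:
    """Accepts a proposal only if its timestamp is strictly greater."""

    def __init__(self):
        self.last_accepted = -1

    def propose(self, timestamp):
        if timestamp <= self.last_accepted:
            return False  # stale ballot: rejected, proposer must retry
        self.last_accepted = timestamp
        return True


if __name__ == "__main__":
    clock = BallotClock()
    replica = Replica()
    first = clock.next_timestamp()
    print(replica.propose(first))   # accepted
    print(replica.propose(first))   # same timestamp again: rejected
```

A proposer whose wall clock is behind still makes progress here, because the timestamp is pushed past the last known commit instead of being taken from the clock alone.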
>>
>> Last write wins is a free for all that guarantees you *nothing* except
>> the timestamp is used as a tiebreaker.  Here we acknowledge things like the
>> speed of light as being a real problem that isn’t going away anytime soon.
>> This problem is sometimes addressed with event sourcing rather than
>> mutating in place.
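To make the last-write-wins point concrete, here is a minimal sketch of how clock skew can make the earlier write win. The tuple layout and the numbers are invented for illustration, and the tie-break on the value is simply what tuple comparison does in this sketch.

```python
def lww_merge(a, b):
    """Each write is (timestamp_micros, value); the higher timestamp wins.

    On equal timestamps the tuple comparison falls through to the value.
    """
    return max(a, b)


if __name__ == "__main__":
    # Client A's clock runs 200 ms fast. It writes first in real time
    # (at 1_000_000 us) but stamps the write 1_200_000.
    write_a = (1_200_000, "written first")
    # Client B has an accurate clock and writes 100 ms later in real time.
    write_b = (1_100_000, "written second")
    # The write that actually happened earlier is the one that survives.
    print(lww_merge(write_a, write_b))
```

Nothing in the merge can detect that the surviving write was the older one, which is why LWW on its own cannot give linearizability.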
>>
>> Hope this helps.
>>
>>
>> Jon
>>
>>
>>
>>
>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <[email protected]> wrote:
>>
>> @Justin I read this article:
>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
>> It clearly says linearizable consistency can be achieved with LWTs. So
>> should I assume the linearizability in the context of the above article is
>> possible with LWTs plus synchronization of clocks through ntpd? Because
>> LWTs also follow Last Write Wins, don't they? Another question: do most
>> production clusters set up ntpd? If so, how long does it take to sync? Any
>> idea?
>>
>> @Michael Shuler Are you referring to something like TrueTime, as in the
>> Spanner paper
>> (https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf)?
>> Actually I have never heard of setting up GPS modules or how that can
>> help. Let me research that, but good point.
>>
>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <[email protected]>
>> wrote:
>>
>> If you require the best precision you can get, setting up a pair of
>> stratum 1 ntpd masters in each data center location with GPS modules
>> is not terribly complex. Low latency and jitter on servers you manage.
>> 140ms is a long way away network-wise, and I would suggest that was a
>> poor choice of upstream (probably stratum 2 or 3) source.
>>
>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
>> need as close as you can get, you'll probably need to do it yourself.
>>
>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>
>> --
>> Kind regards,
>> Michael
>>
>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>> > Hi Justin,
>> >
>> > There are a bunch of issues w.r.t. synchronization of clocks when we
>> > used ntpd. Also the time it took to sync the clocks was approx 140ms
>> > (don't quote me on it though, because it was reported by our devops :)
>> >
>> > We have multiple clients (for example, a bunch of microservices are
>> > reading from Cassandra). I am not sure how one can achieve
>> > linearizability by setting timestamps on the clients, since there is no
>> > total ordering across multiple clients.
>> >
>> > Thanks!
>> >
>> >
>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> >     Hi Kant,
>> >
>> >     Clock synchronization is important - you should ensure that ntpd is
>> >     properly configured on all nodes. If your particular use case is
>> >     especially sensitive to out-of-order mutations it is possible to set
>> >     timestamps on the client side using the drivers:
>> >     https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>> >
>> >     We use our own NTP cluster to reduce clock drift as much as
>> >     possible, but public NTP servers are good enough for most uses:
>> >     https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>> >
>> >     Cheers,
>> >     Justin
>> >
>> >     On Thu, 9 Feb 2017 at 16:09 Kant Kodali <[email protected]
>> >     <mailto:[email protected]>> wrote:
>> >
>> >         How does Cassandra achieve Linearizability with “Last write
>> >         wins” (conflict resolution methods based on time-of-day clocks)?
>> >
>> >         Relying on synchronized clocks is almost certainly
>> >         non-linearizable, because clock timestamps cannot be guaranteed
>> >         to be consistent with actual event ordering due to clock skew.
>> >         Isn't it?
>> >
>> >         Thanks!
>> >
>> >     --
>> >
>> >     Justin Cameron
>> >
>> >     Senior Software Engineer | Instaclustr
>> >
>>
>> One thing that always bothered me: Intelligent clients and dynamic snitch
>> are designed to attempt to route requests to the same node to attempt to
>> take advantage of cache pinning etc. You would think under these conditions
>> one could naturally elect a "leader" for a "group" of keys that could
>> persist for a few hundred milliseconds and batch up the round trips for a
>> number of operations. Maybe that is what the distinguished coordinator is
>> in some regards.
>>
>>
>>
>>
> My two cents: The current issue is "feature complete" and the author
> stated it was ready for review 2 years ago. But I can see that as the issue
> stands it forces some hard choices to be made concerning the migration path
> and in-depth code changes.
>
> Also I think there is some question (in my mind) as to how we ensure some
> of the subtle contracted/non-contracted semantics stay in place. As in, they
> work a "certain way", and how confident is everyone that a "better way" does
> not end up causing some pain for someone using it currently? I assume this
> is a common case where a feature request is not being engaged with.
>
