I hope that patch gets reviewed as soon as possible. We use LWTs heavily, and we are getting a throughput of 600 writes/sec, where each write is 1 KB in our case.
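
For what it's worth, the proposer-affinity idea Ariel describes in the quoted thread below (sort the list of replicas by IP, hash the key, use the hash as an index into the list) could look roughly like the following. This is only a minimal sketch and not anything from the Cassandra code base: the function name `pick_proposer`, the use of MD5 as the stable hash, and the assumption that all replicas are given as IPv4 strings are mine.

```python
import hashlib
import ipaddress


def pick_proposer(live_replicas, key):
    """Deterministically pick one live replica to act as proposer for a key.

    Hypothetical helper: every coordinator that sees the same set of live
    replicas computes the same answer, which is the point of the affinity.
    """
    if not live_replicas:
        raise ValueError("no live replicas")
    # Sort numerically by IP so that all coordinators agree on the order.
    ordered = sorted(live_replicas, key=ipaddress.ip_address)
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


if __name__ == "__main__":
    replicas = ["10.0.0.3", "10.0.0.1", "10.0.0.2"]
    print(pick_proposer(replicas, "user:42"))
```

Note that the choice changes whenever the set of live replicas changes, so this only reduces dueling proposers. Paxos itself still has to provide correctness.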
On Wed, Feb 22, 2017 at 7:48 AM, Edward Capriolo <[email protected]> wrote:
>
> On Wed, Feb 22, 2017 at 9:47 AM, Ariel Weisberg <[email protected]> wrote:
>
>> Hi,
>>
>> No it's not going to be in 3.11.x. The earliest release it could make it
>> into is 4.0.
>>
>> Ariel
>>
>> On Wed, Feb 22, 2017, at 03:34 AM, Kant Kodali wrote:
>>
>> Hi Ariel,
>>
>> Can we really expect the fix in 3.11.x as the ticket
>> https://issues.apache.org/jira/browse/CASSANDRA-6246
>> <https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22>
>> says?
>>
>> Thanks,
>> kant
>>
>> On Thu, Feb 16, 2017 at 2:12 PM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> That would work and would help a lot with the dueling proposer issue.
>>
>> A lot of the leader election stuff is designed to reduce the number of
>> roundtrips and not just address the dueling proposer issue. Those will have
>> downtime because it's there for correctness. Just adding an affinity for a
>> specific proposer is probably a free lunch.
>>
>> I don't think you can group keys because the Paxos proposals are per
>> partition which is why we get linear scale out for Paxos. I don't believe
>> it's linearizable across multiple partitions. You can use the clustering
>> key and deterministically pick one of the live replicas for that clustering
>> key. Sort the list of replicas by IP, hash the clustering key, use the hash
>> as an index into the list of replicas.
>>
>> Batching is of limited usefulness because we only use Paxos for CAS I
>> think? So in a batch by definition all but one will fail the CAS. This is
>> something where a distinguished coordinator could help by failing the rest
>> of the contending requests more inexpensively than it currently does.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:55 PM, Edward Capriolo wrote:
>>
>> On Thu, Feb 16, 2017 at 4:33 PM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> Classic Paxos doesn't have a leader. There are variants on the original
>> Lamport approach that will elect a leader (or some other variation like
>> Mencius) to improve throughput, latency, and performance under contention.
>> Cassandra implements the approach from the beginning of "Paxos Made Simple"
>> (https://goo.gl/SrP0Wb) with no additional optimizations that I am aware
>> of. There is no distinguished proposer (leader).
>>
>> That paper does go on to discuss electing a distinguished proposer, but
>> that was never done for C*. I believe it's not considered a good fit for C*
>> philosophically.
>>
>> Ariel
>>
>> On Thu, Feb 16, 2017, at 04:20 PM, Kant Kodali wrote:
>>
>> @Ariel Weisberg EPaxos looks very interesting as it looks like it doesn't
>> need any designated leader for C* but I am assuming the paxos that is
>> implemented today for LWT's requires Leader election and If so, don't we
>> need to have an odd number of nodes or racks or DC's to satisfy N = 2F + 1
>> constraint to tolerate F failures ? I understand it is not needed when not
>> using LWT's since Cassandra is a master-less system.
>>
>> On Fri, Feb 10, 2017 at 10:25 AM, Kant Kodali <[email protected]> wrote:
>>
>> Thanks Ariel! Yes I knew there are so many variations and optimizations
>> of Paxos. I just wanted to see if we had any plans on improving the
>> existing Paxos implementation and it is great to see the work is under
>> progress! I am going to follow that ticket and read up the references
>> pointed in it
>>
>> On Fri, Feb 10, 2017 at 8:33 AM, Ariel Weisberg <[email protected]>
>> wrote:
>>
>> Hi,
>>
>> Cassandra's implementation of Paxos doesn't implement many optimizations
>> that would drastically improve throughput and latency. You need consensus,
>> but it doesn't have to be exorbitantly expensive and fall over under any
>> kind of contention.
>>
>> For instance you could implement EPaxos
>> https://issues.apache.org/jira/browse/CASSANDRA-6246
>> <https://issues.apache.org/jira/browse/CASSANDRA-6246?jql=text%20~%20%22epaxos%22>,
>> batch multiple operations into the same Paxos round, have an affinity for a
>> specific proposer for a specific partition, implement asynchronous commit,
>> use a more efficient implementation of the Paxos log, and maybe other
>> things.
>>
>> Ariel
>>
>> On Fri, Feb 10, 2017, at 05:31 AM, Benjamin Roth wrote:
>>
>> Hi Kant,
>>
>> If you read the published papers about Paxos, you will most probably
>> recognize that there is no way to "do it better". This is a conceptional
>> thing due to the nature of distributed systems + the CAP theorem.
>> If you want A+P in the triangle, then C is very expensive. CS is made for
>> A+P mostly with tunable C. In ACID databases this is a completely different
>> thing as they are mostly either not partition tolerant, not highly
>> available or not scalable (in a distributed manner, not speaking of
>> "monolithic super servers").
>>
>> There is no free lunch ...
>>
>> 2017-02-10 11:09 GMT+01:00 Kant Kodali <[email protected]>:
>>
>> "That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra."
>>
>> yes LWT's are expensive. Are there any plans to make this better?
>>
>> On Fri, Feb 10, 2017 at 12:17 AM, Kant Kodali <[email protected]> wrote:
>>
>> Hi Jon,
>>
>> Thanks a lot for your response. I am well aware that the LWW != LWT but I
>> was talking more in terms of LWW with respective to LWT's which I believe
>> you answered. so thanks much!
>>
>> kant
>>
>> On Thu, Feb 9, 2017 at 6:01 PM, Jon Haddad <[email protected]>
>> wrote:
>>
>> LWT != Last Write Wins. They are totally different.
>>
>> LWTs give you (assuming you also read at SERIAL) “atomic consistency”,
>> meaning you are able to perform operations atomically and in isolation.
>> That’s the safety blanket everyone wants but is extremely expensive,
>> especially in Cassandra. The lightweight part, btw, may be a little
>> optimistic, especially if a key is under contention. With regard to the
>> “last write” part you’re asking about - w/ LWT Cassandra provides the
>> timestamp and manages it as part of the ballot, and it always is
>> increasing. See
>> org.apache.cassandra.service.ClientState#getTimestampForPaxos.
>> From the code:
>>
>> * Returns a timestamp suitable for paxos given the timestamp of the last
>> known commit (or in progress update).
>> * Paxos ensures that the timestamp it uses for commits respects the
>> serial order of those commits. It does so
>> * by having each replica reject any proposal whose timestamp is not
>> strictly greater than the last proposal it
>> * accepted. So in practice, which timestamp we use for a given proposal
>> doesn't affect correctness but it does
>> * affect the chance of making progress (if we pick a timestamp lower
>> than what has been proposed before, our
>> * new proposal will just get rejected).
>>
>> Effectively paxos removes the ability to use custom timestamps and
>> addresses clock variance by rejecting ballots with timestamps less than
>> what was last seen. You can learn more by reading through the other
>> comments and code in that file.
>>
>> Last write wins is a free for all that guarantees you *nothing* except
>> the timestamp is used as a tiebreaker. Here we acknowledge things like the
>> speed of light as being a real problem that isn’t going away anytime soon.
>> This problem is sometimes addressed with event sourcing rather than
>> mutating in place.
>>
>> Hope this helps.
>>
>> Jon
>>
>> On Feb 9, 2017, at 5:21 PM, Kant Kodali <[email protected]> wrote:
>>
>> @Justin I read this article
>> http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0.
>> And it clearly says
>> Linearizable consistency can be achieved with LWT's. so should I assume
>> the Linearizability in the context of the above article is possible with
>> LWT's and synchronization of clocks through ntpd ? because LWT's also
>> follow Last Write Wins. isn't it? Also another question does most of the
>> production clusters do setup ntpd? If so what is the time it takes to sync?
>> any idea
>>
>> @Micheal Schuler Are you referring to something like true time as in
>> https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf?
>> Actually I never heard of setting
>> up GPS modules and how that can be helpful. Let me research on that but
>> good point.
>>
>> On Thu, Feb 9, 2017 at 5:09 PM, Michael Shuler <[email protected]>
>> wrote:
>>
>> If you require the best precision you can get, setting up a pair of
>> stratum 1 ntpd masters in each data center location with a GPS modules
>> is not terribly complex. Low latency and jitter on servers you manage.
>> 140ms is a long way away network-wise, and I would suggest that was a
>> poor choice of upstream (probably stratum 2 or 3) source.
>>
>> As Jonathan mentioned, there's no guarantee from Cassandra, but if you
>> need as close as you can get, you'll probably need to do it yourself.
>>
>> (I run several stratum 2 ntpd servers for pool.ntp.org)
>>
>> --
>> Kind regards,
>> Michael
>>
>> On 02/09/2017 06:47 PM, Kant Kodali wrote:
>> > Hi Justin,
>> >
>> > There are bunch of issues w.r.t to synchronization of clocks when we
>> > used ntpd. Also the time it took to sync the clocks was approx 140ms
>> > (don't quote me on it though because it is reported by our devops :)
>> >
>> > we have multiple clients (for example bunch of micro services are
>> > reading from Cassandra) I am not sure how one can achieve
>> > Linearizability by setting timestamps on the clients ? since there is no
>> > total ordering across multiple clients.
>> >
>> > Thanks!
>> >
>> >
>> > On Thu, Feb 9, 2017 at 4:16 PM, Justin Cameron <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> > Hi Kant,
>> >
>> > Clock synchronization is important - you should ensure that ntpd is
>> > properly configured on all nodes. If your particular use case is
>> > especially sensitive to out-of-order mutations it is possible to set
>> > timestamps on the client side using the
>> > drivers. https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/
>> > <https://docs.datastax.com/en/developer/java-driver/3.1/manual/query_timestamps/>
>> >
>> > We use our own NTP cluster to reduce clock drift as much as
>> > possible, but public NTP servers are good enough for most
>> > uses. https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/
>> > <https://www.instaclustr.com/blog/2015/11/05/apache-cassandra-synchronization/>
>> >
>> > Cheers,
>> > Justin
>> >
>> > On Thu, 9 Feb 2017 at 16:09 Kant Kodali <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> > How does Cassandra achieve Linearizability with “Last write
>> > wins” (conflict resolution methods based on time-of-day clocks) ?
>> >
>> > Relying on synchronized clocks are almost certainly
>> > non-linearizable, because clock timestamps cannot be guaranteed
>> > to be consistent with actual event ordering due to clock skew.
>> > isn't it?
>> >
>> > Thanks!
>> >
>> > --
>> >
>> > Justin Cameron
>> >
>> > Senior Software Engineer | Instaclustr
>> >
>> > This email has been sent on behalf of Instaclustr Pty Ltd
>> > (Australia) and Instaclustr Inc (USA).
>> >
>> > This email and any attachments may contain confidential and legally
>> > privileged information. If you are not the intended recipient, do
>> > not copy or disclose its content, but please reply to this email
>> > immediately and highlight the error to the sender and then
>> > immediately delete the message.
>> >
>>
>> --
>> Benjamin Roth
>> Prokurist
>>
>> Jaumo GmbH · www.jaumo.com
>> Wehrstraße 46 · 73035 Göppingen · Germany
>> Phone +49 7161 304880-6 <+49%207161%203048806> · Fax +49 7161 304880-1
>> <+49%207161%203048801>
>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>
>> One thing that always bothered me: Intelligent clients and dynamic snitch
>> are designed to attempt to route requests to the same node to attempt to
>> take advantage of cache pinning etc. You would think under these conditions
>> one could naturally elect a "leader" for a "group" of keys that could
>> persist for a few hundred milliseconds and batch up the round trips for a
>> number of operations. Maybe that is what the distinguished coordinator is
>> in some regards.
>>
>
> My two cents: The current issue is "feature complete" and the author
> stated ready for review 2 years ago. But I can see that as the issue stands
> it forces some hard choices to be made concerning the migration path and in
> depth code changes.
>
> Also I think there is some question (in my mind) as to how we ensure some
> of the subtle contracted/non contracted semantics stay in place. As in they
> work a "certain way" and how confident is everyone that a "better way" does
> not end up causing some pain for someone using it currently. I assume this
> as a common case where a feature request is not being engaged with.
>
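
On Jon's point further up about replicas rejecting any proposal whose timestamp is not strictly greater than the last one they accepted, here is how I understand that rule, as a rough sketch. This is not the actual Cassandra code: the `Replica` class, the `timestamp_for_paxos` function, and the "last seen + 1" choice are my own assumptions for illustration, based only on the comment Jon quoted.

```python
import time


class Replica:
    """Toy replica that only tracks the last accepted proposal timestamp."""

    def __init__(self):
        self.last_accepted_micros = 0

    def accept(self, proposal_micros):
        # Reject anything that is not strictly greater than the last
        # accepted proposal, as described in the quoted comment.
        if proposal_micros <= self.last_accepted_micros:
            return False
        self.last_accepted_micros = proposal_micros
        return True


def timestamp_for_paxos(last_seen_micros, now_micros=None):
    """Pick a proposal timestamp that has a chance of making progress.

    Assumption: use the wall clock unless it is behind what was already
    seen, in which case go one microsecond past the last seen value.
    """
    if now_micros is None:
        now_micros = time.time_ns() // 1000
    return max(now_micros, last_seen_micros + 1)


if __name__ == "__main__":
    replica = Replica()
    ts = timestamp_for_paxos(replica.last_accepted_micros)
    print(replica.accept(ts), replica.accept(ts))
```

So a proposer with a slow clock does not break correctness. Its proposal just gets rejected until it picks a timestamp past what the replicas have already seen.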
