For anyone on the thread who hasn't read Helland's Building on Quicksand paper yet: it discusses the abstract strategy behind #1072's distributed counter implementation. It's here: http://blogs.msdn.com/b/pathelland/archive/2008/12/12/building-on-quicksand-paper-for-cidr-conference-on-innovative-database-research.aspx
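
In a nutshell (as I read it), the strategy is: pick operations that commute, apply them locally without reading anything first, and make the reconciliation between replicas idempotent so it can run as often or as rarely as you like. Here's a toy sketch of that shape in Java -- my own illustration, not the #1072 patch, and the names are made up:

import java.util.HashMap;
import java.util.Map;

// Toy sketch of the counter shape Helland describes. Each replica owns
// one shard of the counter, increments are coalesced locally, and
// repair between replicas is idempotent because it only keeps the
// highest value seen per shard.
class ShardedCounter {
    private final String localReplicaId;
    // shard id (replica id) -> highest count known for that shard
    private final Map<String, Long> shards = new HashMap<String, Long>();

    ShardedCounter(String localReplicaId) {
        this.localReplicaId = localReplicaId;
    }

    // Commutative: deltas can be applied in any order, so the write
    // path never has to read the current total first.
    void increment(long delta) {
        Long current = shards.get(localReplicaId);
        shards.put(localReplicaId, (current == null ? 0L : current) + delta);
    }

    // Idempotent: applying the same remote snapshot twice is harmless,
    // so reconciliation only has to happen between replicas, not per update.
    void repairFrom(Map<String, Long> remoteShards) {
        for (Map.Entry<String, Long> e : remoteShards.entrySet()) {
            Long current = shards.get(e.getKey());
            if (current == null || e.getValue() > current) {
                shards.put(e.getKey(), e.getValue());
            }
        }
    }

    // The total is only known once every shard is summed at read time.
    long total() {
        long sum = 0L;
        for (long v : shards.values()) {
            sum += v;
        }
        return sum;
    }

    Map<String, Long> snapshot() {
        return new HashMap<String, Long>(shards);
    }
}

Obviously a toy (increments only, no persistence, no failure handling), but it shows why the write path needs no read-before-write and why replicas only need to reconcile between themselves rather than per update.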
The relevant sections are 5 & 6 (around 3-4 pages). I would highly
recommend reviewing them.

Let me clarify some issues for the thread:

#580 (version vectors) and Cages (distributed locks via ZK; see
http://code.google.com/p/cages/) both require a read before every
write. If you read the Helland paper, you'll see he postulates that
idempotence is only required between partitions--not for every
individual update. #1072 coalesces commutative operations on each
partition (i.e. replica) and idempotently repairs them between
partitions / replicas, which is the shape the toy sketch above is
gesturing at.

The alternative approach proposed, inserting UUID columns into a row,
needs to repair every update between replicas--unless there is an
aggregation step that I'm not seeing. Not to mention, I'm with Ben in
that I'd rather not push the aggregation step back up to the client.

#1072 does provide a building block: distributed commutative
operations. Addition is just the most salient commutative operation,
but I can imagine many other commutative operations that people would
find useful to perform at scale.

#1072 does fit into the EC model. Granted, the level of consistency is
not tunable; it's fixed at CL.ONE writes. However, any use case that
relies on CL.ONE writes requires CL.ALL reads for strong consistency.
If you're willing to tolerate a certain amount of inconsistency, then
CL.ONE reads are fine. At Digg, we do CL.ONE writes and CL.ONE reads
and tolerate a certain amount of inconsistency.

And, as Sylvain brought up on the #1072 issue, we have a variant of
read repair called repair-on-write: on write, the current replica's
count is read and then written to the other replicas (a rough sketch
is in the P.S. at the bottom of this mail). It's not implemented in
the patch submitted to the issue, but it has been discussed and
implemented elsewhere.

-Kelvin

On Fri, Aug 13, 2010 at 8:49 AM, Benjamin Black <b...@b3k.us> wrote:
> On Fri, Aug 13, 2010 at 6:24 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> This is simply not an acceptable alternative and just can't be called
>>> handling it "well".
>>
>> What part is it handling poorly, at a technical level? This is almost
>> exactly what 1072 does internally -- we are concerned here with the
>> high write, low read volume case.
>>
>
> Requiring clients directly manage the counter rows in order to
> periodically compress or segment them. Yes, you can emulate the
> behavior. No, that is not handling it well.
>
>>> It is equivalent to "make the users do it", which
>>> is the case for almost anything.
>>
>> I strongly feel we should be in the business of providing building
>> blocks, not special cases on top of that. (But see below, I *do*
>> think the 580 version vectors is the kind of building block we want!)
>>
>
> I agree, 580 is really valuable and should be in. The problem for
> high write rate, distributed counters is the requirement of read
> before write inherent in such vector-based approaches. Am I missing
> some aspect of 580 that precludes that?
>
>>> The reasons #1072 is so valuable:
>>>
>>> 1) Does not require _any_ user action.
>>
>> This can be addressed at the library level. Just as our first stab at
>> ZK integration was a rather clunky patch; "cages" is better.
>>
>
> Certainly, but it would be hard to argue (and I am not) that the
> tightly synchronized behavior of ZK is a good match for Cassandra
> (mixing in Paxos could make for some neat options, but that's another
> debate...).
>
>>> 2) Does not change the EC-centric model of Cassandra.
>>
>> It does, though. 1072 is *not* a version vector-based approach --
>> that would be 580.
>> Read the 1072 design doc, if you haven't. (Thanks
>> to Kelvin for writing that up!)
>>
>
> Nor is Cassandra right now. I know 1072 isn't vector based, and I
> think that is in its favor _for this application_.
>
>> I'm referring in particular to reads requiring CL.ALL. (My
>> understanding is that in the previous design, a "master" replica was
>> chosen and was always written to first.) Both of these break "the
>> EC-centric model" and that is precisely the objection I made when I
>> said "ConsistencyLevel is not respected." I don't think this is
>> fixable in the 1072 approach. I would be thrilled to be wrong.
>>
>
> It is EC in that the total for a counter is unknown until resolved on
> read. Yes, it does not respect CL, but since it can only be used in 1
> way, I don't see that as a disadvantage.
>
>>>> The second is that the approach in 1072 resembles an entirely separate
>>>> system that happens to use part of Cassandra infrastructure -- the
>>>> thrift API, the MessagingService, the sstable format -- but isn't
>>>> really part of it. ConsistencyLevel is not respected, and special
>>>> cases abound to weld things in that don't fit, e.g. the AES/Streaming
>>>> business.
>>>
>>> Then let's find ways to make it as elegant as it can be. Ultimately,
>>> this functionality needs to be in Cassandra or users will simply
>>> migrate someplace else for this extremely common use case.
>>
>> This is what I've been pushing for. The version vector approach to
>> counting (i.e. 580 as opposed to 1072) is exactly the more elegant,
>> EC-centric approach that addresses a case that we *don't* currently
>> handle well (counters with a higher read volume than 1072).
>>
>
> Perhaps I missed something: does counting 580 require read before
> counter update (local to the node, not a client read)?
>
>
> b
>
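
P.S. As promised above, here's a rough sketch of the repair-on-write idea, building on the toy ShardedCounter at the top of this mail. Again, this is my own illustration with made-up names, not the patch on the issue:

import java.util.List;
import java.util.Map;

// Repair-on-write, sketched against the toy ShardedCounter above:
// after applying an increment locally, the replica reads back its own
// state and pushes it to the other replicas, which repair idempotently.
class RepairOnWriteCounter {
    private final ShardedCounter local;
    private final List<ShardedCounter> peers; // stand-ins for the other replicas

    RepairOnWriteCounter(ShardedCounter local, List<ShardedCounter> peers) {
        this.local = local;
        this.peers = peers;
    }

    void increment(long delta) {
        // 1. Apply the commutative update locally (no read of the total needed).
        local.increment(delta);
        // 2. Read the current local state back...
        Map<String, Long> localState = local.snapshot();
        // 3. ...and write it to the other replicas. In the variant described
        //    above only the local shard needs to be shipped; pushing the whole
        //    snapshot is just simpler here and is still idempotent.
        for (ShardedCounter peer : peers) {
            peer.repairFrom(localState);
        }
    }

    long read() {
        return local.total();
    }
}

The write path still never reads the counter's total; it only reads back the shard it just bumped and ships that to the peers, which merge it idempotently.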