I don't think that's bulletproof either. For instance, what if the two adds go to replica 1 but the delete to replica 2?
Bottom line (and this was discussed on the original delete-for-counters ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes are not fully commutative which makes them fragile.

On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
> as https://issues.apache.org/jira/browse/CASSANDRA-2101
> indicates, the problem with counter delete is in scenarios like the
> following:
>
> add 1, clock 100
> delete , clock 200
> add 2 , clock 300
>
> if the 1st and 3rd operations are merged in SStable compaction, then we
> have
>
> delete clock 200
> add 3, clock 300
>
> which shows wrong result.
>
> I think a relatively simple extension can be used to complete fix this
> issue: similar to ZooKeeper, we can prefix an "Epoch" number to the clock,
> so that
>
> 1) a delete operation increases future epoch number by 1
> 2) merging of delta adds can be between only deltas of the same epoch,
> deltas of older epoch are simply ignored during merging. merged result keeps
> the epoch number of the newest seen.
>
> other operations remain the same as current. note that the above 2 rules are
> only concerned with merging within the deltas on the leader, and not related
> to the replicated count, which is a simple final state, and observes the
> rule of "larger clock trumps". naturally the ordering rule is:
> epoch1.clock1 >> epoch2.clock2 iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 > clock2
> intuitively "epoch" can be seen as the serial number on a new "incarnation"
> of a counter.
>
> code change should be mostly localized to CounterColumn.reconcile(),
> although, if an update does not find existing entry in memtable, we need to
> go to sstable to fetch any possible epoch number, so
> compared to current write path, in the "no replicate-on-write" case, we need
> to add a read to sstable. but in the "replicate-on-write" case, we already
> read that, so it's no extra time cost. "no replicate-on-write" is not a
> very useful setup in reality anyway.
>
> does this sound a feasible way? if this works, expiring counter should
> also naturally work.
>
> Thanks
> Yang

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
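
For illustration only, the epoch rules in the quoted proposal can be sketched in a few lines of Python. This is not Cassandra code: the names Delta, Leader and merge are hypothetical, and the sketch assumes that a delete can be modelled as a zero-valued delta that opens the next epoch. It runs the add/delete/add sequence from the quoted message, and then a variant where the delete is issued through a node that never saw the adds, which is one possible reading of the replica 1 / replica 2 question above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """A counter delta tagged with a hypothetical (epoch, clock) version."""
    epoch: int
    clock: int
    value: int


class Leader:
    """Hands out epochs on one node (an assumption of this sketch)."""

    def __init__(self) -> None:
        self.epoch = 0

    def add(self, value: int, clock: int) -> Delta:
        return Delta(self.epoch, clock, value)

    def delete(self, clock: int) -> Delta:
        # Rule 1: a delete increases the epoch used by future operations.
        # Assumption: the delete itself is a zero-valued delta in the new epoch.
        self.epoch += 1
        return Delta(self.epoch, clock, 0)


def merge(a: Delta, b: Delta) -> Delta:
    """Rule 2: only deltas of the same epoch are summed; older epochs are ignored."""
    if a.epoch != b.epoch:
        return a if a.epoch > b.epoch else b
    return Delta(a.epoch, max(a.clock, b.clock), a.value + b.value)


# Scenario from the quoted message, all three operations on one leader.
leader = Leader()
add1 = leader.add(1, clock=100)
delete = leader.delete(clock=200)
add2 = leader.add(2, clock=300)

# Merging the 1st and 3rd operations first no longer resurrects the deleted add.
single_node = merge(merge(add1, add2), delete)
print(single_node.value)  # 2

# Variant: both adds go through one node, the delete through another,
# so the second add never learns about the new epoch.
node1, node2 = Leader(), Leader()
adds = merge(node1.add(1, clock=100), node1.add(2, clock=300))
split = merge(adds, node2.delete(clock=200))
print(split.value)  # 0: the add at clock 300 is discarded with the old epoch
```

In this sketch the single-leader case gives the expected 2 regardless of merge order, but the split case drops the later add because it still carries the old epoch. Whether that matches the exact failure the reply has in mind is an assumption; it only shows that the epoch prefix depends on every writer having seen the delete.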