As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates, the problem with counter deletes shows up in scenarios like the following:

    add 1, clock 100
    delete, clock 200
    add 2, clock 300

If the 1st and 3rd operations are merged in SSTable compaction, then we have

    delete, clock 200
    add 3, clock 300

which gives the wrong result: the counter reads 3 when it should read 2.

I think a relatively simple extension can be used to completely fix this issue: similar to ZooKeeper, we can prefix an "epoch" number to the clock, so that

1) a delete operation increases the epoch number of future operations by 1
2) delta adds can be merged only with deltas of the same epoch; deltas of an older epoch are simply ignored during merging. The merged result keeps the newest epoch number seen.

Other operations remain the same as they are currently. Note that the above 2 rules only concern merging of the deltas on the leader, and are not related to the replicated count, which is a simple final state and observes the rule of "larger clock trumps". Naturally the ordering rule is:

    epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)

Intuitively, "epoch" can be seen as the serial number of a new "incarnation" of a counter.

The code change should be mostly localized to CounterColumn.reconcile(). However, if an update does not find an existing entry in the memtable, we need to go to the sstables to fetch any possible epoch number. So compared to the current write path, in the "no replicate-on-write" case we need to add a read from the sstables. In the "replicate-on-write" case we already do that read, so there is no extra time cost. "No replicate-on-write" is not a very useful setup in reality anyway.

Does this sound like a feasible approach? If this works, expiring counters should also naturally work.

Thanks
Yang
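
P.S. To make the ordering rule and the merge rule concrete, here is a rough sketch in Java. The class EpochDelta, its fields, and the merge() helper are hypothetical names made up for illustration; they are not Cassandra's actual classes, and this is not what CounterColumn.reconcile() does today. Keeping the larger clock on a same-epoch merge is also an assumption on my part.

```java
// Hypothetical sketch of an epoch-prefixed counter delta (not Cassandra code).
final class EpochDelta implements Comparable<EpochDelta> {
    final long epoch; // bumped by 1 for operations that follow a delete
    final long clock; // the existing per-operation clock
    final long delta; // the amount added to the counter

    EpochDelta(long epoch, long clock, long delta) {
        this.epoch = epoch;
        this.clock = clock;
        this.delta = delta;
    }

    // Ordering rule: compare epochs first, then clocks within the same epoch.
    @Override
    public int compareTo(EpochDelta other) {
        int byEpoch = Long.compare(epoch, other.epoch);
        return byEpoch != 0 ? byEpoch : Long.compare(clock, other.clock);
    }

    // Merge rule: deltas are summed only within the same epoch;
    // a delta from an older epoch is simply ignored.
    static EpochDelta merge(EpochDelta a, EpochDelta b) {
        if (a.epoch != b.epoch) {
            return a.epoch > b.epoch ? a : b;
        }
        // Assumption: the merged delta keeps the larger of the two clocks.
        return new EpochDelta(a.epoch, Math.max(a.clock, b.clock), a.delta + b.delta);
    }
}
```

With the scenario above, the first add becomes (epoch 0, clock 100, +1) and, because the delete bumps the epoch, the later add becomes (epoch 1, clock 300, +2). Merging the two ignores the older-epoch delta and yields +2 rather than +3.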