As https://issues.apache.org/jira/browse/CASSANDRA-2101
indicates, the problem with counter deletes shows up in scenarios like the
following:

add 1, clock 100
delete, clock 200
add 2, clock 300

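If the 1st and 3rd operations are merged during SSTable compaction, then we
have
delete, clock 200
add 3, clock 300

which gives the wrong result: the counter reads 3, although the add of 1
came before the delete and the correct value is 2.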
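
To make the failure concrete, here is a small Python sketch of the scenario.
It is only a toy model, not Cassandra code: the tuple layout and the names
replay() and compact_adds() are made up for illustration, and a delete is
assumed to simply reset the count to 0.

```python
# Toy model: each operation is (kind, value, clock). Hypothetical layout.
ops = [("add", 1, 100), ("delete", 0, 200), ("add", 2, 300)]


def replay(operations):
    """Apply operations in clock order; a delete resets the count to 0."""
    total = 0
    for kind, value, _clock in sorted(operations, key=lambda op: op[2]):
        total = 0 if kind == "delete" else total + value
    return total


def compact_adds(first, second):
    """Merge two adds as in the scenario: sum the values, keep the newer clock."""
    return ("add", first[1] + second[1], max(first[2], second[2]))


expected = replay(ops)                           # replaying all three operations
merged = [ops[1], compact_adds(ops[0], ops[2])]  # delete@200, add 3@300
actual = replay(merged)                          # replaying the compacted form

print(expected, actual)
```

The add that predates the delete survives inside the merged delta, which is
why the compacted form no longer agrees with the original history.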


I think a relatively simple extension can completely fix this
issue: similar to ZooKeeper, we can prefix an "epoch" number to the clock,
so that
   1) a delete operation increases the epoch number used by future
operations by 1
   2) delta adds can be merged only with deltas of the same epoch;
deltas of an older epoch are simply ignored during merging. The merged
result keeps the newest epoch number seen.
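
A rough Python sketch of rule 2, as I understand it. The Delta class and
merge() are hypothetical names used only for illustration; this is not the
actual CounterColumn code, and it assumes the delete has already bumped the
epoch as in rule 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """A delta add tagged with the epoch it was written in (hypothetical)."""
    epoch: int
    clock: int
    value: int


def merge(a, b):
    """Merge two deltas: sum within one epoch, otherwise keep the newer epoch."""
    if a.epoch != b.epoch:
        # The delta from the older epoch is simply ignored.
        return a if a.epoch > b.epoch else b
    return Delta(a.epoch, max(a.clock, b.clock), a.value + b.value)


# The scenario from above: the delete at clock 200 moves the counter
# from epoch 0 to epoch 1, so the two adds carry different epochs.
before_delete = Delta(epoch=0, clock=100, value=1)
after_delete = Delta(epoch=1, clock=300, value=2)
compacted = merge(before_delete, after_delete)

# Two deltas of the same epoch still add up as they do today.
same_epoch = merge(after_delete, Delta(epoch=1, clock=400, value=5))
```

With this rule the compacted delta carries value 2 rather than 3, because the
pre-delete add belongs to an older epoch and is dropped.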

Other operations remain the same as they are now. Note that the above 2
rules concern only the merging of deltas on the leader, and are not related
to the replicated count, which is a simple final state and follows the
rule of "larger clock trumps". Naturally the ordering rule is: epoch1.clock1
> epoch2.clock2 iff epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
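
The ordering rule is just a lexicographic comparison on (epoch, clock). A
minimal Python sketch, where is_newer() is a made-up helper name and not an
existing Cassandra function:

```python
def is_newer(epoch1, clock1, epoch2, clock2):
    """True iff epoch1.clock1 > epoch2.clock2 under the rule stated above."""
    return epoch1 > epoch2 or (epoch1 == epoch2 and clock1 > clock2)


def is_newer_tuple(epoch1, clock1, epoch2, clock2):
    """Same rule, written as Python's built-in lexicographic tuple comparison."""
    return (epoch1, clock1) > (epoch2, clock2)


# A higher epoch wins even if its clock is smaller.
higher_epoch_wins = is_newer(2, 1, 1, 999)
# Within one epoch, the larger clock wins, as it does today.
larger_clock_wins = is_newer(1, 300, 1, 200)
```

Since the epoch is compared first, the existing "larger clock trumps" logic
is unchanged whenever both sides are in the same epoch.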

Intuitively, "epoch" can be seen as the serial number of a new "incarnation"
of a counter.


The code change should be mostly localized to CounterColumn.reconcile(),
although if an update does not find an existing entry in the memtable, we
need to go to the sstables to fetch any existing epoch number. So, compared
to the current write path, the "no replicate-on-write" case needs an extra
sstable read. In the "replicate-on-write" case we already do that read, so
there is no extra time cost. "No replicate-on-write" is not a very useful
setup in practice anyway.


Does this sound like a feasible approach? If this works, expiring counters
should also work naturally.


Thanks
Yang
