I think this approach also works for your scenario. I thought the issue was only concerned with merging within the same leader, but you pointed out that a similar merging happens between leaders too; now I see that the same rules on epoch numbers also apply to inter-leader data merging. Specifically, in your case:

everyone starts with an epoch of 0 (they should be the same; if not, it
still works, we just consider them to be representing different time
snapshots of the same counter state)

node A add 1, clock: 0.100 (epoch = 0, clock number = 100)
node A delete, clock: 0.200
node B add 2, clock: 0.300

node A gets B's state: add 2, clock 0.300, but rejects it, because A has
already produced a delete with an epoch of 0, so A considers epoch 0
already ended; it won't accept any replicated state with epoch < 1.

node B gets A's delete 0.200; it zeros its own count of "2", and updates
its future expected epoch to 1. At this time, the state of the system is:

node A, expected epoch = 1: [A:nil] [B:nil]
node B: same

Let's say we have the following further writes:

node B add 3, clock 1.400
node A add 4, clock 1.500
node B receives A's add 4, and updates its copy of A
node A receives B's add 3, and updates its copy of B

Then the state is:

node A, expected epoch == 1: [A:4 clock=1.500] [B:3 clock=1.400]
node B: same
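To make the epoch.clock comparisons above concrete, here is a rough Java sketch of an epoch-prefixed clock using the ordering rule from my original mail (quoted at the bottom): compare epochs first, then clocks. The class and field names are made up for illustration; this is not Cassandra's actual clock type:

    // Hypothetical epoch-prefixed clock; names are illustrative only.
    public final class EpochClock implements Comparable<EpochClock> {
        public final long epoch; // bumped by one on every delete
        public final long clock; // the usual logical clock

        public EpochClock(long epoch, long clock) {
            this.epoch = epoch;
            this.clock = clock;
        }

        // epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2
        //                              || (epoch1 == epoch2 && clock1 > clock2)
        public int compareTo(EpochClock other) {
            int byEpoch = Long.compare(epoch, other.epoch);
            return byEpoch != 0 ? byEpoch : Long.compare(clock, other.clock);
        }

        public String toString() { return epoch + "." + clock; }
    }

Under this ordering, any epoch-0 value sorts before 1.400 no matter how large its clock is, which is what makes A's rejection of the stale add 2 in the walkthrough deterministic.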
Generally, I think it should be complete if we add the following rule for inter-leader replication. Each leader keeps a var in memory (and also persists it to the sstable when flushing), expected_epoch, initially set to 0. Node P does:

    on receiving updates from node Q:
        if Q.expected_epoch > P.expected_epoch:
            /** an epoch bump inherently means a previous delete, which we
                probably missed, so we need to apply the delete; a delete is
                global to all leaders, so apply it on all my replicas **/
            for all leaders in my vector:
                count = nil
            P.expected_epoch = Q.expected_epoch
        if Q.expected_epoch == P.expected_epoch:
            update P's copy of Q according to standard rules
        /** if Q.expected_epoch < P.expected_epoch, that means Q is less
            up to date than us; just ignore **/

    replicate_on_write(to Q):
        if P.operation == delete:
            P.expected_epoch++
            set all my copies of all leaders to nil
        send to Q (P.total, P.expected_epoch)
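In Java, the same logic might look roughly like this. It is only a sketch of the pseudocode above; every name here (CounterLeaderState, onReceive, onWrite) is a hypothetical stand-in, not an actual Cassandra class:

    import java.util.HashMap;
    import java.util.Map;

    class CounterLeaderState {
        long expectedEpoch = 0; // persisted to the sstable on flush
        // my copy of each leader's count; null means nil
        final Map<String, Long> copies = new HashMap<String, Long>();

        // "on receiving updates from node Q"
        void onReceive(String leaderQ, long qExpectedEpoch, long qCount) {
            if (qExpectedEpoch > expectedEpoch) {
                // an epoch bump means a delete we missed; a delete is global
                // to all leaders, so nil out every copy we hold
                for (String leader : copies.keySet())
                    copies.put(leader, null);
                expectedEpoch = qExpectedEpoch;
            }
            if (qExpectedEpoch == expectedEpoch) {
                copies.put(leaderQ, qCount); // standard reconcile rules apply here
            }
            // qExpectedEpoch < expectedEpoch: Q is behind us, just ignore
        }

        // "replicate_on_write(to Q)": bump the epoch on delete, nil my own
        // copies, then ship (total, expectedEpoch) to Q
        long[] onWrite(boolean isDelete, long myTotal) {
            if (isDelete) {
                expectedEpoch++;
                for (String leader : copies.keySet())
                    copies.put(leader, null);
            }
            return new long[] { myTotal, expectedEpoch };
        }
    }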
Overall, I don't think delete being non-commutative is a fundamental blocker: regular columns are also not commutative, yet we achieve a stable result no matter what order they are applied in, because of the ordering rule used in reconciliation. Here we just need to find a similar ordering rule, and the epoch thing could be a step in this direction.

Thanks
Yang

On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> I don't think that's bulletproof either. For instance, what if the
> two adds go to replica 1 but the delete to replica 2?
>
> Bottom line (and this was discussed on the original
> delete-for-counters ticket,
> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> are not fully commutative, which makes them fragile.
>
> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
> > the problem with counter delete is in scenarios like the following:
> >
> > add 1, clock 100
> > delete, clock 200
> > add 2, clock 300
> >
> > If the 1st and 3rd operations are merged in sstable compaction, then
> > we have:
> >
> > delete, clock 200
> > add 3, clock 300
> >
> > which shows a wrong result.
> >
> > I think a relatively simple extension can be used to completely fix
> > this issue: similar to ZooKeeper, we can prefix an "epoch" number to
> > the clock, so that
> >
> > 1) a delete operation increases the future epoch number by 1
> > 2) merging of delta adds can happen only between deltas of the same
> > epoch; deltas of an older epoch are simply ignored during merging, and
> > the merged result keeps the epoch number of the newest seen.
> >
> > Other operations remain the same as current. Note that the above 2
> > rules are only concerned with merging within the deltas on the leader,
> > and are not related to the replicated count, which is a simple final
> > state and observes the rule of "larger clock trumps". Naturally, the
> > ordering rule is: epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2 ||
> > (epoch1 == epoch2 && clock1 > clock2). Intuitively, "epoch" can be
> > seen as the serial number on a new "incarnation" of a counter.
> >
> > The code change should be mostly localized to CounterColumn.reconcile(),
> > although if an update does not find an existing entry in the memtable,
> > we need to go to the sstable to fetch any possible epoch number. So,
> > compared to the current write path, in the "no replicate-on-write" case
> > we need to add a read of the sstable; but in the "replicate-on-write"
> > case we already read that, so there is no extra time cost. ("no
> > replicate-on-write" is not a very useful setup in reality anyway.)
> >
> > Does this sound like a feasible way? If this works, expiring counters
> > should also naturally work.
> >
> > Thanks
> > Yang
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
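PS: to make rule 2 from my original mail (quoted above) concrete, here is a rough sketch of the epoch-aware delta merge at compaction time. Delta is a made-up stand-in, not Cassandra's actual on-disk representation:

    import java.util.List;

    class EpochMerge {
        static final class Delta {
            final long epoch, clock, value;
            Delta(long epoch, long clock, long value) {
                this.epoch = epoch; this.clock = clock; this.value = value;
            }
        }

        // rule 2: sum only the deltas of the newest epoch seen; deltas of
        // older epochs are dropped, and the result keeps the newest epoch
        static Delta merge(List<Delta> deltas) {
            long newestEpoch = 0, maxClock = 0, sum = 0;
            for (Delta d : deltas)
                newestEpoch = Math.max(newestEpoch, d.epoch);
            for (Delta d : deltas) {
                if (d.epoch == newestEpoch) {
                    sum += d.value;
                    maxClock = Math.max(maxClock, d.clock);
                }
            }
            return new Delta(newestEpoch, maxClock, sum);
        }
    }

With the CASSANDRA-2101 example, the delete would bump the epoch, so the first add would carry 0.100 and the second 1.300; merging them would yield "add 2, epoch 1, clock 300" rather than the wrong "add 3".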