OK, I think it's better to understand it this way; then it is really simple and intuitive:

My proposed way of updating counters can simply be seen as a combination of a regular column plus the current counter columns:

  regular column: [ value: "wipes out every bucket to nil", clock: epoch number ]

Within each epoch, counter updates then work exactly as currently implemented.
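To make that concrete, here is a minimal sketch of the model in plain Python (illustrative only; the EpochCounter class and its method names are invented for this mail and are not Cassandra code): a delete behaves like a regular-column write that nils every bucket and bumps the epoch, and adds within one epoch merge exactly as counters do today.

class EpochCounter:
    def __init__(self):
        self.epoch = 0        # plays the role of the regular column's clock
        self.buckets = {}     # leader id -> (partial count, clock)

    def add(self, leader, delta, clock):
        count, _ = self.buckets.get(leader, (0, 0))
        self.buckets[leader] = (count + delta, clock)

    def delete(self):
        # the "regular column" write: wipe every bucket to nil, bump the epoch
        self.epoch += 1
        self.buckets = {}

    def total(self):
        return sum(count for count, _ in self.buckets.values())

c = EpochCounter()
c.add("A", 1, 100)
c.delete()               # epoch 0 ends here
c.add("A", 2, 300)       # this add lives in epoch 1, untouched by the delete
print(c.epoch, c.total())   # 1 2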
On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
> I think this approach also works for your scenario:
>
> I thought that the issue was only concerned with merging within the same
> leader, but you pointed out that a similar merging happens between leaders
> too. Now I see that the same rules on epoch numbers also apply to
> inter-leader data merging, specifically in your case:
>
> Everyone starts with an epoch of 0. (They should be the same; if not, it
> still works, we just consider them to be representing different time
> snapshots of the same counter state.)
>
> node A add 1      clock: 0.100   (epoch = 0, clock number = 100)
> node A delete     clock: 0.200
> node B add 2      clock: 0.300
>
> node A gets B's state: add 2, clock 0.300, but rejects it because A has
> already produced a delete with epoch 0, so A considers epoch 0 already
> ended; it won't accept any replicated state with epoch < 1.
>
> node B gets A's delete 0.200; it zeros its own count of "2" and updates
> its future expected epoch to 1.
>
> At this time, the state of the system is:
> node A: expected epoch = 1, [A:nil] [B:nil]
> same for node B
>
> Let's say we have the following further writes:
>
> node B adds 3     clock 1.400
> node A adds 4     clock 1.500
>
> node B receives A's add 4 and updates its copy of A.
> node A receives B's add 3 and updates its copy of B.
>
> Then the state is:
> node A: expected epoch = 1, [A:4 clock=500] [B:3 clock=400]
> node B: same
>
> Generally, I think the scheme is complete if we add the following rule for
> inter-leader replication:
>
> Each leader keeps a variable in memory (also persisted to the sstable when
> flushing), expected_epoch, initially set to 0.
>
> node P does:
>
>   on receiving updates from node Q:
>     if Q.expected_epoch > P.expected_epoch:
>       /** an epoch bump inherently means a previous delete, which
>           we probably missed, so we need to apply the delete;
>           a delete is global to all leaders, so apply it on all my
>           replicas **/
>       for all leaders in my vector:
>         count = nil
>       P.expected_epoch = Q.expected_epoch
>     if Q.expected_epoch == P.expected_epoch:
>       update P's copy of Q according to the standard rules
>     /** if Q.expected_epoch < P.expected_epoch, that means Q is less
>         up to date than us, so just ignore it **/
>
>   replicate_on_write(to Q):
>     if P.operation == delete:
>       P.expected_epoch++
>       set all my copies of all leaders to nil
>     send to Q (P.total, P.expected_epoch)
>
> Overall, I don't think deletes not being commutative is a fundamental
> blocker: regular columns are also not commutative, yet we achieve a stable
> result no matter what order they are applied in, because of the ordering
> rule used in reconciliation; here we just need to find a similar ordering
> rule. The epoch idea could be a step in this direction.
>
> Thanks
> Yang
>
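A rough, runnable transcription of the inter-leader rules quoted above, in plain Python (illustrative only; the Leader class, its method names, and the wire format are invented here and are not Cassandra's replication code):

class Leader:
    def __init__(self, name):
        self.name = name
        self.expected_epoch = 0
        self.copies = {}              # leader name -> (total, clock)

    def write(self, delta, clock, delete=False):
        if delete:
            # a delete bumps the epoch and nils this leader's copies of everyone
            self.expected_epoch += 1
            self.copies = {}
        else:
            total, _ = self.copies.get(self.name, (0, 0))
            self.copies[self.name] = (total + delta, clock)
        # replicate_on_write: ship (total, clock, expected_epoch) to peers
        tot, clk = self.copies.get(self.name, (0, 0))
        return (self.name, tot, clk, self.expected_epoch)

    def receive(self, update):
        src, total, clock, epoch = update
        if epoch > self.expected_epoch:
            # an epoch bump implies a delete we missed: apply it to all replicas
            self.copies = {}
            self.expected_epoch = epoch
        if epoch == self.expected_epoch:
            # standard rule within an epoch: the larger clock trumps
            _, old_clock = self.copies.get(src, (0, 0))
            if clock > old_clock:
                self.copies[src] = (total, clock)
        # epoch < expected_epoch: the sender is behind us, ignore

    def total(self):
        return sum(t for t, _ in self.copies.values())

# replay of the A/B scenario above
A, B = Leader("A"), Leader("B")
A.write(1, 100)
delete = A.write(0, 200, delete=True)
add2 = B.write(2, 300)
A.receive(add2)        # rejected: epoch 0 has already ended on A
B.receive(delete)      # B zeros its count and moves to epoch 1
add3 = B.write(3, 400)
add4 = A.write(4, 500)
B.receive(add4)
A.receive(add3)
print(A.total(), B.total(), A.expected_epoch)   # 7 7 1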
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> I don't think that's bulletproof either. For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original
>> delete-for-counters ticket,
>> https://issues.apache.org/jira/browse/CASSANDRA-2101): counter deletes
>> are not fully commutative, which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
>> > the problem with counter deletes is in scenarios like the following:
>> >
>> > add 1, clock 100
>> > delete, clock 200
>> > add 2, clock 300
>> >
>> > If the 1st and 3rd operations are merged in sstable compaction, then
>> > we have
>> >
>> > delete, clock 200
>> > add 3, clock 300
>> >
>> > which shows the wrong result.
>> >
>> > I think a relatively simple extension can completely fix this issue:
>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock,
>> > so that
>> >
>> > 1) a delete operation increases the future epoch number by 1;
>> > 2) merging of delta adds can only happen between deltas of the same
>> > epoch, and deltas of an older epoch are simply ignored during merging.
>> > The merged result keeps the newest epoch number seen.
>> >
>> > Other operations remain the same as currently. Note that the above two
>> > rules are only concerned with merging the deltas on the leader, and are
>> > not related to the replicated count, which is a simple final state and
>> > observes the rule of "larger clock trumps". Naturally, the ordering
>> > rule is: epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2 ||
>> > (epoch1 == epoch2 && clock1 > clock2). Intuitively, "epoch" can be seen
>> > as the serial number of a new "incarnation" of a counter.
>> >
>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> > although, if an update does not find an existing entry in the memtable,
>> > we need to go to the sstable to fetch any possible epoch number. So,
>> > compared to the current write path, in the "no replicate-on-write" case
>> > we need to add a read from the sstable; but in the "replicate-on-write"
>> > case we already do that read, so there is no extra time cost.
>> > "No replicate-on-write" is not a very useful setup in reality anyway.
>> >
>> > Does this sound like a feasible way? If this works, expiring counters
>> > should also work naturally.
>> >
>> > Thanks
>> > Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
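For completeness, the two merge rules in the quoted proposal above can be sketched in a few lines of Python (illustrative only; merge_deltas, newer, and the tuple layout are made up for this mail and are not CounterColumn.reconcile()):

def merge_deltas(deltas):
    # deltas: list of (epoch, clock, value); compaction only sums deltas from
    # the newest epoch seen and ignores older-epoch deltas (rules 1 and 2)
    top_epoch = max(e for e, _, _ in deltas)
    kept = [(e, c, v) for e, c, v in deltas if e == top_epoch]
    return (top_epoch, max(c for _, c, _ in kept), sum(v for _, _, v in kept))

def newer(a, b):
    # epoch-prefixed clock ordering: epoch1.clock1 > epoch2.clock2 iff
    # epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2);
    # Python's tuple comparison is exactly this lexicographic rule
    return a > b

# the CASSANDRA-2101 case: add 1 @ 0.100, delete (epoch bump), add 2 @ 1.300;
# compaction can no longer fold the pre-delete add into the post-delete one
print(merge_deltas([(0, 100, 1), (1, 300, 2)]))   # (1, 300, 2)
print(newer((1, 300), (0, 200)))                  # True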