Who assigns those epoch numbers? You need all nodes to agree on the epoch number somehow for this to work, but then how do you maintain that agreement in a partition-tolerant distributed system?
I may have missed some parts of your proposal, but let me consider a scenario that we have to be able to handle: take two nodes A and B (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose you do a +2 increment on c that both nodes get. Now let's say you have a network split and the connection between your two data centers fails. In DCA you delete c; only A gets it. In DCB you do more increments on c (say +3); only B gets it. The partition can last for hours. For deletion to work, we would need that whenever the network partition is resolved, both nodes eventually agree on the value 3 (i.e., only the second increment). I don't see how you could assign epoch numbers or anything else to fix that.

--
Sylvain

On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
> ok, I think it's better to understand it this way, then it is really
> simple and intuitive:
> my proposed way of counter update can simply be seen as a combination of
> regular columns + current counter columns:
> regular column: [ value: "wipes out every bucket to nil", clock: epoch
> number ]
> then within each epoch, counter updates work as currently implemented
>
>
> On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
>>
>> I think this approach also works for your scenario:
>> I thought that the issue was only concerned with merging within the same
>> leader, but you pointed out that a similar merging happens between
>> leaders too. Now I see that the same rules on epoch numbers also apply
>> to inter-leader data merging, specifically in your case:
>>
>> everyone starts with an epoch of 0 (they should be the same; if not, it
>> also works, we just consider them to be representing different time
>> snapshots of the same counter state)
>> node A add 1, clock: 0.100 (epoch = 0, clock number = 100)
>> node A delete, clock: 0.200
>> node B add 2, clock: 0.300
>> node A gets B's state: add 2, clock 0.300, but rejects it because A has
>> already produced a delete with epoch 0, so A considers epoch 0 already
>> ended; it won't accept any replicated state with epoch < 1.
>> node B gets A's delete 0.200, zeros its own count of "2", and updates
>> its future expected epoch to 1.
>> at this time, the state of the system is:
>> node A: expected epoch = 1, [A:nil] [B:nil]
>> same for node B
>>
>>
>> let's say we have the following further writes:
>> node B adds 3, clock 1.400
>> node A adds 4, clock 1.500
>> node B receives A's add 4 and updates its copy of A
>> node A receives B's add 3 and updates its copy of B
>>
>> then the state is:
>> node A: expected epoch == 1, [A:4 clock=500] [B:3 clock=400]
>> node B: same
>>
>>
>> generally I think it should be complete if we add the following rule for
>> inter-leader replication:
>> each leader keeps a var in memory (and also persists it to sstable when
>> flushing), expected_epoch, initially set to 0
>>
>> node P does:
>>     on receiving updates from node Q:
>>         if Q.expected_epoch > P.expected_epoch:
>>             /** an epoch bump inherently means a previous delete, which
>>                 we probably missed, so we need to apply the delete.
>>                 a delete is global to all leaders, so apply it on all my
>>                 replicas **/
>>             for all leaders in my vector:
>>                 count = nil
>>             P.expected_epoch = Q.expected_epoch
>>         if Q.expected_epoch == P.expected_epoch:
>>             update P's copy of Q according to standard rules
>>         /** if Q.expected_epoch < P.expected_epoch, that means Q is less
>>             up to date than us, just ignore **/
>>
>>     replicate_on_write(to Q):
>>         if P.operation == delete:
>>             P.expected_epoch++
>>             set all my copies of all leaders to nil
>>         send to Q (P.total, P.expected_epoch)
>>
>>
>> overall I don't think deletes being non-commutative is a fundamental
>> blocker: regular columns are also not commutative, yet we achieve a
>> stable result no matter what order they are applied in, because of the
>> ordering rule used in reconciliation; here we just need to find a
>> similar ordering rule. the epoch thing could be a step in this
>> direction.
>>
>> Thanks
>> Yang
>>
>>
>>
>> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> I don't think that's bulletproof either. For instance, what if the
>>> two adds go to replica 1 but the delete to replica 2?
>>>
>>> Bottom line (and this was discussed on the original
>>> delete-for-counters ticket,
>>> https://issues.apache.org/jira/browse/CASSANDRA-2101): counter deletes
>>> are not fully commutative, which makes them fragile.
>>>
>>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
>>> > the problem with counter deletes is in scenarios like the following:
>>> > add 1, clock 100
>>> > delete, clock 200
>>> > add 2, clock 300
>>> > if the 1st and 3rd operations are merged in SSTable compaction, then
>>> > we have
>>> > delete, clock 200
>>> > add 3, clock 300
>>> > which shows a wrong result.
>>> >
>>> > I think a relatively simple extension can be used to completely fix
>>> > this issue: similar to ZooKeeper, we can prefix an "epoch" number to
>>> > the clock, so that
>>> > 1) a delete operation increases the future epoch number by 1
>>> > 2) merging of delta adds can only happen between deltas of the same
>>> > epoch; deltas of an older epoch are simply ignored during merging,
>>> > and the merged result keeps the newest epoch number seen.
>>> > other operations remain the same as currently. note that the above 2
>>> > rules are only concerned with merging of the deltas on the leader,
>>> > and not with the replicated count, which is a simple final state and
>>> > observes the rule of "larger clock trumps".
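To make rules 1) and 2) above concrete, here is a minimal sketch, in Java, of how delta merging with an epoch-prefixed clock could look. The Delta class and merge() method are hypothetical names for illustration only; this is not Cassandra's actual CounterColumn or compaction code, just a sketch of the proposal being discussed.

    // Hypothetical sketch of epoch-prefixed delta merging (illustration only,
    // not Cassandra code). Rule 1: a delete starts a new epoch. Rule 2: only
    // deltas of the same epoch are summed; a delta from an older epoch is
    // dropped, and the merged result keeps the newest epoch seen.
    final class Delta {
        final long epoch;
        final long clock;
        final long count;   // increment amount, summed within one epoch

        Delta(long epoch, long clock, long count) {
            this.epoch = epoch;
            this.clock = clock;
            this.count = count;
        }

        // Merge two deltas of the same counter, as compaction would do.
        static Delta merge(Delta a, Delta b) {
            if (a.epoch == b.epoch)
                return new Delta(a.epoch, Math.max(a.clock, b.clock), a.count + b.count);
            // Different epochs: ignore the delta from the older epoch.
            return a.epoch > b.epoch ? a : b;
        }
    }

Applied to the example at the start of Yang's original message (add 1 at epoch 0 clock 100, delete at clock 200 bumping the epoch, add 2 at epoch 1 clock 300), merging the two adds would keep only the epoch-1 delta, so compaction could no longer let the pre-delete increment reappear.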
>>> > naturally the ordering rule is: epoch1.clock1 > epoch2.clock2 iff
>>> > epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
>>> > intuitively, the "epoch" can be seen as the serial number of a new
>>> > "incarnation" of the counter.
>>> >
>>> > the code change should be mostly localized to CounterColumn.reconcile(),
>>> > although, if an update does not find an existing entry in the memtable,
>>> > we need to go to the sstable to fetch any possible epoch number. so,
>>> > compared to the current write path, in the "no replicate-on-write" case
>>> > we need to add a read of the sstable; in the "replicate-on-write" case
>>> > we already do that read, so there is no extra time cost. "no
>>> > replicate-on-write" is not a very useful setup in reality anyway.
>>> >
>>> > does this sound like a feasible way? if this works, expiring counters
>>> > should also naturally work.
>>> >
>>> > Thanks
>>> > Yang
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
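For reference, the ordering rule Yang states above (compare epochs first, then clocks) might be sketched as below. EpochClock and its comparator are made-up names for illustration; the real change would live in CounterColumn.reconcile(), and this sketch does not claim to reflect that code.

    // Hypothetical sketch of the "epoch1.clock1 > epoch2.clock2" ordering rule
    // (illustration only, not the actual CounterColumn.reconcile() code).
    import java.util.Comparator;

    final class EpochClock {
        final long epoch;   // bumped by a delete: a new "incarnation" of the counter
        final long clock;   // ordinary counter clock within that epoch

        EpochClock(long epoch, long clock) {
            this.epoch = epoch;
            this.clock = clock;
        }

        // epoch1.clock1 > epoch2.clock2 iff
        //     epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static final Comparator<EpochClock> ORDER =
            Comparator.comparingLong((EpochClock ec) -> ec.epoch)
                      .thenComparingLong(ec -> ec.clock);
    }

With this ordering, a value written after a delete (higher epoch) wins over anything from before the delete even if the older value carried a larger clock, e.g. EpochClock.ORDER.compare(new EpochClock(1, 400), new EpochClock(0, 999)) > 0.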