Patch is at https://issues.apache.org/jira/browse/CASSANDRA-2774
some coding is messy and only intended for demonstration; we could refine it
after we agree this is a feasible way to go.

Thanks
Yang

On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> Who assigns those epoch numbers?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition-tolerant distributed
> system?
>
> I may have missed some parts of your proposal, but let me consider a
> scenario that we have to be able to handle: consider two nodes A and B
> (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose
> you do a +2 increment on c that both nodes get. Now let's say you have a
> network split and the connection between your two data centers fails. In
> DCA you delete c; only A gets it. In DCB you do more increments on c (say
> +3); only B gets it. The partition can last for hours. For deletion to
> work, we would need that whenever the network partition is resolved, both
> nodes eventually agree on the value 3 (i.e., only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> > simple and intuitive:
> > my proposed way of counter update can simply be seen as a combination
> > of regular columns + the current counter columns:
> >     regular column : [ value: "wipes out every bucket to nil", clock: epoch number ]
> > then within each epoch, counter updates work as currently implemented
> >
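
For illustration, here is a minimal sketch of the "regular column + counter
column" model quoted just above, with invented names (this is not the
CASSANDRA-2774 patch): the epoch plays the role of the regular column's
delete marker and wipes every bucket to nil, and within a single epoch the
per-leader buckets merge roughly the way counter shards do today (simplified
here to "larger partial count wins" per leader).

    // Hypothetical illustration only, not the CASSANDRA-2774 patch.
    import java.util.HashMap;
    import java.util.Map;

    class EpochScopedCounter {
        long epoch = 0;                                     // bumped by each delete
        final Map<String, Long> buckets = new HashMap<>();  // leader -> partial count

        void add(String leader, long delta) {
            buckets.merge(leader, delta, Long::sum);        // normal counter add
        }

        void delete() {
            epoch++;                                        // new incarnation of the counter
            buckets.clear();                                // wipe every bucket to nil
        }

        void mergeFrom(EpochScopedCounter other) {
            if (other.epoch > epoch) {                      // their delete is newer: apply it
                epoch = other.epoch;
                buckets.clear();
            }
            if (other.epoch == epoch) {                     // same epoch: merge as usual
                // Simplified: real counter shards carry clocks and "larger clock
                // trumps"; here we just keep the larger partial count per leader.
                other.buckets.forEach((leader, c) -> buckets.merge(leader, c, Long::max));
            }
            // other.epoch < epoch: a stale incarnation, ignored entirely
        }

        long total() {
            return buckets.values().stream().mapToLong(Long::longValue).sum();
        }
    }
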
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
> >> I think this approach also works for your scenario:
> >> I thought the issue was only concerned with merging within the same
> >> leader, but you pointed out that a similar merging happens between
> >> leaders too; now I see that the same rules on epoch numbers also apply
> >> to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with an epoch of 0 (they should be the same; if not, it
> >> also works, we just consider them to be representing different time
> >> snapshots of the same counter state)
> >>     node A add 1     clock: 0.100  (epoch = 0, clock number = 100)
> >>     node A delete    clock: 0.200
> >>     node B add 2     clock: 0.300
> >>     node A gets B's state (add 2, clock 0.300) but rejects it, because
> >>         A has already produced a delete with epoch 0, so A considers
> >>         epoch 0 already ended and won't accept any replicated state
> >>         with epoch < 1.
> >>     node B gets A's delete 0.200: it zeros its own count of "2" and
> >>         updates its future expected epoch to 1.
> >>
> >> at this time, the state of the system is:
> >>     node A: expected epoch = 1, [A:nil] [B:nil]
> >>     same for node B
> >>
> >> let's say we have the following further writes:
> >>     node B add 3     clock 1.400
> >>     node A add 4     clock 1.500
> >>     node B receives A's add 4, node B updates its copy of A
> >>     node A receives B's add 3, updates its copy of B
> >>
> >> then the state is:
> >>     node A: expected epoch == 1, [A:4 clock=500] [B:3 clock=400]
> >>     node B: same
> >>
> >> generally I think it should be complete if we add the following rule
> >> for inter-leader replication: each leader keeps a var in memory (and
> >> also persists it to sstable when flushing), expected_epoch, initially
> >> set to 0.
> >>
> >> node P does:
> >>     on receiving updates from node Q:
> >>         if Q.expected_epoch > P.expected_epoch
> >>             /** an epoch bump inherently means a previous delete, which
> >>                 we probably missed, so we need to apply the delete;
> >>                 a delete is global to all leaders, so apply it on all
> >>                 my replicas **/
> >>             for all leaders in my vector
> >>                 count = nil
> >>             P.expected_epoch = Q.expected_epoch
> >>         if Q.expected_epoch == P.expected_epoch
> >>             update P's copy of Q according to standard rules
> >>         /** if Q.expected_epoch < P.expected_epoch, that means Q is
> >>             less up to date than us, just ignore **/
> >>
> >>     replicate_on_write(to Q):
> >>         if P.operation == delete
> >>             P.expected_epoch++
> >>             set all my copies of all leaders to nil
> >>         send to Q (P.total, P.expected_epoch)
> >>
> >> overall I don't think deletes not being commutative is a fundamental
> >> blocker: regular columns are also not commutative, yet we achieve a
> >> stable result no matter what order they are applied in, because of the
> >> ordering rule used in reconciliation; here we just need to find a
> >> similar ordering rule. the epoch thing could be a step in this
> >> direction.
> >>
> >> Thanks
> >> Yang
> >>
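
To make the expected_epoch rule above concrete, here is a self-contained toy
with invented names; it simplifies the real counter design (whole per-leader
state is exchanged instead of the separate delta / replicated-count paths) and
just replays the A/B walkthrough quoted above.

    // Illustration only: toy of the expected_epoch rule, not real Cassandra code.
    import java.util.HashMap;
    import java.util.Map;

    public class EpochReplicationDemo {
        static class Node {
            final String name;
            long expectedEpoch = 0;                              // bumped by every delete
            final Map<String, long[]> copies = new HashMap<>();  // leader -> {clock, count}

            Node(String name) { this.name = name; }

            void add(long clock, long delta) {
                long[] mine = copies.computeIfAbsent(name, k -> new long[]{0, 0});
                mine[0] = clock;
                mine[1] += delta;
            }

            void delete() {
                expectedEpoch++;                             // the delete ends the current epoch
                copies.clear();                              // set all copies of all leaders to nil
            }

            void receiveFrom(Node q) {
                if (q.expectedEpoch > expectedEpoch) {       // we missed a delete: apply it
                    expectedEpoch = q.expectedEpoch;
                    copies.clear();
                }
                if (q.expectedEpoch == expectedEpoch) {      // same epoch: larger clock trumps
                    q.copies.forEach((leader, theirs) -> {
                        long[] ours = copies.get(leader);
                        if (ours == null || theirs[0] > ours[0])
                            copies.put(leader, theirs.clone());
                    });
                }
                // q.expectedEpoch < expectedEpoch: q is behind us, ignore its state
            }

            long total() {
                return copies.values().stream().mapToLong(c -> c[1]).sum();
            }
        }

        public static void main(String[] args) {
            Node a = new Node("A"), b = new Node("B");
            a.add(100, 1);       // node A: add 1, clock 0.100
            a.delete();          // node A: delete at 0.200, modeled as the epoch bump to 1
            b.add(300, 2);       // node B: add 2, clock 0.300 (B is still in epoch 0)
            a.receiveFrom(b);    // rejected: B's state is from the ended epoch 0
            b.receiveFrom(a);    // B zeros its count of 2 and moves to epoch 1
            b.add(400, 3);       // node B: add 3, clock 1.400
            a.add(500, 4);       // node A: add 4, clock 1.500
            a.receiveFrom(b);
            b.receiveFrom(a);
            System.out.println(a.total() + " == " + b.total());   // both converge on 7
        }
    }

Running it prints "7 == 7", matching the [A:4] [B:3] state both nodes reach at
the end of the walkthrough.
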
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>> I don't think that's bulletproof either. For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative, which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
> >>> > the problem with counter deletes is in scenarios like the following:
> >>> >     add 1,   clock 100
> >>> >     delete,  clock 200
> >>> >     add 2,   clock 300
> >>> > if the 1st and 3rd operations are merged in SSTable compaction, then
> >>> > we have
> >>> >     delete,  clock 200
> >>> >     add 3,   clock 300
> >>> > which shows the wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to completely fix
> >>> > this issue: similar to ZooKeeper, we can prefix an "Epoch" number to
> >>> > the clock, so that
> >>> > 1) a delete operation increases the future epoch number by 1
> >>> > 2) merging of delta adds can happen only between deltas of the same
> >>> >    epoch; deltas of an older epoch are simply ignored during
> >>> >    merging, and the merged result keeps the newest epoch number seen.
> >>> > other operations remain the same as current.
> >>> > note that the above 2 rules are only concerned with merging the
> >>> > deltas on the leader, and not with the replicated count, which is a
> >>> > simple final state and observes the rule of "larger clock trumps".
> >>> > naturally the ordering rule is:
> >>> >     epoch1.clock1 > epoch2.clock2 iff
> >>> >         epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
> >>> > intuitively, "epoch" can be seen as the serial number of a new
> >>> > "incarnation" of a counter.
> >>> >
> >>> > code changes should be mostly localized to CounterColumn.reconcile(),
> >>> > although, if an update does not find an existing entry in the
> >>> > memtable, we need to go to the sstable to fetch any possible epoch
> >>> > number, so compared to the current write path, in the "no
> >>> > replicate-on-write" case we need to add a read of the sstable. but
> >>> > in the "replicate-on-write" case we already read that, so there is
> >>> > no extra time cost. "no replicate-on-write" is not a very useful
> >>> > setup in reality anyway.
> >>> >
> >>> > does this sound like a feasible way? if this works, expiring
> >>> > counters should also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
>
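
For reference, here is a hypothetical sketch of the two merging rules and the
epoch.clock ordering from Yang's original proposal above (invented types, not
the actual CounterColumn.reconcile() change), replaying the add 1 / delete /
add 2 compaction example.

    // Illustration only: epoch-prefixed clocks for counter deltas.
    public class EpochClockDemo {
        record Delta(long epoch, long clock, long value) {}

        // Ordering rule: epoch1.clock1 > epoch2.clock2 iff
        //   epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static int compare(Delta a, Delta b) {
            if (a.epoch() != b.epoch())
                return Long.compare(a.epoch(), b.epoch());
            return Long.compare(a.clock(), b.clock());
        }

        // Merging two deltas during compaction.
        static Delta merge(Delta a, Delta b) {
            if (a.epoch() == b.epoch())               // same epoch: sum, keep newest clock
                return new Delta(a.epoch(), Math.max(a.clock(), b.clock()),
                                 a.value() + b.value());
            return compare(a, b) > 0 ? a : b;         // older-epoch delta is simply ignored
        }

        public static void main(String[] args) {
            // "add 1, clock 100" lands in epoch 0; the delete at clock 200 bumps
            // the epoch, so "add 2, clock 300" lands in epoch 1.
            Delta add1 = new Delta(0, 100, 1);
            Delta add2 = new Delta(1, 300, 2);

            // Compacting the 1st and 3rd operations no longer produces "add 3":
            // the epoch-0 delta is dropped and the result stays at value 2.
            System.out.println(merge(add1, add2));    // Delta[epoch=1, clock=300, value=2]
        }
    }

With the epoch prefix, merging the two adds during compaction yields value 2
in epoch 1 rather than the spurious "add 3, clock 300" from the problem
scenario, so the delete at clock 200 is no longer swallowed.
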