in "stronger reason", I mean the +3 is already merged up in memtable of node B, you can't find +1 and +2 any more
On Tue, Jun 14, 2011 at 7:02 PM, Yang <teddyyyy...@gmail.com> wrote:
> I almost got the code done, should release in a bit.
>
> Your scenario is not a problem with the implementation, but really with the
> definition of "same time". Remember that in a distributed system there is no
> absolute physical time; time is just another way of saying "before or
> after". In your scenario, since DCA and DCB are cut off and there are no
> messages between them, you can NOT determine logically whether the delete is
> before the +3 or after it. You may say "hey, the timestamp I gave the +3 is
> higher", but DCA may say "your timestamp is just drifted; actually my delete
> happened later".
>
> In fact, here is a stronger reason that you have to let go of the +3: it
> might already have been merged up from a +1, which happened in physical time
> earlier than our DCA delete, and a +2, which happened after the DCA delete.
> Now what would you say about whether the +3 is before or after our DCA
> delete? The only correct way to order them is to say: "sorry DCB, you missed
> the delete; all your later +2 operations were just against an earlier
> snapshot in time, and the eventual result is the delete." In other words, it
> is futile to update a dead epoch while others have started a new one. This
> is the same dilemma that you face during sstable merging.
>
> Overall, I think it's easier to understand if we realize that once you
> delete, all further edits on the counter are futile. An epoch is another way
> of saying "create a completely new counter"; the counter name we are using
> is just a kind of alias.
>
> Yang
>
> On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>> Who assigns those epoch numbers? You need all nodes to agree on the epoch
>> number somehow for this to work, but then how do you maintain that in a
>> partition-tolerant distributed system?
>>
>> I may have missed some parts of your proposal, but let me consider a
>> scenario that we have to be able to handle: take two nodes A and B (RF=2),
>> each in one data center (DCA and DCB), and a counter c. Suppose you do a +2
>> increment on c that both nodes get. Now say you have a network split and
>> the connection between your two data centers fails. In DCA you delete c;
>> only A gets it. In DCB you do more increments on c (say +3); only B gets
>> them. The partition can last for hours. For deletion to work, we would need
>> both nodes to eventually agree on the value 3 (i.e., only the second
>> increment) whenever the network partition is resolved. I don't see how you
>> could assign epoch numbers or anything else to fix that.
>>
>> --
>> Sylvain
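For illustration, a minimal sketch (hypothetical names, not Cassandra code) of how this partitioned scenario would resolve under the epoch rule proposed further down in the quoted thread; under that rule both nodes converge on the delete (nil), not on the 3 this scenario asks for:

    public class PartitionSketch {
        // Per-node view of counter c: an expected epoch plus a value, where
        // null means "nil" (deleted / no count in the current epoch).
        // Per-leader clocks are omitted to keep the sketch small.
        static final class Node {
            long expectedEpoch = 0;
            Long value = 0L;

            void add(long delta) { value = (value == null) ? delta : value + delta; }
            void delete()        { expectedEpoch++; value = null; }

            void receive(Node remote) {
                if (remote.expectedEpoch > expectedEpoch) {
                    // A higher epoch implies a delete we missed: adopt its state.
                    expectedEpoch = remote.expectedEpoch;
                    value = remote.value;
                }
                // Equal epochs would merge by per-leader clock (omitted here);
                // a lower remote epoch is stale and is simply ignored.
            }
        }

        public static void main(String[] args) {
            Node a = new Node(), b = new Node();
            a.add(2); b.add(2);          // the +2 that both data centers see
            a.delete();                  // DCA: delete, only A sees it (epoch -> 1)
            b.add(3);                    // DCB: more increments, only B sees them
            a.receive(b); b.receive(a);  // the partition heals
            // Both nodes end at epoch 1 with value null: the delete wins.
            System.out.println("A: epoch=" + a.expectedEpoch + " value=" + a.value);
            System.out.println("B: epoch=" + b.expectedEpoch + " value=" + b.value);
        }
    }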
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > OK, I think it's better to understand it this way; then it is really
>> > simple and intuitive. My proposed way of counter update can simply be
>> > seen as a combination of regular columns and the current counter columns:
>> >
>> >   regular column: [ value: "wipes every bucket out to nil", clock: epoch number ]
>> >
>> > Then, within each epoch, counter updates work as currently implemented.
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>
>> >> I think this approach also works for your scenario. I thought the issue
>> >> was only about merging within the same leader, but you pointed out that
>> >> a similar merge happens between leaders too. Now I see that the same
>> >> rules on epoch numbers also apply to inter-leader data merging,
>> >> specifically in your case.
>> >>
>> >> Everyone starts with an epoch of 0. (They should be the same; if not, it
>> >> still works, we just consider them to represent different time snapshots
>> >> of the same counter state.)
>> >>
>> >>   node A: add 1, clock 0.100 (epoch = 0, clock number = 100)
>> >>   node A: delete, clock 0.200
>> >>   node B: add 2, clock 0.300
>> >>   node A gets B's state (add 2, clock 0.300) but rejects it, because A
>> >>     has already produced a delete with epoch 0, so A considers epoch 0
>> >>     ended and won't accept any replicated state with epoch < 1.
>> >>   node B gets A's delete 0.200, zeros its own count of 2, and updates
>> >>     its future expected epoch to 1.
>> >>
>> >> At this point the state of the system is:
>> >>
>> >>   node A: expected epoch = 1, [A: nil] [B: nil]
>> >>   node B: same
>> >>
>> >> Let's say we have the following further writes:
>> >>
>> >>   node B: add 3, clock 1.400
>> >>   node A: add 4, clock 1.500
>> >>   node B receives A's add 4 and updates its copy of A
>> >>   node A receives B's add 3 and updates its copy of B
>> >>
>> >> Then the state is:
>> >>
>> >>   node A: expected epoch = 1, [A: 4, clock=500] [B: 3, clock=400]
>> >>   node B: same
>> >>
>> >> Generally, I think it should be complete if we add the following rule
>> >> for inter-leader replication. Each leader keeps a variable in memory
>> >> (also persisted to the sstable when flushing), expected_epoch, initially
>> >> set to 0. Node P does:
>> >>
>> >>   on receiving updates from node Q:
>> >>       if Q.expected_epoch > P.expected_epoch:
>> >>           /** an epoch bump inherently means a previous delete, which we
>> >>               probably missed, so we need to apply the delete; a delete
>> >>               is global to all leaders, so apply it to all my replicas **/
>> >>           for all leaders in my vector:
>> >>               count = nil
>> >>           P.expected_epoch = Q.expected_epoch
>> >>       if Q.expected_epoch == P.expected_epoch:
>> >>           update P's copy of Q according to the standard rules
>> >>       /** if Q.expected_epoch < P.expected_epoch, Q is less up to date
>> >>           than us, so just ignore it **/
>> >>
>> >>   replicate_on_write(to Q):
>> >>       if P.operation == delete:
>> >>           P.expected_epoch++
>> >>           set all my copies of all leaders to nil
>> >>       send (P.total, P.expected_epoch) to Q
>> >>
>> >> Overall, I don't think deletes being non-commutative is a fundamental
>> >> blocker: regular columns are not commutative either, yet we achieve a
>> >> stable result no matter what order they are applied in, because of the
>> >> ordering rule used in reconciliation; here we just need to find a
>> >> similar ordering rule. The epoch idea could be a step in that direction.
>> >>
>> >> Thanks
>> >> Yang
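As a concrete reading of the quoted rules, here is a minimal sketch (hypothetical names, not the actual Cassandra replication classes), assuming each node tracks a per-leader (count, clock) shard plus an expected_epoch:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the inter-leader rules quoted above: a higher remote epoch
    // implies a missed delete (wipe everything, adopt the epoch), an equal
    // epoch merges by clock, and a lower epoch is ignored as stale.
    public class EpochReplicationSketch {
        static final class Shard {
            Long count;        // null means "nil" (deleted / no count this epoch)
            final long clock;
            Shard(Long count, long clock) { this.count = count; this.clock = clock; }
            public String toString() { return count + "@" + clock; }
        }

        long expectedEpoch = 0;
        final Map<String, Shard> shards = new HashMap<String, Shard>();

        // "on receiving updates from node Q"
        void onReceive(long remoteEpoch, String leader, Shard remote) {
            if (remoteEpoch > expectedEpoch) {
                // The epoch bump implies a delete we missed; a delete is global
                // to all leaders, so wipe every shard before adopting the epoch.
                for (Shard s : shards.values())
                    s.count = null;
                expectedEpoch = remoteEpoch;
            }
            if (remoteEpoch == expectedEpoch) {
                // Standard rule within an epoch: the higher clock trumps.
                Shard local = shards.get(leader);
                if (local == null || remote.clock > local.clock)
                    shards.put(leader, remote);
            }
            // remoteEpoch < expectedEpoch: the sender is behind; ignore it.
        }

        // Local write within the current epoch.
        void addLocally(String self, long delta, long clock) {
            Shard s = shards.get(self);
            long base = (s == null || s.count == null) ? 0 : s.count;
            shards.put(self, new Shard(base + delta, clock));
        }

        // "replicate_on_write" with a delete: bump the epoch and wipe all shards;
        // the (total, expected_epoch) pair would then be what gets sent out.
        void deleteLocally() {
            expectedEpoch++;
            for (Shard s : shards.values())
                s.count = null;
        }

        public static void main(String[] args) {
            EpochReplicationSketch a = new EpochReplicationSketch();
            EpochReplicationSketch b = new EpochReplicationSketch();
            a.addLocally("A", 1, 100);                  // node A: add 1, clock 0.100
            a.deleteLocally();                          // node A: delete -> epoch 1
            b.addLocally("B", 2, 300);                  // node B: add 2, clock 0.300
            a.onReceive(0, "B", new Shard(2L, 300));    // A rejects it: epoch 0 < 1
            b.onReceive(1, "A", new Shard(null, 200));  // B wipes its counts, epoch -> 1
            // Both nodes are now at expected epoch 1 with every count nil.
            System.out.println("A: epoch=" + a.expectedEpoch + " shards=" + a.shards);
            System.out.println("B: epoch=" + b.expectedEpoch + " shards=" + b.shards);
        }
    }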
>> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> >>>
>> >>> I don't think that's bulletproof either. For instance, what if the two
>> >>> adds go to replica 1 but the delete to replica 2?
>> >>>
>> >>> Bottom line (and this was discussed on the original delete-for-counters
>> >>> ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101): counter
>> >>> deletes are not fully commutative, which makes them fragile.
>> >>>
>> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates, the
>> >>> > problem with counter delete shows up in scenarios like the following:
>> >>> >
>> >>> >   add 1, clock 100
>> >>> >   delete, clock 200
>> >>> >   add 2, clock 300
>> >>> >
>> >>> > If the 1st and 3rd operations are merged in sstable compaction, we end
>> >>> > up with:
>> >>> >
>> >>> >   delete, clock 200
>> >>> >   add 3, clock 300
>> >>> >
>> >>> > which shows the wrong result.
>> >>> >
>> >>> > I think a relatively simple extension can completely fix this issue:
>> >>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock, so
>> >>> > that
>> >>> >
>> >>> >   1) a delete operation increases the future epoch number by 1;
>> >>> >   2) merging of delta adds can happen only between deltas of the same
>> >>> >      epoch; deltas of an older epoch are simply ignored during merging,
>> >>> >      and the merged result keeps the newest epoch number seen.
>> >>> >
>> >>> > Other operations remain the same as today. Note that the above two
>> >>> > rules only concern merging of the deltas on the leader; they are not
>> >>> > related to the replicated count, which is a simple final state and
>> >>> > observes the rule of "larger clock trumps". Naturally, the ordering
>> >>> > rule is:
>> >>> >
>> >>> >   epoch1.clock1 > epoch2.clock2 iff
>> >>> >     epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
>> >>> >
>> >>> > Intuitively, an "epoch" can be seen as the serial number of a new
>> >>> > "incarnation" of a counter.
>> >>> >
>> >>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> >>> > although if an update does not find an existing entry in the memtable,
>> >>> > we need to go to the sstable to fetch any possible epoch number. So,
>> >>> > compared to the current write path, in the "no replicate-on-write"
>> >>> > case we need to add a read of the sstable; in the "replicate-on-write"
>> >>> > case we already do that read, so there is no extra time cost. "No
>> >>> > replicate-on-write" is not a very useful setup in reality anyway.
>> >>> >
>> >>> > Does this sound like a feasible approach? If it works, expiring
>> >>> > counters should also naturally work.
>> >>> >
>> >>> > Thanks
>> >>> > Yang
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of DataStax, the source for professional Cassandra support
>> >>> http://www.datastax.com
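For reference, a minimal sketch of the epoch-prefixed ordering and merge rules quoted above (hypothetical names; per the proposal, the real change would live in CounterColumn.reconcile()):

    // Rule 1 is the delete bumping the epoch; rule 2 is that deltas merge only
    // within the same epoch, with older-epoch deltas dropped. Ordering is by
    // epoch first, then clock.
    public class EpochClockSketch {
        static final class Delta {
            final long epoch, clock, value;
            Delta(long epoch, long clock, long value) {
                this.epoch = epoch; this.clock = clock; this.value = value;
            }
        }

        // epoch1.clock1 > epoch2.clock2 iff
        //   epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static int compare(Delta a, Delta b) {
            if (a.epoch != b.epoch) return Long.compare(a.epoch, b.epoch);
            return Long.compare(a.clock, b.clock);
        }

        static Delta merge(Delta a, Delta b) {
            if (a.epoch == b.epoch)   // same epoch: sum the deltas, keep the max clock
                return new Delta(a.epoch, Math.max(a.clock, b.clock), a.value + b.value);
            return compare(a, b) > 0 ? a : b;   // the older-epoch delta is simply dropped
        }

        public static void main(String[] args) {
            Delta add1 = new Delta(0, 100, 1);  // add 1, clock 100 (before the delete)
            Delta add2 = new Delta(1, 300, 2);  // add 2, clock 300 (delete bumped the epoch)
            Delta merged = merge(add1, add2);   // compaction drops the epoch-0 delta
            System.out.println(merged.epoch + "." + merged.clock + " -> " + merged.value);
            // Prints "1.300 -> 2": the post-delete value, instead of the wrong 3.
        }
    }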