in "stronger reason", I mean the +3 is already merged up in memtable of node B, you can't find +1 and +2 any more
On Tue, Jun 14, 2011 at 7:02 PM, Yang <teddyyyy...@gmail.com> wrote:
> I almost got the code done, should release in a bit.
>
> Your scenario is not a problem with the implementation, but really with the
> definition of "same time". Remember that in a distributed system there is no
> absolute physical time; time is just another way of saying "before or
> after". In your scenario, since DCA and DCB are cut off and there are no
> messages between them, you can NOT determine logically whether the delete is
> before the +3 or after it. You may say "hey, the timestamp I gave the +3 is
> higher", but DCA may say "your timestamp is just drifted; actually my delete
> happened later".
>
> In fact, here is a stronger reason that you have to let go of the +3: it
> might already have been merged up from a +1, which happened in physical time
> earlier than our DCA delete, and a +2, which happened after the DCA delete.
> Now what would you say about whether the +3 is before or after our DCA
> delete? The only correct way to order them is to say: "sorry DCB, you missed
> the delete; all your later +2 operations were just against an earlier
> snapshot in time, and the eventual result is the delete." In other words, it
> is futile to update a dead epoch while others have started a new one. This
> is the same dilemma that you face during sstable merging.
>
> Overall, I think it's easier to understand if we realize that once you
> delete, all further edits on the counter are futile. An epoch is another way
> of saying "create a completely new counter"; the counter name we are using
> is just a kind of alias.
>
> Yang
>
> On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>> Who assigns those epoch numbers? You need all nodes to agree on the epoch
>> number somehow for this to work, but then how do you maintain that in a
>> partition-tolerant distributed system?
>>
>> I may have missed some parts of your proposal, but let me consider a
>> scenario that we have to be able to handle: take two nodes A and B (RF=2),
>> each in one data center (DCA and DCB), and a counter c. Suppose you do a +2
>> increment on c that both nodes get. Now say you have a network split and
>> the connection between your two data centers fails. In DCA you delete c;
>> only A gets it. In DCB you do more increments on c (say +3); only B gets
>> them. The partition can last for hours. For deletion to work, we would need
>> both nodes to eventually agree on the value 3 (i.e., only the second
>> increment) whenever the network partition is resolved. I don't see how you
>> could assign epoch numbers or anything else to fix that.
>>
>> --
>> Sylvain
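For illustration, a minimal sketch (hypothetical names, not Cassandra code) of how this partitioned scenario would resolve under the epoch rule proposed further down in the quoted thread; under that rule both nodes converge on the delete (nil), not on the 3 this scenario asks for:

    public class PartitionSketch {
        // Per-node view of counter c: an expected epoch plus a value, where
        // null means "nil" (deleted / no count in the current epoch).
        // Per-leader clocks are omitted to keep the sketch small.
        static final class Node {
            long expectedEpoch = 0;
            Long value = 0L;

            void add(long delta) { value = (value == null) ? delta : value + delta; }
            void delete()        { expectedEpoch++; value = null; }

            void receive(Node remote) {
                if (remote.expectedEpoch > expectedEpoch) {
                    // A higher epoch implies a delete we missed: adopt its state.
                    expectedEpoch = remote.expectedEpoch;
                    value = remote.value;
                }
                // Equal epochs would merge by per-leader clock (omitted here);
                // a lower remote epoch is stale and is simply ignored.
            }
        }

        public static void main(String[] args) {
            Node a = new Node(), b = new Node();
            a.add(2); b.add(2);          // the +2 that both data centers see
            a.delete();                  // DCA: delete, only A sees it (epoch -> 1)
            b.add(3);                    // DCB: more increments, only B sees them
            a.receive(b); b.receive(a);  // the partition heals
            // Both nodes end at epoch 1 with value null: the delete wins.
            System.out.println("A: epoch=" + a.expectedEpoch + " value=" + a.value);
            System.out.println("B: epoch=" + b.expectedEpoch + " value=" + b.value);
        }
    }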
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > OK, I think it's better to understand it this way; then it is really
>> > simple and intuitive. My proposed way of counter update can simply be
>> > seen as a combination of regular columns and the current counter columns:
>> >
>> >   regular column: [ value: "wipes every bucket out to nil", clock: epoch number ]
>> >
>> > Then, within each epoch, counter updates work as currently implemented.
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>
>> >> I think this approach also works for your scenario. I thought the issue
>> >> was only about merging within the same leader, but you pointed out that
>> >> a similar merge happens between leaders too. Now I see that the same
>> >> rules on epoch numbers also apply to inter-leader data merging,
>> >> specifically in your case.
>> >>
>> >> Everyone starts with an epoch of 0. (They should be the same; if not, it
>> >> still works, we just consider them to represent different time snapshots
>> >> of the same counter state.)
>> >>
>> >>   node A: add 1, clock 0.100 (epoch = 0, clock number = 100)
>> >>   node A: delete, clock 0.200
>> >>   node B: add 2, clock 0.300
>> >>   node A gets B's state (add 2, clock 0.300) but rejects it, because A
>> >>     has already produced a delete with epoch 0, so A considers epoch 0
>> >>     ended and won't accept any replicated state with epoch < 1.
>> >>   node B gets A's delete 0.200, zeros its own count of 2, and updates
>> >>     its future expected epoch to 1.
>> >>
>> >> At this point the state of the system is:
>> >>
>> >>   node A: expected epoch = 1, [A: nil] [B: nil]
>> >>   node B: same
>> >>
>> >> Let's say we have the following further writes:
>> >>
>> >>   node B: add 3, clock 1.400
>> >>   node A: add 4, clock 1.500
>> >>   node B receives A's add 4 and updates its copy of A
>> >>   node A receives B's add 3 and updates its copy of B
>> >>
>> >> Then the state is:
>> >>
>> >>   node A: expected epoch = 1, [A: 4, clock=500] [B: 3, clock=400]
>> >>   node B: same
>> >>
>> >> Generally, I think it should be complete if we add the following rule
>> >> for inter-leader replication. Each leader keeps a variable in memory
>> >> (also persisted to the sstable when flushing), expected_epoch, initially
>> >> set to 0. Node P does:
>> >>
>> >>   on receiving updates from node Q:
>> >>       if Q.expected_epoch > P.expected_epoch:
>> >>           /** an epoch bump inherently means a previous delete, which we
>> >>               probably missed, so we need to apply the delete; a delete
>> >>               is global to all leaders, so apply it to all my replicas **/
>> >>           for all leaders in my vector:
>> >>               count = nil
>> >>           P.expected_epoch = Q.expected_epoch
>> >>       if Q.expected_epoch == P.expected_epoch:
>> >>           update P's copy of Q according to the standard rules
>> >>       /** if Q.expected_epoch < P.expected_epoch, Q is less up to date
>> >>           than us, so just ignore it **/
>> >>
>> >>   replicate_on_write(to Q):
>> >>       if P.operation == delete:
>> >>           P.expected_epoch++
>> >>           set all my copies of all leaders to nil
>> >>       send (P.total, P.expected_epoch) to Q
>> >>
>> >> Overall, I don't think deletes being non-commutative is a fundamental
>> >> blocker: regular columns are not commutative either, yet we achieve a
>> >> stable result no matter what order they are applied in, because of the
>> >> ordering rule used in reconciliation; here we just need to find a
>> >> similar ordering rule. The epoch idea could be a step in that direction.
>> >>
>> >> Thanks
>> >> Yang
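As a concrete reading of the quoted rules, here is a minimal sketch (hypothetical names, not the actual Cassandra replication classes), assuming each node tracks a per-leader (count, clock) shard plus an expected_epoch:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the inter-leader rules quoted above: a higher remote epoch
    // implies a missed delete (wipe everything, adopt the epoch), an equal
    // epoch merges by clock, and a lower epoch is ignored as stale.
    public class EpochReplicationSketch {
        static final class Shard {
            Long count;        // null means "nil" (deleted / no count this epoch)
            final long clock;
            Shard(Long count, long clock) { this.count = count; this.clock = clock; }
            public String toString() { return count + "@" + clock; }
        }

        long expectedEpoch = 0;
        final Map<String, Shard> shards = new HashMap<String, Shard>();

        // "on receiving updates from node Q"
        void onReceive(long remoteEpoch, String leader, Shard remote) {
            if (remoteEpoch > expectedEpoch) {
                // The epoch bump implies a delete we missed; a delete is global
                // to all leaders, so wipe every shard before adopting the epoch.
                for (Shard s : shards.values())
                    s.count = null;
                expectedEpoch = remoteEpoch;
            }
            if (remoteEpoch == expectedEpoch) {
                // Standard rule within an epoch: the higher clock trumps.
                Shard local = shards.get(leader);
                if (local == null || remote.clock > local.clock)
                    shards.put(leader, remote);
            }
            // remoteEpoch < expectedEpoch: the sender is behind; ignore it.
        }

        // Local write within the current epoch.
        void addLocally(String self, long delta, long clock) {
            Shard s = shards.get(self);
            long base = (s == null || s.count == null) ? 0 : s.count;
            shards.put(self, new Shard(base + delta, clock));
        }

        // "replicate_on_write" with a delete: bump the epoch and wipe all shards;
        // the (total, expected_epoch) pair would then be what gets sent out.
        void deleteLocally() {
            expectedEpoch++;
            for (Shard s : shards.values())
                s.count = null;
        }

        public static void main(String[] args) {
            EpochReplicationSketch a = new EpochReplicationSketch();
            EpochReplicationSketch b = new EpochReplicationSketch();
            a.addLocally("A", 1, 100);                  // node A: add 1, clock 0.100
            a.deleteLocally();                          // node A: delete -> epoch 1
            b.addLocally("B", 2, 300);                  // node B: add 2, clock 0.300
            a.onReceive(0, "B", new Shard(2L, 300));    // A rejects it: epoch 0 < 1
            b.onReceive(1, "A", new Shard(null, 200));  // B wipes its counts, epoch -> 1
            // Both nodes are now at expected epoch 1 with every count nil.
            System.out.println("A: epoch=" + a.expectedEpoch + " shards=" + a.shards);
            System.out.println("B: epoch=" + b.expectedEpoch + " shards=" + b.shards);
        }
    }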
>> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> >>>
>> >>> I don't think that's bulletproof either. For instance, what if the two
>> >>> adds go to replica 1 but the delete to replica 2?
>> >>>
>> >>> Bottom line (and this was discussed on the original delete-for-counters
>> >>> ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101): counter
>> >>> deletes are not fully commutative, which makes them fragile.
>> >>>
>> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates, the
>> >>> > problem with counter delete shows up in scenarios like the following:
>> >>> >
>> >>> >   add 1, clock 100
>> >>> >   delete, clock 200
>> >>> >   add 2, clock 300
>> >>> >
>> >>> > If the 1st and 3rd operations are merged in sstable compaction, we end
>> >>> > up with:
>> >>> >
>> >>> >   delete, clock 200
>> >>> >   add 3, clock 300
>> >>> >
>> >>> > which shows the wrong result.
>> >>> >
>> >>> > I think a relatively simple extension can completely fix this issue:
>> >>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock, so
>> >>> > that
>> >>> >
>> >>> >   1) a delete operation increases the future epoch number by 1;
>> >>> >   2) merging of delta adds can happen only between deltas of the same
>> >>> >      epoch; deltas of an older epoch are simply ignored during merging,
>> >>> >      and the merged result keeps the newest epoch number seen.
>> >>> >
>> >>> > Other operations remain the same as today. Note that the above two
>> >>> > rules only concern merging of the deltas on the leader; they are not
>> >>> > related to the replicated count, which is a simple final state and
>> >>> > observes the rule of "larger clock trumps". Naturally, the ordering
>> >>> > rule is:
>> >>> >
>> >>> >   epoch1.clock1 > epoch2.clock2 iff
>> >>> >     epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
>> >>> >
>> >>> > Intuitively, an "epoch" can be seen as the serial number of a new
>> >>> > "incarnation" of a counter.
>> >>> >
>> >>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> >>> > although if an update does not find an existing entry in the memtable,
>> >>> > we need to go to the sstable to fetch any possible epoch number. So,
>> >>> > compared to the current write path, in the "no replicate-on-write"
>> >>> > case we need to add a read of the sstable; in the "replicate-on-write"
>> >>> > case we already do that read, so there is no extra time cost. "No
>> >>> > replicate-on-write" is not a very useful setup in reality anyway.
>> >>> >
>> >>> > Does this sound like a feasible approach? If it works, expiring
>> >>> > counters should also naturally work.
>> >>> >
>> >>> > Thanks
>> >>> > Yang
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of DataStax, the source for professional Cassandra support
>> >>> http://www.datastax.com
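For reference, a minimal sketch of the epoch-prefixed ordering and merge rules quoted above (hypothetical names; per the proposal, the real change would live in CounterColumn.reconcile()):

    // Rule 1 is the delete bumping the epoch; rule 2 is that deltas merge only
    // within the same epoch, with older-epoch deltas dropped. Ordering is by
    // epoch first, then clock.
    public class EpochClockSketch {
        static final class Delta {
            final long epoch, clock, value;
            Delta(long epoch, long clock, long value) {
                this.epoch = epoch; this.clock = clock; this.value = value;
            }
        }

        // epoch1.clock1 > epoch2.clock2 iff
        //   epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static int compare(Delta a, Delta b) {
            if (a.epoch != b.epoch) return Long.compare(a.epoch, b.epoch);
            return Long.compare(a.clock, b.clock);
        }

        static Delta merge(Delta a, Delta b) {
            if (a.epoch == b.epoch)   // same epoch: sum the deltas, keep the max clock
                return new Delta(a.epoch, Math.max(a.clock, b.clock), a.value + b.value);
            return compare(a, b) > 0 ? a : b;   // the older-epoch delta is simply dropped
        }

        public static void main(String[] args) {
            Delta add1 = new Delta(0, 100, 1);  // add 1, clock 100 (before the delete)
            Delta add2 = new Delta(1, 300, 2);  // add 2, clock 300 (delete bumped the epoch)
            Delta merged = merge(add1, add2);   // compaction drops the epoch-0 delta
            System.out.println(merged.epoch + "." + merged.clock + " -> " + merged.value);
            // Prints "1.300 -> 2": the post-delete value, instead of the wrong 3.
        }
    }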