Yes, the epoch is generated by each node in the replica set upon a delete
operation.

The epoch is **global** to the replica set for one counter, in contrast to
the clock, which is local to a partition.
Different counters have different epoch numbers, because different counters
can be seen as completely independent state machines: you can view the
nodes in the replica set as acting as a separate node for each counter,
i.e. there are effectively millions of replica sets, one per counter.

In fact we already have the epoch concept here, in the form of
timestampOfLastDelete, but the latter is used in the wrong way: it should
never be compared to timestamp().
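
To make this concrete, here is a minimal sketch (purely illustrative, not
actual Cassandra code; the class and field names are made up) of a
per-counter state where the epoch takes the role of timestampOfLastDelete
and is only ever compared to other epochs:

// Sketch only: the epoch replaces timestampOfLastDelete and is never
// compared to a wall-clock timestamp(), only to other epochs of the
// same counter.
final class CounterState {            // hypothetical name, not a real class
    long epoch = 0;                   // shared by the replica set for this one counter
    long clock = 0;                   // the usual logical clock, local to the partition
    long value = 0;

    // A delete starts a new "incarnation" of the counter: bump the epoch
    // and wipe the accumulated value for the new epoch.
    void applyLocalDelete() {
        epoch += 1;
        value = 0;
    }

    // An increment only merges with state from the same epoch; older
    // epochs are ignored (they were wiped by an intervening delete).
    void applyIncrement(long delta, long epochOfWrite, long clockOfWrite) {
        if (epochOfWrite < epoch)
            return;                   // stale: belongs to a deleted incarnation
        if (epochOfWrite > epoch) {   // we missed a delete; catch up first
            epoch = epochOfWrite;
            value = 0;
        }
        value += delta;
        clock = Math.max(clock, clockOfWrite);
    }
}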




On Tue, Jun 14, 2011 at 12:26 PM, Milind Parikh <milindpar...@gmail.com> wrote:

> If I understand this correctly, then the epoch integer would be
> generated by each node. Since time always flows forward, the assumption
> would be, I suppose, that the epochs would be tagged with the node that
> generated them and additionally the counter would carry as much history as
> necessary (and presumably not all history at all times).
>
> Milind
>
>
> On Tue, Jun 14, 2011 at 2:21 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:
>
>> Who assigns those epoch numbers?
>> You need all nodes to agree on the epoch number somehow for this to work,
>> but then how do you maintain that in a partition-tolerant distributed
>> system?
>>
>> I may have missed some parts of your proposal, but let me consider a
>> scenario that we have to be able to handle: consider two nodes A and B
>> (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose
>> you do a +2 increment on c that both nodes get. Now let's say you have a
>> network split and the connection between your two data centers fails. In
>> DCA you delete c; only A gets it. In DCB, you do more increments on c
>> (say +3); only B gets it. The partition can last for hours. For deletion
>> to work, we would need that whenever the network partition is resolved,
>> both nodes eventually agree on the value 3 (i.e., only the second
>> increment). I don't see how you could assign epoch numbers or anything
>> else to fix that.
>>
>> --
>> Sylvain
>>
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
>> > OK, I think it's better to understand it this way; then it is really
>> > simple and intuitive. My proposed way of counter updates can be seen as
>> > a combination of regular columns + current counter columns:
>> > regular column: [ value: "wipes out every bucket to nil", clock: epoch
>> > number ]
>> > Then, within each epoch, counter updates work as currently implemented.
>> >
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>
>> >> I think this approach also works for your scenario.
>> >> I thought the issue was only concerned with merging within the same
>> >> leader, but you pointed out that a similar merging happens between
>> >> leaders too. Now I see that the same rules on the epoch number also
>> >> apply to inter-leader data merging, specifically in your case:
>> >>
>> >> Everyone starts with an epoch of 0 (they should be the same; if not, it
>> >> also works, we just consider them to represent different time snapshots
>> >> of the same counter state).
>> >> node A   add 1     clock: 0.100  (epoch = 0, clock number = 100)
>> >> node A   delete    clock: 0.200
>> >> node B   add 2     clock: 0.300
>> >> node A gets B's state (add 2, clock 0.300) but rejects it, because A has
>> >> already produced a delete with epoch 0, so A considers epoch 0 already
>> >> ended and won't accept any replicated state with epoch < 1.
>> >> node B gets A's delete (0.200); it zeros its own count of "2" and
>> >> updates its expected future epoch to 1.
>> >> At this point the state of the system is:
>> >> node A   expected epoch = 1   [A:nil] [B:nil]
>> >> same for node B
>> >>
>> >>
>> >> Let's say we have the following further writes:
>> >> node B   add 3   clock: 1.400
>> >> node A   add 4   clock: 1.500
>> >> node B receives A's add 4, and updates its copy of A.
>> >> node A receives B's add 3, and updates its copy of B.
>> >>
>> >> Then the state is:
>> >> node A   expected epoch = 1   [A:4  clock=500] [B:3  clock=400]
>> >> node B   same
>> >>
>> >> Generally, I think it should be complete if we add the following rule
>> >> for inter-leader replication: each leader keeps a variable in memory
>> >> (and also persists it to the sstable when flushing), expected_epoch,
>> >> initially set to 0. Node P does:
>> >>
>> >> on receiving updates from node Q:
>> >>     if Q.expected_epoch > P.expected_epoch
>> >>         /** an epoch bump inherently means a previous delete, which we
>> >>             probably missed, so we need to apply the delete;
>> >>             a delete is global to all leaders, so apply it on all my
>> >>             replicas **/
>> >>         for all leaders in my vector
>> >>             count = nil
>> >>         P.expected_epoch = Q.expected_epoch
>> >>     if Q.expected_epoch == P.expected_epoch
>> >>         update P's copy of Q according to standard rules
>> >>     /** if Q.expected_epoch < P.expected_epoch, that means Q is less
>> >>         up to date than us, so just ignore **/
>> >>
>> >> replicate_on_write(to Q):
>> >>     if P.operation == delete
>> >>         P.expected_epoch++
>> >>         set all my copies of all leaders to nil
>> >>     send to Q (P.total, P.expected_epoch)
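>> >>
>> >> In Java-ish form, a rough sketch of the same rule (purely illustrative;
>> >> the class and field names are made up, this is not actual Cassandra
>> >> code) could look like:
>> >>
>> >> // Illustrative sketch only of the rule above; Shard, expectedEpoch,
>> >> // etc. are made-up names for this example.
>> >> import java.util.HashMap;
>> >> import java.util.Map;
>> >>
>> >> class LeaderReplica {
>> >>     static final class Shard { long count; long clock; }
>> >>
>> >>     long expectedEpoch = 0;
>> >>     final Map<String, Shard> shards = new HashMap<>();  // one shard per leader
>> >>
>> >>     // Node P receives (count, clock, expected_epoch) replicated from leader Q.
>> >>     void onReceive(String qLeader, long qCount, long qClock, long qExpectedEpoch) {
>> >>         if (qExpectedEpoch > expectedEpoch) {
>> >>             // An epoch bump implies a delete we probably missed; a delete
>> >>             // is global to all leaders, so wipe every local shard first.
>> >>             for (Shard s : shards.values()) { s.count = 0; s.clock = 0; }
>> >>             expectedEpoch = qExpectedEpoch;
>> >>         }
>> >>         if (qExpectedEpoch == expectedEpoch) {
>> >>             // Same epoch: update P's copy of Q by the usual higher-clock-wins rule.
>> >>             Shard s = shards.computeIfAbsent(qLeader, k -> new Shard());
>> >>             if (qClock > s.clock) { s.count = qCount; s.clock = qClock; }
>> >>         }
>> >>         // qExpectedEpoch < expectedEpoch: Q is behind us, ignore its state.
>> >>     }
>> >>
>> >>     // Local delete on P, done before replicate-on-write sends (total, expectedEpoch).
>> >>     void onLocalDelete() {
>> >>         expectedEpoch++;
>> >>         for (Shard s : shards.values()) { s.count = 0; s.clock = 0; }
>> >>     }
>> >> }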
>> >>
>> >>
>> >>
>> >> Overall, I don't think deletes not being commutative is a fundamental
>> >> blocker: regular columns are also not commutative, yet we achieve a
>> >> stable result no matter what order they are applied in, because of the
>> >> ordering rule used in reconciliation; here we just need to find a
>> >> similar ordering rule. The epoch idea could be a step in this direction.
>> >>
>> >> Thanks
>> >> Yang
>> >>
>> >>
>> >>
>> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> >>>
>> >>> I don't think that's bulletproof either.  For instance, what if the
>> >>> two adds go to replica 1 but the delete to replica 2?
>> >>>
>> >>> Bottom line (and this was discussed on the original delete-for-counters
>> >>> ticket, https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
>> >>> deletes are not fully commutative, which makes them fragile.
>> >>>
>> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>> >>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
>> >>> > the problem with counter deletes is in scenarios like the following:
>> >>> > add 1, clock 100
>> >>> > delete, clock 200
>> >>> > add 2, clock 300
>> >>> > If the 1st and 3rd operations are merged in SSTable compaction, then
>> >>> > we have:
>> >>> > delete, clock 200
>> >>> > add 3, clock 300
>> >>> > which shows the wrong result.
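>> >>> >
>> >>> > As a toy illustration (not Cassandra code; it just assumes deltas
>> >>> > merge by summing counts and keeping the larger clock):
>> >>> >
>> >>> > // Tiny illustration of the problem: during compaction, deltas merge
>> >>> > // by summing counts and keeping the max clock, so a delete that sits
>> >>> > // between two deltas can no longer suppress the earlier one.
>> >>> > class CompactionBugDemo {
>> >>> >     record Delta(long count, long clock) {}
>> >>> >
>> >>> >     static Delta merge(Delta a, Delta b) {   // delta + delta during compaction
>> >>> >         return new Delta(a.count() + b.count(), Math.max(a.clock(), b.clock()));
>> >>> >     }
>> >>> >
>> >>> >     public static void main(String[] args) {
>> >>> >         Delta merged = merge(new Delta(1, 100), new Delta(2, 300)); // add 3, clock 300
>> >>> >         long deleteClock = 200;
>> >>> >         // The delete (clock 200) is now older than the merged clock 300,
>> >>> >         // so it cannot suppress the first add: we read 3 instead of 2.
>> >>> >         long result = deleteClock > merged.clock() ? 0 : merged.count();
>> >>> >         System.out.println(result);          // prints 3
>> >>> >     }
>> >>> > }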
>> >>> >
>> >>> > I think a relatively simple extension can be used to completely fix
>> >>> > this issue: similar to ZooKeeper, we can prefix an "epoch" number to
>> >>> > the clock, so that
>> >>> >    1) a delete operation increases the future epoch number by 1
>> >>> >    2) merging of delta adds can only happen between deltas of the same
>> >>> > epoch; deltas of an older epoch are simply ignored during merging, and
>> >>> > the merged result keeps the newest epoch number seen.
>> >>> > Other operations remain the same as currently. Note that the above two
>> >>> > rules are only concerned with merging within the deltas on the leader,
>> >>> > and are not related to the replicated count, which is a simple final
>> >>> > state and observes the rule of "larger clock trumps". Naturally, the
>> >>> > ordering rule is: epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2,
>> >>> > or epoch1 == epoch2 && clock1 > clock2.
>> >>> > Intuitively, "epoch" can be seen as the serial number of a new
>> >>> > "incarnation" of the counter.
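>> >>> >
>> >>> > As a small illustrative sketch (made-up class, not real code), that
>> >>> > ordering rule written as a Java comparator:
>> >>> >
>> >>> > // Sketch only: the epoch.clock ordering rule as a comparator.
>> >>> > // The epoch dominates; the clock only breaks ties within the same epoch.
>> >>> > import java.util.Comparator;
>> >>> >
>> >>> > final class EpochClock {
>> >>> >     final long epoch;
>> >>> >     final long clock;
>> >>> >     EpochClock(long epoch, long clock) { this.epoch = epoch; this.clock = clock; }
>> >>> >
>> >>> >     static final Comparator<EpochClock> ORDER =
>> >>> >         Comparator.comparingLong((EpochClock c) -> c.epoch)
>> >>> >                   .thenComparingLong(c -> c.clock);
>> >>> > }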
>> >>> >
>> >>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> >>> > although if an update does not find an existing entry in the memtable,
>> >>> > we need to go to the sstable to fetch any possible epoch number. So,
>> >>> > compared to the current write path, in the "no replicate-on-write"
>> >>> > case we need to add a read of the sstable; but in the
>> >>> > "replicate-on-write" case we already read that, so there is no extra
>> >>> > time cost. "No replicate-on-write" is not a very useful setup in
>> >>> > reality anyway.
>> >>> >
>> >>> > Does this sound like a feasible way to do it? If it works, expiring
>> >>> > counters should also naturally work.
>> >>> >
>> >>> > Thanks
>> >>> > Yang
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of DataStax, the source for professional Cassandra support
>> >>> http://www.datastax.com
>> >>
>> >
>> >
>>
>
>
