OK, I think it's better to understand it this way; then it is really simple and intuitive:

My proposed way of updating counters can simply be seen as a combination of a regular column plus the current counter columns:

  regular column: [ value: "wipes out every bucket to nil", clock: epoch number ]

Within each epoch, counter updates then work exactly as currently implemented.
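To make that concrete, here is a minimal sketch of the model in plain Python (illustrative only; the EpochCounter class and its method names are invented for this mail and are not Cassandra code): a delete behaves like a regular-column write that nils every bucket and bumps the epoch, and adds within one epoch merge exactly as counters do today.

class EpochCounter:
    def __init__(self):
        self.epoch = 0        # plays the role of the regular column's clock
        self.buckets = {}     # leader id -> (partial count, clock)

    def add(self, leader, delta, clock):
        count, _ = self.buckets.get(leader, (0, 0))
        self.buckets[leader] = (count + delta, clock)

    def delete(self):
        # the "regular column" write: wipe every bucket to nil, bump the epoch
        self.epoch += 1
        self.buckets = {}

    def total(self):
        return sum(count for count, _ in self.buckets.values())

c = EpochCounter()
c.add("A", 1, 100)
c.delete()               # epoch 0 ends here
c.add("A", 2, 300)       # this add lives in epoch 1, untouched by the delete
print(c.epoch, c.total())   # 1 2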
On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
> I think this approach also works for your scenario:
>
> I thought that the issue was only concerned with merging within the same
> leader, but you pointed out that a similar merging happens between leaders
> too. Now I see that the same rules on epoch numbers also apply to
> inter-leader data merging, specifically in your case:
>
> Everyone starts with an epoch of 0. (They should be the same; if not, it
> still works, we just consider them to be representing different time
> snapshots of the same counter state.)
>
> node A add 1      clock: 0.100   (epoch = 0, clock number = 100)
> node A delete     clock: 0.200
> node B add 2      clock: 0.300
>
> node A gets B's state: add 2, clock 0.300, but rejects it because A has
> already produced a delete with epoch 0, so A considers epoch 0 already
> ended; it won't accept any replicated state with epoch < 1.
>
> node B gets A's delete 0.200; it zeros its own count of "2" and updates
> its future expected epoch to 1.
>
> At this time, the state of the system is:
> node A: expected epoch = 1, [A:nil] [B:nil]
> same for node B
>
> Let's say we have the following further writes:
>
> node B adds 3     clock 1.400
> node A adds 4     clock 1.500
>
> node B receives A's add 4 and updates its copy of A.
> node A receives B's add 3 and updates its copy of B.
>
> Then the state is:
> node A: expected epoch = 1, [A:4 clock=500] [B:3 clock=400]
> node B: same
>
> Generally, I think the scheme is complete if we add the following rule for
> inter-leader replication:
>
> Each leader keeps a variable in memory (also persisted to the sstable when
> flushing), expected_epoch, initially set to 0.
>
> node P does:
>
>   on receiving updates from node Q:
>     if Q.expected_epoch > P.expected_epoch:
>       /** an epoch bump inherently means a previous delete, which
>           we probably missed, so we need to apply the delete;
>           a delete is global to all leaders, so apply it on all my
>           replicas **/
>       for all leaders in my vector:
>         count = nil
>       P.expected_epoch = Q.expected_epoch
>     if Q.expected_epoch == P.expected_epoch:
>       update P's copy of Q according to the standard rules
>     /** if Q.expected_epoch < P.expected_epoch, that means Q is less
>         up to date than us, so just ignore it **/
>
>   replicate_on_write(to Q):
>     if P.operation == delete:
>       P.expected_epoch++
>       set all my copies of all leaders to nil
>     send to Q (P.total, P.expected_epoch)
>
> Overall, I don't think deletes not being commutative is a fundamental
> blocker: regular columns are also not commutative, yet we achieve a stable
> result no matter what order they are applied in, because of the ordering
> rule used in reconciliation; here we just need to find a similar ordering
> rule. The epoch idea could be a step in this direction.
>
> Thanks
> Yang
>
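A rough, runnable transcription of the inter-leader rules quoted above, in plain Python (illustrative only; the Leader class, its method names, and the wire format are invented here and are not Cassandra's replication code):

class Leader:
    def __init__(self, name):
        self.name = name
        self.expected_epoch = 0
        self.copies = {}              # leader name -> (total, clock)

    def write(self, delta, clock, delete=False):
        if delete:
            # a delete bumps the epoch and nils this leader's copies of everyone
            self.expected_epoch += 1
            self.copies = {}
        else:
            total, _ = self.copies.get(self.name, (0, 0))
            self.copies[self.name] = (total + delta, clock)
        # replicate_on_write: ship (total, clock, expected_epoch) to peers
        tot, clk = self.copies.get(self.name, (0, 0))
        return (self.name, tot, clk, self.expected_epoch)

    def receive(self, update):
        src, total, clock, epoch = update
        if epoch > self.expected_epoch:
            # an epoch bump implies a delete we missed: apply it to all replicas
            self.copies = {}
            self.expected_epoch = epoch
        if epoch == self.expected_epoch:
            # standard rule within an epoch: the larger clock trumps
            _, old_clock = self.copies.get(src, (0, 0))
            if clock > old_clock:
                self.copies[src] = (total, clock)
        # epoch < expected_epoch: the sender is behind us, ignore

    def total(self):
        return sum(t for t, _ in self.copies.values())

# replay of the A/B scenario above
A, B = Leader("A"), Leader("B")
A.write(1, 100)
delete = A.write(0, 200, delete=True)
add2 = B.write(2, 300)
A.receive(add2)        # rejected: epoch 0 has already ended on A
B.receive(delete)      # B zeros its count and moves to epoch 1
add3 = B.write(3, 400)
add4 = A.write(4, 500)
B.receive(add4)
A.receive(add3)
print(A.total(), B.total(), A.expected_epoch)   # 7 7 1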
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> I don't think that's bulletproof either. For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original
>> delete-for-counters ticket,
>> https://issues.apache.org/jira/browse/CASSANDRA-2101): counter deletes
>> are not fully commutative, which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>> > As https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
>> > the problem with counter deletes is in scenarios like the following:
>> >
>> > add 1, clock 100
>> > delete, clock 200
>> > add 2, clock 300
>> >
>> > If the 1st and 3rd operations are merged in sstable compaction, then
>> > we have
>> >
>> > delete, clock 200
>> > add 3, clock 300
>> >
>> > which shows the wrong result.
>> >
>> > I think a relatively simple extension can completely fix this issue:
>> > similar to ZooKeeper, we can prefix an "epoch" number to the clock,
>> > so that
>> >
>> > 1) a delete operation increases the future epoch number by 1;
>> > 2) merging of delta adds can only happen between deltas of the same
>> > epoch, and deltas of an older epoch are simply ignored during merging.
>> > The merged result keeps the newest epoch number seen.
>> >
>> > Other operations remain the same as currently. Note that the above two
>> > rules are only concerned with merging the deltas on the leader, and are
>> > not related to the replicated count, which is a simple final state and
>> > observes the rule of "larger clock trumps". Naturally, the ordering
>> > rule is: epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2 ||
>> > (epoch1 == epoch2 && clock1 > clock2). Intuitively, "epoch" can be seen
>> > as the serial number of a new "incarnation" of a counter.
>> >
>> > The code change should be mostly localized to CounterColumn.reconcile(),
>> > although, if an update does not find an existing entry in the memtable,
>> > we need to go to the sstable to fetch any possible epoch number. So,
>> > compared to the current write path, in the "no replicate-on-write" case
>> > we need to add a read from the sstable; but in the "replicate-on-write"
>> > case we already do that read, so there is no extra time cost.
>> > "No replicate-on-write" is not a very useful setup in reality anyway.
>> >
>> > Does this sound like a feasible way? If this works, expiring counters
>> > should also work naturally.
>> >
>> > Thanks
>> > Yang
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
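For completeness, the two merge rules in the quoted proposal above can be sketched in a few lines of Python (illustrative only; merge_deltas, newer, and the tuple layout are made up for this mail and are not CounterColumn.reconcile()):

def merge_deltas(deltas):
    # deltas: list of (epoch, clock, value); compaction only sums deltas from
    # the newest epoch seen and ignores older-epoch deltas (rules 1 and 2)
    top_epoch = max(e for e, _, _ in deltas)
    kept = [(e, c, v) for e, c, v in deltas if e == top_epoch]
    return (top_epoch, max(c for _, c, _ in kept), sum(v for _, _, v in kept))

def newer(a, b):
    # epoch-prefixed clock ordering: epoch1.clock1 > epoch2.clock2 iff
    # epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2);
    # Python's tuple comparison is exactly this lexicographic rule
    return a > b

# the CASSANDRA-2101 case: add 1 @ 0.100, delete (epoch bump), add 2 @ 1.300;
# compaction can no longer fold the pre-delete add into the post-delete one
print(merge_deltas([(0, 100, 1), (1, 300, 2)]))   # (1, 300, 2)
print(newer((1, 300), (0, 200)))                  # True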