I think this approach also works for your scenario:

I originally thought the issue was only about merging within the same
leader, but you pointed out that a similar merge happens between leaders
too. Now I see that the same epoch-number rules also apply to inter-leader
data merging, specifically in your case:


everyone starts with an epoch of 0 (they should be the same; if not, it
still works, since we just consider them to represent different time
snapshots of the same counter state)

node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)

node A      delete    clock:  0.200

node B     add 2     clock:  0.300

node A    gets B's state (add 2, clock 0.300) but rejects it, because A has
already produced a delete with epoch 0; A considers epoch 0 already ended,
so it won't accept any replicated state with epoch < 1.

node B    gets A's delete (clock 0.200); it zeros its own count of 2 and
updates its expected epoch to 1.

at this time, the state of the system is:
node A    expected epoch = 1   [A:nil] [B:nil]
node B    same



let's say we have the following further writes:

node B  add 3  clock  1.400

node A   add 4   clock  1.500

node B receives A's add 4 and updates its copy of A
node A receives B's add 3 and updates its copy of B


then the state is:
node A    expected epoch = 1   [A:4  clock=500] [B:3  clock=400]
node B    same
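

As a sanity check on the trace above, here is a minimal sketch in Java of the
epoch gate it relies on; the method name acceptsReplicatedState is mine, just
for illustration, and is not anything in the Cassandra code base:

    // gate for replicated counter state: ignore anything from an older epoch;
    // a newer epoch means a delete was missed and must be applied first
    static boolean acceptsReplicatedState(long updateEpoch, long expectedEpoch) {
        return updateEpoch >= expectedEpoch;
    }

    // node A after its delete has expectedEpoch == 1, so:
    // acceptsReplicatedState(0, 1) == false  -> B's "add 2, clock 0.300" is rejected
    // acceptsReplicatedState(1, 1) == true   -> B's "add 3, clock 1.400" is applied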



generally I think the scheme is complete if we add the following rule for
inter-leader replication:

each leader keeps a variable in memory (and also persists it to the sstable
when flushing), expected_epoch, initially set to 0

node P does:

on receiving an update from node Q:
      if Q.expected_epoch > P.expected_epoch
            /** an epoch bump inherently means a previous delete, which we
                probably missed, so we need to apply the delete; a delete is
                global to all leaders, so apply it on all my replicas **/
            for all leaders in my vector
                  count = nil
            P.expected_epoch = Q.expected_epoch
            /** note the bump makes the next check match, so Q's update is
                applied as well **/
      if Q.expected_epoch == P.expected_epoch
            update P's copy of Q according to the standard rules
      /** if Q.expected_epoch < P.expected_epoch, that means Q is less
          up to date than us, so just ignore the update **/

replicate_on_write(to Q):
      if P.operation == delete
            P.expected_epoch++
            set all my copies of all leaders to nil
      send (P.total, P.expected_epoch) to Q
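

If it helps, here is a rough Java rendering of the pseudocode above, just to
pin down the control flow; LeaderState, ReplicatedUpdate and the field names
are invented for illustration (one running total per leader), not taken from
the Cassandra code base:

    import java.util.HashMap;
    import java.util.Map;

    class LeaderState {
        long expectedEpoch = 0;                            // persisted on sstable flush
        final Map<String, Long> copies = new HashMap<>();  // leader name -> count, null == nil

        static class ReplicatedUpdate {
            final String fromLeader; final Long total; final long expectedEpoch;
            ReplicatedUpdate(String fromLeader, Long total, long expectedEpoch) {
                this.fromLeader = fromLeader; this.total = total; this.expectedEpoch = expectedEpoch;
            }
        }

        // "on receiving an update from node Q"
        void onReceive(ReplicatedUpdate u) {
            if (u.expectedEpoch > expectedEpoch) {
                // an epoch bump implies a delete we missed; a delete is global
                // to all leaders, so wipe every copy before catching up
                copies.replaceAll((leader, count) -> null);
                expectedEpoch = u.expectedEpoch;
            }
            if (u.expectedEpoch == expectedEpoch) {
                // standard reconcile of Q's copy (simplified here to "take the new total")
                copies.put(u.fromLeader, u.total);
            }
            // u.expectedEpoch < expectedEpoch: Q is behind us, ignore
        }

        // "replicate_on_write(to Q)": what this leader sends after a local write
        ReplicatedUpdate replicateOnWrite(String myName, Long myTotal, boolean isDelete) {
            if (isDelete) {
                expectedEpoch++;                           // start the next incarnation
                copies.replaceAll((leader, count) -> null);
                myTotal = null;
            }
            return new ReplicatedUpdate(myName, myTotal, expectedEpoch);
        }
    }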




overall I don't think the fact that delete is not commutative is a
fundamental blocker: regular columns are not commutative either, yet we get
a stable result no matter what order they are applied in, because of the
ordering rule used in reconciliation; here we just need to find a similar
ordering rule, and the epoch thing could be a step in that direction.
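
For instance, a minimal sketch of that ordering, treating the epoch as the
high-order part of the clock as described in my earlier mail quoted below;
the EpochClock name is invented for illustration:

    // ordering rule: epoch1.clock1 > epoch2.clock2 iff epoch1 > epoch2,
    // or epoch1 == epoch2 && clock1 > clock2
    class EpochClock implements Comparable<EpochClock> {
        final long epoch, clock;
        EpochClock(long epoch, long clock) { this.epoch = epoch; this.clock = clock; }

        public int compareTo(EpochClock other) {
            int byEpoch = Long.compare(epoch, other.epoch);
            return byEpoch != 0 ? byEpoch : Long.compare(clock, other.clock);
        }
    }

    // e.g. the epoch-1 write at clock 400 trumps the epoch-0 write at clock 300:
    // new EpochClock(1, 400).compareTo(new EpochClock(0, 300)) > 0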


Thanks
Yang




On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> I don't think that's bulletproof either.  For instance, what if the
> two adds go to replica 1 but the delete to replica 2?
>
> Bottom line (and this was discussed on the original
> delete-for-counters ticket,
> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> are not fully commutative which makes them fragile.
>
> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> > indicates, the problem with counter delete is  in scenarios like the
> > following:
> > add 1, clock 100
> > delete , clock 200
> > add  2 , clock 300
> > if the 1st and 3rd operations are merged in SStable compaction, then we
> > have
> > delete  clock 200
> > add 3,  clock 300
> > which shows wrong result.
> >
> > I think a relatively simple extension can be used to complete fix this
> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
> clock,
> > so that
> >    1) a delete operation increases future epoch number by 1
> >    2) merging of delta adds can be between only deltas of the same epoch,
> > deltas of older epoch are simply ignored during merging. merged result
> keeps
> > the epoch number of the newest seen.
> > other operations remain the same as current. note that the above 2 rules
> are
> > only concerned with merging within the deltas on the leader, and not
> related
> > to the replicated count, which is a simple final state, and observes the
> > rule of "larger clock trumps". naturally the ordering rule is:
> epoch1.clock1
> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
> clock2
> > intuitively "epoch" can be seen as the serial number on a new
> "incarnation"
> > of a counter.
> >
> > code change should be mostly localized to CounterColumn.reconcile(),
> >  although, if an update does not find existing entry in memtable, we need
> to
> > go to sstable to fetch any possible epoch number, so
> > compared to current write path, in the "no replicate-on-write" case, we
> need
> > to add a read to sstable. but in the "replicate-on-write" case, we
> already
> > read that, so it's no extra time cost.  "no replicate-on-write" is not a
> > very useful setup in reality anyway.
> >
> > does this sound a feasible way?   if this works, expiring counter should
> > also naturally work.
> >
> > Thanks
> > Yang
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
