Patch is at https://issues.apache.org/jira/browse/CASSANDRA-2774
some coding is messy and only intended for demonstration; we could refine it
after we agree this is a feasible way to go.

Thanks
Yang

On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:
> Who assigns those epoch numbers?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition-tolerant distributed
> system?
>
> I may have missed some parts of your proposal, but let me consider a
> scenario that we have to be able to handle: consider two nodes A and B
> (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose
> you do a +2 increment on c that both nodes get. Now let's say you have a
> network split and the connection between your two data centers fails. In
> DCA you delete c; only A gets it. In DCB you do more increments on c (say
> +3); only B gets it. The partition can last for hours. For deletion to
> work, we would need that whenever the network partition is resolved, both
> nodes eventually agree on the value 3 (i.e., only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> > simple and intuitive:
> > my proposed way of counter update can simply be seen as a combination
> > of regular columns + the current counter columns:
> >     regular column : [ value: "wipes out every bucket to nil", clock: epoch number ]
> > then within each epoch, counter updates work as currently implemented
> >
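
For illustration, here is a minimal sketch of the "regular column + counter
column" model quoted just above, with invented names (this is not the
CASSANDRA-2774 patch): the epoch plays the role of the regular column's
delete marker and wipes every bucket to nil, and within a single epoch the
per-leader buckets merge roughly the way counter shards do today (simplified
here to "larger partial count wins" per leader).

    // Hypothetical illustration only, not the CASSANDRA-2774 patch.
    import java.util.HashMap;
    import java.util.Map;

    class EpochScopedCounter {
        long epoch = 0;                                     // bumped by each delete
        final Map<String, Long> buckets = new HashMap<>();  // leader -> partial count

        void add(String leader, long delta) {
            buckets.merge(leader, delta, Long::sum);        // normal counter add
        }

        void delete() {
            epoch++;                                        // new incarnation of the counter
            buckets.clear();                                // wipe every bucket to nil
        }

        void mergeFrom(EpochScopedCounter other) {
            if (other.epoch > epoch) {                      // their delete is newer: apply it
                epoch = other.epoch;
                buckets.clear();
            }
            if (other.epoch == epoch) {                     // same epoch: merge as usual
                // Simplified: real counter shards carry clocks and "larger clock
                // trumps"; here we just keep the larger partial count per leader.
                other.buckets.forEach((leader, c) -> buckets.merge(leader, c, Long::max));
            }
            // other.epoch < epoch: a stale incarnation, ignored entirely
        }

        long total() {
            return buckets.values().stream().mapToLong(Long::longValue).sum();
        }
    }
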
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
> >> I think this approach also works for your scenario:
> >> I thought the issue was only concerned with merging within the same
> >> leader, but you pointed out that a similar merging happens between
> >> leaders too; now I see that the same rules on epoch numbers also apply
> >> to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with an epoch of 0 (they should be the same; if not, it
> >> also works, we just consider them to be representing different time
> >> snapshots of the same counter state)
> >>     node A add 1     clock: 0.100  (epoch = 0, clock number = 100)
> >>     node A delete    clock: 0.200
> >>     node B add 2     clock: 0.300
> >>     node A gets B's state (add 2, clock 0.300) but rejects it, because
> >>         A has already produced a delete with epoch 0, so A considers
> >>         epoch 0 already ended and won't accept any replicated state
> >>         with epoch < 1.
> >>     node B gets A's delete 0.200: it zeros its own count of "2" and
> >>         updates its future expected epoch to 1.
> >>
> >> at this time, the state of the system is:
> >>     node A: expected epoch = 1, [A:nil] [B:nil]
> >>     same for node B
> >>
> >> let's say we have the following further writes:
> >>     node B add 3     clock 1.400
> >>     node A add 4     clock 1.500
> >>     node B receives A's add 4, node B updates its copy of A
> >>     node A receives B's add 3, updates its copy of B
> >>
> >> then the state is:
> >>     node A: expected epoch == 1, [A:4 clock=500] [B:3 clock=400]
> >>     node B: same
> >>
> >> generally I think it should be complete if we add the following rule
> >> for inter-leader replication: each leader keeps a var in memory (and
> >> also persists it to sstable when flushing), expected_epoch, initially
> >> set to 0.
> >>
> >> node P does:
> >>     on receiving updates from node Q:
> >>         if Q.expected_epoch > P.expected_epoch
> >>             /** an epoch bump inherently means a previous delete, which
> >>                 we probably missed, so we need to apply the delete;
> >>                 a delete is global to all leaders, so apply it on all
> >>                 my replicas **/
> >>             for all leaders in my vector
> >>                 count = nil
> >>             P.expected_epoch = Q.expected_epoch
> >>         if Q.expected_epoch == P.expected_epoch
> >>             update P's copy of Q according to standard rules
> >>         /** if Q.expected_epoch < P.expected_epoch, that means Q is
> >>             less up to date than us, just ignore **/
> >>
> >>     replicate_on_write(to Q):
> >>         if P.operation == delete
> >>             P.expected_epoch++
> >>             set all my copies of all leaders to nil
> >>         send to Q (P.total, P.expected_epoch)
> >>
> >> overall I don't think deletes not being commutative is a fundamental
> >> blocker: regular columns are also not commutative, yet we achieve a
> >> stable result no matter what order they are applied in, because of the
> >> ordering rule used in reconciliation; here we just need to find a
> >> similar ordering rule. the epoch thing could be a step in this
> >> direction.
> >>
> >> Thanks
> >> Yang
> >>
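
To make the expected_epoch rule above concrete, here is a self-contained toy
with invented names; it simplifies the real counter design (whole per-leader
state is exchanged instead of the separate delta / replicated-count paths) and
just replays the A/B walkthrough quoted above.

    // Illustration only: toy of the expected_epoch rule, not real Cassandra code.
    import java.util.HashMap;
    import java.util.Map;

    public class EpochReplicationDemo {
        static class Node {
            final String name;
            long expectedEpoch = 0;                              // bumped by every delete
            final Map<String, long[]> copies = new HashMap<>();  // leader -> {clock, count}

            Node(String name) { this.name = name; }

            void add(long clock, long delta) {
                long[] mine = copies.computeIfAbsent(name, k -> new long[]{0, 0});
                mine[0] = clock;
                mine[1] += delta;
            }

            void delete() {
                expectedEpoch++;                             // the delete ends the current epoch
                copies.clear();                              // set all copies of all leaders to nil
            }

            void receiveFrom(Node q) {
                if (q.expectedEpoch > expectedEpoch) {       // we missed a delete: apply it
                    expectedEpoch = q.expectedEpoch;
                    copies.clear();
                }
                if (q.expectedEpoch == expectedEpoch) {      // same epoch: larger clock trumps
                    q.copies.forEach((leader, theirs) -> {
                        long[] ours = copies.get(leader);
                        if (ours == null || theirs[0] > ours[0])
                            copies.put(leader, theirs.clone());
                    });
                }
                // q.expectedEpoch < expectedEpoch: q is behind us, ignore its state
            }

            long total() {
                return copies.values().stream().mapToLong(c -> c[1]).sum();
            }
        }

        public static void main(String[] args) {
            Node a = new Node("A"), b = new Node("B");
            a.add(100, 1);       // node A: add 1, clock 0.100
            a.delete();          // node A: delete at 0.200, modeled as the epoch bump to 1
            b.add(300, 2);       // node B: add 2, clock 0.300 (B is still in epoch 0)
            a.receiveFrom(b);    // rejected: B's state is from the ended epoch 0
            b.receiveFrom(a);    // B zeros its count of 2 and moves to epoch 1
            b.add(400, 3);       // node B: add 3, clock 1.400
            a.add(500, 4);       // node A: add 4, clock 1.500
            a.receiveFrom(b);
            b.receiveFrom(a);
            System.out.println(a.total() + " == " + b.total());   // both converge on 7
        }
    }

Running it prints "7 == 7", matching the [A:4] [B:3] state both nodes reach at
the end of the walkthrough.
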
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>> I don't think that's bulletproof either. For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative, which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
> >>> > the problem with counter deletes is in scenarios like the following:
> >>> >     add 1,   clock 100
> >>> >     delete,  clock 200
> >>> >     add 2,   clock 300
> >>> > if the 1st and 3rd operations are merged in SSTable compaction, then
> >>> > we have
> >>> >     delete,  clock 200
> >>> >     add 3,   clock 300
> >>> > which shows the wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to completely fix
> >>> > this issue: similar to ZooKeeper, we can prefix an "Epoch" number to
> >>> > the clock, so that
> >>> > 1) a delete operation increases the future epoch number by 1
> >>> > 2) merging of delta adds can happen only between deltas of the same
> >>> >    epoch; deltas of an older epoch are simply ignored during
> >>> >    merging, and the merged result keeps the newest epoch number seen.
> >>> > other operations remain the same as current.
> >>> > note that the above 2 rules are only concerned with merging the
> >>> > deltas on the leader, and not with the replicated count, which is a
> >>> > simple final state and observes the rule of "larger clock trumps".
> >>> > naturally the ordering rule is:
> >>> >     epoch1.clock1 > epoch2.clock2 iff
> >>> >         epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
> >>> > intuitively, "epoch" can be seen as the serial number of a new
> >>> > "incarnation" of a counter.
> >>> >
> >>> > code changes should be mostly localized to CounterColumn.reconcile(),
> >>> > although, if an update does not find an existing entry in the
> >>> > memtable, we need to go to the sstable to fetch any possible epoch
> >>> > number, so compared to the current write path, in the "no
> >>> > replicate-on-write" case we need to add a read of the sstable. but
> >>> > in the "replicate-on-write" case we already read that, so there is
> >>> > no extra time cost. "no replicate-on-write" is not a very useful
> >>> > setup in reality anyway.
> >>> >
> >>> > does this sound like a feasible way? if this works, expiring
> >>> > counters should also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
>
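
For reference, here is a hypothetical sketch of the two merging rules and the
epoch.clock ordering from Yang's original proposal above (invented types, not
the actual CounterColumn.reconcile() change), replaying the add 1 / delete /
add 2 compaction example.

    // Illustration only: epoch-prefixed clocks for counter deltas.
    public class EpochClockDemo {
        record Delta(long epoch, long clock, long value) {}

        // Ordering rule: epoch1.clock1 > epoch2.clock2 iff
        //   epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static int compare(Delta a, Delta b) {
            if (a.epoch() != b.epoch())
                return Long.compare(a.epoch(), b.epoch());
            return Long.compare(a.clock(), b.clock());
        }

        // Merging two deltas during compaction.
        static Delta merge(Delta a, Delta b) {
            if (a.epoch() == b.epoch())               // same epoch: sum, keep newest clock
                return new Delta(a.epoch(), Math.max(a.clock(), b.clock()),
                                 a.value() + b.value());
            return compare(a, b) > 0 ? a : b;         // older-epoch delta is simply ignored
        }

        public static void main(String[] args) {
            // "add 1, clock 100" lands in epoch 0; the delete at clock 200 bumps
            // the epoch, so "add 2, clock 300" lands in epoch 1.
            Delta add1 = new Delta(0, 100, 1);
            Delta add2 = new Delta(1, 300, 2);

            // Compacting the 1st and 3rd operations no longer produces "add 3":
            // the epoch-0 delta is dropped and the result stays at value 2.
            System.out.println(merge(add1, add2));    // Delta[epoch=1, clock=300, value=2]
        }
    }

With the epoch prefix, merging the two adds during compaction yields value 2
in epoch 1 rather than the spurious "add 3, clock 300" from the problem
scenario, so the delete at clock 200 is no longer swallowed.
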