Who assigns those epoch numbers? You need all nodes to agree on the epoch number somehow for this to work, but then how do you maintain that agreement in a partition-tolerant distributed system?
I may have missed some parts of your proposal, but let me consider a scenario that we have to be able to handle: take two nodes A and B (RF=2), each in one data center (DCA and DCB), and a counter c. Suppose you do a +2 increment on c that both nodes get. Now let's say you have a network split and the connection between your two data centers fails. In DCA you delete c; only A gets it. In DCB you do more increments on c (say +3); only B gets it. The partition can last for hours. For deletion to work, we would need that whenever the network partition is resolved, both nodes eventually agree on the value 3 (i.e., only the second increment). I don't see how you could assign epoch numbers or anything else to fix that.

--
Sylvain

On Mon, Jun 13, 2011 at 8:26 PM, Yang <teddyyyy...@gmail.com> wrote:
> ok, I think it's better to understand it this way, then it is really
> simple and intuitive:
> my proposed way of counter update can simply be seen as a combination of
> regular columns + current counter columns:
> regular column: [ value: "wipes out every bucket to nil", clock: epoch
> number ]
> then within each epoch, counter updates work as currently implemented
>
>
> On Mon, Jun 13, 2011 at 10:12 AM, Yang <teddyyyy...@gmail.com> wrote:
>>
>> I think this approach also works for your scenario:
>> I thought that the issue was only concerned with merging within the same
>> leader, but you pointed out that a similar merging happens between
>> leaders too. Now I see that the same rules on epoch numbers also apply
>> to inter-leader data merging, specifically in your case:
>>
>> everyone starts with an epoch of 0 (they should be the same; if not, it
>> also works, we just consider them to be representing different time
>> snapshots of the same counter state)
>> node A add 1, clock: 0.100 (epoch = 0, clock number = 100)
>> node A delete, clock: 0.200
>> node B add 2, clock: 0.300
>> node A gets B's state: add 2, clock 0.300, but rejects it because A has
>> already produced a delete with epoch 0, so A considers epoch 0 already
>> ended; it won't accept any replicated state with epoch < 1.
>> node B gets A's delete 0.200, zeros its own count of "2", and updates
>> its future expected epoch to 1.
>> at this time, the state of the system is:
>> node A: expected epoch = 1, [A:nil] [B:nil]
>> same for node B
>>
>>
>> let's say we have the following further writes:
>> node B adds 3, clock 1.400
>> node A adds 4, clock 1.500
>> node B receives A's add 4 and updates its copy of A
>> node A receives B's add 3 and updates its copy of B
>>
>> then the state is:
>> node A: expected epoch == 1, [A:4 clock=500] [B:3 clock=400]
>> node B: same
>>
>>
>> generally I think it should be complete if we add the following rule for
>> inter-leader replication:
>> each leader keeps a var in memory (and also persists it to sstable when
>> flushing), expected_epoch, initially set to 0
>>
>> node P does:
>>     on receiving updates from node Q:
>>         if Q.expected_epoch > P.expected_epoch:
>>             /** an epoch bump inherently means a previous delete, which
>>                 we probably missed, so we need to apply the delete.
>>                 a delete is global to all leaders, so apply it on all my
>>                 replicas **/
>>             for all leaders in my vector:
>>                 count = nil
>>             P.expected_epoch = Q.expected_epoch
>>         if Q.expected_epoch == P.expected_epoch:
>>             update P's copy of Q according to standard rules
>>         /** if Q.expected_epoch < P.expected_epoch, that means Q is less
>>             up to date than us, just ignore **/
>>
>>     replicate_on_write(to Q):
>>         if P.operation == delete:
>>             P.expected_epoch++
>>             set all my copies of all leaders to nil
>>         send to Q (P.total, P.expected_epoch)
>>
>>
>> overall I don't think deletes being non-commutative is a fundamental
>> blocker: regular columns are also not commutative, yet we achieve a
>> stable result no matter what order they are applied in, because of the
>> ordering rule used in reconciliation; here we just need to find a
>> similar ordering rule. the epoch thing could be a step in this
>> direction.
>>
>> Thanks
>> Yang
>>
>>
>>
>> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> I don't think that's bulletproof either. For instance, what if the
>>> two adds go to replica 1 but the delete to replica 2?
>>>
>>> Bottom line (and this was discussed on the original
>>> delete-for-counters ticket,
>>> https://issues.apache.org/jira/browse/CASSANDRA-2101): counter deletes
>>> are not fully commutative, which makes them fragile.
>>>
>>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <teddyyyy...@gmail.com> wrote:
>>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101 indicates,
>>> > the problem with counter deletes is in scenarios like the following:
>>> > add 1, clock 100
>>> > delete, clock 200
>>> > add 2, clock 300
>>> > if the 1st and 3rd operations are merged in SSTable compaction, then
>>> > we have
>>> > delete, clock 200
>>> > add 3, clock 300
>>> > which shows a wrong result.
>>> >
>>> > I think a relatively simple extension can be used to completely fix
>>> > this issue: similar to ZooKeeper, we can prefix an "epoch" number to
>>> > the clock, so that
>>> > 1) a delete operation increases the future epoch number by 1
>>> > 2) merging of delta adds can only happen between deltas of the same
>>> > epoch; deltas of an older epoch are simply ignored during merging,
>>> > and the merged result keeps the newest epoch number seen.
>>> > other operations remain the same as currently. note that the above 2
>>> > rules are only concerned with merging of the deltas on the leader,
>>> > and not with the replicated count, which is a simple final state and
>>> > observes the rule of "larger clock trumps".
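To make rules 1) and 2) above concrete, here is a minimal sketch, in Java, of how delta merging with an epoch-prefixed clock could look. The Delta class and merge() method are hypothetical names for illustration only; this is not Cassandra's actual CounterColumn or compaction code, just a sketch of the proposal being discussed.

    // Hypothetical sketch of epoch-prefixed delta merging (illustration only,
    // not Cassandra code). Rule 1: a delete starts a new epoch. Rule 2: only
    // deltas of the same epoch are summed; a delta from an older epoch is
    // dropped, and the merged result keeps the newest epoch seen.
    final class Delta {
        final long epoch;
        final long clock;
        final long count;   // increment amount, summed within one epoch

        Delta(long epoch, long clock, long count) {
            this.epoch = epoch;
            this.clock = clock;
            this.count = count;
        }

        // Merge two deltas of the same counter, as compaction would do.
        static Delta merge(Delta a, Delta b) {
            if (a.epoch == b.epoch)
                return new Delta(a.epoch, Math.max(a.clock, b.clock), a.count + b.count);
            // Different epochs: ignore the delta from the older epoch.
            return a.epoch > b.epoch ? a : b;
        }
    }

Applied to the example at the start of Yang's original message (add 1 at epoch 0 clock 100, delete at clock 200 bumping the epoch, add 2 at epoch 1 clock 300), merging the two adds would keep only the epoch-1 delta, so compaction could no longer let the pre-delete increment reappear.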
>>> > naturally the ordering rule is: epoch1.clock1 > epoch2.clock2 iff
>>> > epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
>>> > intuitively, the "epoch" can be seen as the serial number of a new
>>> > "incarnation" of the counter.
>>> >
>>> > the code change should be mostly localized to CounterColumn.reconcile(),
>>> > although, if an update does not find an existing entry in the memtable,
>>> > we need to go to the sstable to fetch any possible epoch number. so,
>>> > compared to the current write path, in the "no replicate-on-write" case
>>> > we need to add a read of the sstable; in the "replicate-on-write" case
>>> > we already do that read, so there is no extra time cost. "no
>>> > replicate-on-write" is not a very useful setup in reality anyway.
>>> >
>>> > does this sound like a feasible way? if this works, expiring counters
>>> > should also naturally work.
>>> >
>>> > Thanks
>>> > Yang
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
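For reference, the ordering rule Yang states above (compare epochs first, then clocks) might be sketched as below. EpochClock and its comparator are made-up names for illustration; the real change would live in CounterColumn.reconcile(), and this sketch does not claim to reflect that code.

    // Hypothetical sketch of the "epoch1.clock1 > epoch2.clock2" ordering rule
    // (illustration only, not the actual CounterColumn.reconcile() code).
    import java.util.Comparator;

    final class EpochClock {
        final long epoch;   // bumped by a delete: a new "incarnation" of the counter
        final long clock;   // ordinary counter clock within that epoch

        EpochClock(long epoch, long clock) {
            this.epoch = epoch;
            this.clock = clock;
        }

        // epoch1.clock1 > epoch2.clock2 iff
        //     epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2)
        static final Comparator<EpochClock> ORDER =
            Comparator.comparingLong((EpochClock ec) -> ec.epoch)
                      .thenComparingLong(ec -> ec.clock);
    }

With this ordering, a value written after a delete (higher epoch) wins over anything from before the delete even if the older value carried a larger clock, e.g. EpochClock.ORDER.compare(new EpochClock(1, 400), new EpochClock(0, 999)) > 0.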