> 1. It seems that for example when RF=3, each one of the three base replicas 
> will send a view update to the fourth "pending node". While this is not 
> wrong, it's also inefficient - why send three copies of the same update? 
> Wouldn't it be more efficient that just one of the base replicas - the one 
> which eventually will be paired with the pending node - should send the 
> updates to it? Is there a problem with such a scheme?

This optimization can be done when there's a single pending range per
view replica set, but when there are multiple pending ranges and there
are failures, it's possible that the paired view replica changes, which
can lead to missing updates. For instance, consider the following
scenario (illustrated by the toy sketch after the list):
- There are two pending ranges, A' and B'.
- Base replica A sends the update to its pending-paired view replica A'.
- Base replica B is down, so pending-paired view replica B' does not get the update.
- Range movement A' fails and B' succeeds.
- B' becomes A's new paired view replica.
- A is now out of sync with B'.
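
To make that concrete, here is a toy sketch of the failure mode. This is
plain illustration code, not Cassandra code: the class name and the
"baseA"/"A'"/"B'" labels are invented for the example.

    import java.util.HashMap;
    import java.util.Map;

    public class PendingPairingScenario
    {
        public static void main(String[] args)
        {
            // Under the proposed optimization each base replica sends the
            // view update only to "its" pending-paired view replica.
            Map<String, String> pendingPairs = new HashMap<>();
            pendingPairs.put("baseA", "A'");
            pendingPairs.put("baseB", "B'");
            System.out.println("initial pending pairing: " + pendingPairs);

            // Which pending view replicas actually received the update:
            // base A is up and sends to A'; base B is down, so B' gets nothing.
            Map<String, Boolean> received = new HashMap<>();
            received.put("A'", true);
            received.put("B'", false);

            // Range movement A' fails while B' succeeds, so B' becomes base
            // A's new paired view replica, even though only the (dead) base
            // B was ever supposed to send it the update.
            String newPairForBaseA = "B'";
            if (!received.get(newPairForBaseA))
                System.out.println("base A is out of sync with " + newPairForBaseA);
        }
    }

With the current behavior, where every base replica writes to the pending
node, this cannot happen, since any live base replica would have delivered
the update to B'.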

Furthermore, we would need to cache the ring state after the range
movement is completed to be able to compute the pending-paired view
replica, and we don't have this information easily available currently.
So it would not be a trivial change, but it is perhaps worth pursuing
for the single pending range case.

> 2. There's an optimization that when we're lucky enough that the paired view 
> replica is the same as this base replica, mutateMV doesn't use the normal 
> view-mutation-sending code (wrapViewBatchResponseHandler) and just writes the 
> mutation locally. In particular, in this case we do NOT write to the pending 
> node (unless I'm missing something). But, sometimes all replicas will be 
> paired with themselves - this can happen for example when number of nodes is 
> equal to RF, or when the base and view table have the same partition keys 
> (but different clustering keys). In this case, it seems the pending node will 
> not be written at all... Isn't this a bug?

Good catch! This indeed seems to be a regression caused by
CASSANDRA-13069, so I created CASSANDRA-14251 to restore the correct
behavior.

> Being paired with yourself is not only a "trick", but also
> something which really happens (by chance or in some cases as I showed
> above, always), and needs to be handled correctly, even if the cluster
> grows. If none of the base replicas will send the view update to the
> pending node, it will end up missing this update...

Exactly, I only considered the case where the local address was used
as a marker to indicate there was no paired endpoint, and missed the
more important case where the local node is the paired endpoint and
there is a pending endpoint. Unfortunately this oversight was not
caught by any tests, so I will add one in CASSANDRA-14251.
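
For reference, here is a minimal sketch of the condition I have in mind
for the fix. This is not the actual StorageProxy.mutateMV code; the class,
method and parameter names are made up for illustration, and only the
check itself follows the discussion above, i.e. restoring the
pre-CASSANDRA-13069 pendingEndpoints.isEmpty() guard on the local-write
shortcut:

    import java.net.InetAddress;
    import java.util.Collection;

    // Illustration only: a sketch of the decision discussed for
    // CASSANDRA-14251, not the real code.
    final class LocalViewWriteSketch
    {
        // Take the local-write shortcut (skip the batchlog-backed
        // wrapViewBatchResponseHandler path) only when this base replica
        // is its own paired view replica AND there is no pending endpoint
        // that would otherwise miss the view update.
        static boolean canApplyViewMutationLocally(InetAddress pairedEndpoint,
                                                   InetAddress localAddress,
                                                   Collection<InetAddress> pendingEndpoints)
        {
            return pairedEndpoint.equals(localAddress) && pendingEndpoints.isEmpty();
        }
    }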

2018-02-21 11:26 GMT-03:00 Nadav Har'El <n...@scylladb.com>:
> Hi, I was trying to understand how view tables are updated during a period
> of range movements, namely bootstrapping of a new node or decommissioning
> one of the nodes. In particular, during the period of data streaming, we can
> have a new replica on a "pending node" to which we also need to send the
> view update.
>
> I looked at the mutateMV() code, and think I spotted two issues with it, and
> I wonder if I'm missing something or these are real problems:
>
> 1. It seems that for example when RF=3, each one of the three base replicas
> will send a view update to the fourth "pending node". While this is not
> wrong, it's also inefficient - why send three copies of the same update?
> Wouldn't it be more efficient that just one of the base replicas - the one
> which eventually will be paired with the pending node - should send the
> updates to it? Is there a problem with such a scheme?
>
> 2. There's an optimization that when we're lucky enough that the paired view
> replica is the same as this base replica, mutateMV doesn't use the normal
> view-mutation-sending code (wrapViewBatchResponseHandler) and just writes
> the mutation locally. In particular, in this case we do NOT write to the
> pending node (unless I'm missing something). But, sometimes all replicas
> will be paired with themselves - this can happen for example when number of
> nodes is equal to RF, or when the base and view table have the same
> partition keys (but different clustering keys). In this case, it seems the
> pending node will not be written at all... Isn't this a bug?
>
> The strange thing about issue 2 is that this code used to be correct (at
> least according to my understanding...) - it used to avoid this optimization
> if pendingNodes was not empty. But then this was changed in commit
> 12103653f31. Why?
> https://issues.apache.org/jira/browse/CASSANDRA-13069 contains an
> explanation to that change:
>      "I also removed the pendingEndpoints.isEmpty() condition to skip the
> batchlog for local mutations, since this was a pre-CASSANDRA-10674 leftover
> when ViewUtils.getViewNaturalEndpoint returned the local address to force
> non-paired replicas to be written to the batchlog." (Paulo Motta, 21/Dec/16)
>
> But I don't understand this explanation... Being paired with yourself is not
> only a "trick", but also something which really happens (by chance or in
> some cases as I showed above, always), and needs to be handled correctly,
> even if the cluster grows. If none of the base replicas will send the view
> update to the pending node, it will end up missing this update...
>
> Thanks,
> Nadav.
>
>
> --
> Nadav Har'El
> n...@scylladb.com
