> 1. It seems that for example when RF=3, each one of the three base replicas
> will send a view update to the fourth "pending node". While this is not
> wrong, it's also inefficient - why send three copies of the same update?
> Wouldn't it be more efficient that just one of the base replicas - the one
> which eventually will be paired with the pending node - should send the
> updates to it? Is there a problem with such a scheme?

This optimization can be done when there's a single pending range per view replica set, but when there are multiple pending ranges and there are failures, it's possible that the paired view replica changes, which can lead to missing updates. For instance, see the following scenario (a toy sketch of it follows below):

- There are 2 pending ranges A' and B'.
- Base replica A sends the update to its pending-paired view replica A'.
- Base replica B is down, so pending-paired view replica B' does not get the update.
- Range movement A' fails and B' succeeds.
- B' becomes A's new paired view replica.
- A will be out of sync with B'.

Furthermore, we would need to cache the ring state after the range movement is completed to be able to compute the pending-paired view replica, but we don't have this info easily available currently. So it seems that it would not be a trivial change, but perhaps worth pursuing in the single pending range case.
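To make the failure sequence concrete, here is a small self-contained Java toy model of that scenario. It is not Cassandra code (the class name, the update names and the booleans are all invented for illustration); it only mimics the bookkeeping of who received which view update:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy model of the scenario above; NOT Cassandra code, all names are
    // illustrative only.
    public class PendingPairedScenario
    {
        public static void main(String[] args)
        {
            // View updates received by each pending view replica.
            Map<String, Set<String>> received = new HashMap<>();
            received.put("A'", new HashSet<>());
            received.put("B'", new HashSet<>());

            boolean baseBIsUp = false; // base replica B is down

            // Hypothetical optimization: each base replica sends only to its
            // pending-paired view replica instead of to every pending endpoint.
            received.get("A'").add("update-1");     // A -> A'
            if (baseBIsUp)
                received.get("B'").add("update-1"); // B -> B' never happens

            // Range movement A' fails and B' succeeds, so B' becomes A's new
            // paired view replica, but nobody ever sent it update-1.
            boolean missing = !received.get("B'").contains("update-1");
            System.out.println("B' missing update-1: " + missing); // prints true
        }
    }

With the current fan-out (every base replica also writes to every pending endpoint), B' would have received update-1 from A despite B being down, which is why the redundant copies are wasteful but safe.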
> 2. There's an optimization that when we're lucky enough that the paired view
> replica is the same as this base replica, mutateMV doesn't use the normal
> view-mutation-sending code (wrapViewBatchResponseHandler) and just writes the
> mutation locally. In particular, in this case we do NOT write to the pending
> node (unless I'm missing something). But, sometimes all replicas will be
> paired with themselves - this can happen for example when number of nodes is
> equal to RF, or when the base and view table have the same partition keys
> (but different clustering keys). In this case, it seems the pending node will
> not be written at all... Isn't this a bug?

Good catch! This indeed seems to be a regression caused by CASSANDRA-13069, so I created CASSANDRA-14251 to restore the correct behavior.

bq. Being paired with yourself is not only a "trick", but also something which really happens (by chance or in some cases as I showed above, always), and needs to be handled correctly, even if the cluster grows. If none of the base replicas will send the view update to the pending node, it will end up missing this update...

Exactly. I only considered the case where the local address was used as a marker to indicate there was no paired endpoint, and brainfarted/missed the more important case where the local node is the paired endpoint and there is a pending endpoint. Unfortunately this oversight was not caught by any tests, so I will add one on CASSANDRA-14251.
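For reference, this is roughly the shape of the check being discussed: only take the local shortcut when there is no pending endpoint, which is what the pendingEndpoints.isEmpty() condition mentioned below used to ensure. This is a simplified sketch, not the actual StorageProxy.mutateMV() code; the class, method names and signatures are invented for illustration:

    import java.net.InetAddress;
    import java.util.Collection;

    // Rough sketch of the guard under discussion; NOT the actual
    // StorageProxy.mutateMV() code, all names and signatures are
    // illustrative only.
    final class ViewWriteSketch
    {
        static void applyViewUpdate(Object viewMutation,
                                    InetAddress pairedEndpoint,
                                    InetAddress localEndpoint,
                                    Collection<InetAddress> pendingEndpoints)
        {
            if (pairedEndpoint.equals(localEndpoint) && pendingEndpoints.isEmpty())
            {
                // Paired with ourselves and nothing is pending: safe to skip
                // the batch/remote path and just apply the mutation locally.
                applyLocally(viewMutation);
            }
            else
            {
                // Either paired with a remote replica, or paired with
                // ourselves while a pending endpoint exists: go through the
                // wrapped batch path so the pending view replica also
                // receives the update.
                writeThroughBatch(viewMutation, pairedEndpoint, pendingEndpoints);
            }
        }

        static void applyLocally(Object viewMutation) { /* elided */ }

        static void writeThroughBatch(Object viewMutation,
                                      InetAddress paired,
                                      Collection<InetAddress> pending) { /* elided */ }
    }

Whether the fix in CASSANDRA-14251 restores a condition like this or instead writes explicitly to the pending endpoints in the paired-with-self branch is an implementation detail; either way the pending view replica has to receive the update.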
2018-02-21 11:26 GMT-03:00 Nadav Har'El <n...@scylladb.com>:

> Hi, I was trying to understand how view tables are updated during a period
> of range movements, namely bootstrapping of a new node or decommissioning
> one of the nodes. In particular, during the period of data streaming, we can
> have a new replica on a "pending node" to which we also need to send the
> view update.
>
> I looked at the mutateMV() code, and think I spotted two issues with it, and
> I wonder if I'm missing something or these are real problems:
>
> 1. It seems that for example when RF=3, each one of the three base replicas
> will send a view update to the fourth "pending node". While this is not
> wrong, it's also inefficient - why send three copies of the same update?
> Wouldn't it be more efficient that just one of the base replicas - the one
> which eventually will be paired with the pending node - should send the
> updates to it? Is there a problem with such a scheme?
>
> 2. There's an optimization that when we're lucky enough that the paired view
> replica is the same as this base replica, mutateMV doesn't use the normal
> view-mutation-sending code (wrapViewBatchResponseHandler) and just writes
> the mutation locally. In particular, in this case we do NOT write to the
> pending node (unless I'm missing something). But, sometimes all replicas
> will be paired with themselves - this can happen for example when number of
> nodes is equal to RF, or when the base and view table have the same
> partition keys (but different clustering keys). In this case, it seems the
> pending node will not be written at all... Isn't this a bug?
>
> The strange thing about issue 2 is that this code used to be correct (at
> least according to my understanding...) - it used to avoid this optimization
> if pendingNodes was not empty. But then this was changed in commit
> 12103653f31. Why?
> https://issues.apache.org/jira/browse/CASSANDRA-13069 contains an
> explanation to that change:
> "I also removed the pendingEndpoints.isEmpty() condition to skip the
> batchlog for local mutations, since this was a pre-CASSANDRA-10674 leftover
> when ViewUtils.getViewNaturalEndpoint returned the local address to force
> non-paired replicas to be written to the batchlog." (Paulo Motta, 21/Dec/16)
>
> But I don't understand this explanation... Being paired with yourself is not
> only a "trick", but also something which really happens (by chance or in
> some cases as I showed above, always), and needs to be handled correctly,
> even if the cluster grows. If none of the base replicas will send the view
> update to the pending node, it will end up missing this update...
>
> Thanks,
> Nadav.
>
> --
> Nadav Har'El
> n...@scylladb.com