[ https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791801#comment-17791801 ]
Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 9:54 PM: ------------------------------------------------------------------------- If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then this (looking into trunk version of that) {code:java} int adjustedBlockFor = repairPlan.writeQuorum(); {code} will be set here (1). writeQuorum is computed in the constructor above, and that is computed in (2). It will firstly compute blockFor from the consistency level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we have LOCAL_QUORUM it will go to the case where it does {code:java} blockFor += pending.count(InOurDc.replicas()); {code} and since we are assuming that pending is not in local DC, it will add 0 to 2 which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here. Now this: {code:java} int adjustedBlockFor = repairPlan.writeQuorum(); for (Replica participant : repairPlan.contacts()) { if (!repairs.containsKey(participant) && shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; } this.blockFor = adjustedBlockFor; {code} This is a little bit confusing to go through but in order to decrement adjustedBlockFor (which is 2 at the beginning of this loop), any participant of the repairPlan (contacts) has to be in the local dc (in order to have shouldBlockOn returning true when it is InOurDc.endpoints()). repairPlan.contacts() in that loop will ultimately take these contacts from (3). In that selector, (4) {code:java} int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), liveAndDown.pending()) - contacts.size(); {code} blockForWrite will be called with strategy with RF = 3, LOCAL_QUORUM is 2, pending might be one, so it will do again "blockFor += pending.count(InOurDc.replicas());" which will be 2 += 0 = 2. In order to have add at least 1, when blockForWrite returns 2, contacts.size() has to be 1 (2 - 1 = 1). Then it goes through live replicas and it will choose some which was not contacted {*}and here it might pick remote replica{*}. so if we return to that "if" {code:java} if (!repairs.containsKey(participant) && shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; {code} participant can be remote, so "repairs.containsKey(participant)" is "false" (when repairs are all local), so "!repairs.containsKey(participant)" is true. Then the second clause, shouldBlock, will evaluate to false, because participant is not local and shouldBlockOn is InOurDc.endpoints. So true && false = false. So "adjustedBlockFor" will not be decremented. But I think that it should be, because we are not going to block for remote participants. If it is not decremented, then blockFor will be set to 2 but it should be set only to 1. If it is set to 2, then this {code:java} void ack(InetAddressAndPort from) { if (shouldBlockOn.test(from)) { pendingRepairs.remove(repairPlan.lookup(from)); latch.decrement(); } } {code} will decrement it just once, from 2 to 1 and it will never reach 0. So it seems to me that we should do this: {code:java} if (!repairs.containsKey(participant) && !shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; {code} Notice the negation for shouldBlockOn. We should not block on remotes. shouldBlockOn will return true when participant is local. In order to not block on remote participants, (!shouldNotBlockOn) should be true, that is only when shouldNotBlockOn is false. That is only when participant is remote one. Or maybe I am completely wrong? :) Would you guys be so nice to go through this and proof read it? (1) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329] (2) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184] (3) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577] (4) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572] was (Author: smiklosovic): If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then this (looking into trunk version of that) {code:java} int adjustedBlockFor = repairPlan.writeQuorum(); {code} will be set here (1). writeQuorum is computed in the constructor above, and that is computed in (2). It will firstly compute blockFor from the consistency level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we have LOCAL_QUORUM it will go to the case where it does {code:java} blockFor += pending.count(InOurDc.replicas()); {code} and since we are assuming that pending is not in local DC, it will add 0 to 2 which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here. Now this: {code:java} int adjustedBlockFor = repairPlan.writeQuorum(); for (Replica participant : repairPlan.contacts()) { if (!repairs.containsKey(participant) && shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; } this.blockFor = adjustedBlockFor; {code} This is a little bit confusing to go through but in order to decrement adjustedBlockFor (which is 2 at the beginning of this loop), any participant of the repairPlan (contacts) has to be in the local dc (in order to have shouldBlockOn returning true when it is InOurDc.endpoints()). repairPlan.contacts() in that loop will ultimately take these contacts from (3). In that selector, (4) {code:java} int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), liveAndDown.pending()) - contacts.size(); {code} blockForWrite will be called with strategy with RF = 3, LOCAL_QUORUM is 2, pending might be one, so it will do again "blockFor += pending.count(InOurDc.replicas());" which will be 2 += 0 = 2. In order to have add at least 1, when blockForWrite returns 2, contacts.size() has to be 1 (2 - 1 = 1). Then it goes through live replicas and it will choose some which was not contacted {*}and here it might pick remote replica{*}. so if we return to that "if" {code:java} if (!repairs.containsKey(participant) && shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; {code} participant can be remote, so "repairs.containsKey(participant)" is "false" (when repairs are all local), so "!repairs.containsKey(participant)" is true. Then the second clause, shouldBlock, will evaluate to false, because participant is not local and shouldBlockOn is InOurDc.endpoints. So true && false = false. So "adjustedBlockFor" will not be decremented. But I think that it should be, because we are not going to block for remote participants. If it is not decremented, then blockFor will be set to 2 but it should be set only to 1. If it is set to 2, then this {code:java} void ack(InetAddressAndPort from) { if (shouldBlockOn.test(from)) { pendingRepairs.remove(repairPlan.lookup(from)); latch.decrement(); } } {code} will increment it just once, from 2 to 1 and it will never reach 0. So it seems to me that we should do this: {code:java} if (!repairs.containsKey(participant) && !shouldBlockOn.test(participant.endpoint())) adjustedBlockFor--; {code} Notice the negation for shouldBlockOn. We should not block on remotes. shouldBlockOn will return true when participant is local. In order to not block on remote participants, (!shouldNotBlockOn) should be true, that is only when shouldNotBlockOn is false. That is only when participant is remote one. Or maybe I am completely wrong? :) Would you guys be so nice to go through this and proof read it? (1) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329] (2) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184] (3) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577] (4) [https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572] > local consistencies may get timeout if blocking read repair is sending the > read repair mutation to other DC > ------------------------------------------------------------------------------------------------------------ > > Key: CASSANDRA-19120 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19120 > Project: Cassandra > Issue Type: Bug > Reporter: Runtian Liu > Priority: Normal > Attachments: image-2023-11-29-15-26-08-056.png, signature.asc > > > For a two DCs cluster setup. When a new node is being added to DC1, for > blocking read repair triggered by local_quorum in DC1, it will require to > send read repair mutation to an extra node(1)(2). The selector for read > repair may select *ANY* node that has not been contacted before(3) instead of > selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% > timeout because of the bug described below: > When we initialized the latch(4) for blocking read repair, the shouldBlockOn > function will only return true for local nodes(5), the blockFor value will be > reduced if a local node doesn't require repair(6). The blockFor is same as > the number of read repair mutation sent out. But when the coordinator node > receives the response from the target nodes, the latch only count down for > nodes in same DC(7). The latch will wait till timeout and the read request > will timeout. > This can be reproduced if you have a constant load on a 3 + 3 cluster when > adding a node. If you have someway to trigger blocking read repair(maybe by > adding load using stress tool). If you use local_quorum consistency with a > constant read after write load in the same DC that you are adding node. You > will see read timeout issue from time to time because of the bug described > above > > I think for read repair when selecting the extra node to do repair, we should > prefer local nodes than the nodes from other region. Also, we need to fix the > latch part so even if we send mutation to the nodes in other DC, we don't get > a timeout. > (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455] > (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183] > (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458] > (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96] > (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71] > (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88] > (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113] > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org