Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/21520 )
Change subject: IMPALA-13159: Fix query cancellation caused by statestore failover
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: y defining a post-recovery grace period. During
            : the grace period, don't update the current cluster membership so th
> Define a post-recovery grace period after statestore has been disconnected
Thanks!
Follow-up question: after the grace period is over, will the statestore send a new cluster membership update to the coordinator, and will the coordinator then cancel the query because cluster membership has changed since the query started RUNNING?


http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc@1095
PS1, Line 1095:   bool has_disconnect_before = connection_failure_metric_->GetValue() > 0;
              :   bool in_disconnect_grace_period = MilliSecondsSinceLastRegistration()
              :       < FLAGS_statestore_subscriber_recovery_grace_period_ms;
> Renamed two variables.
Done


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653:     """Test that a momentary inconsistent cluster membership state after statestore
             :     service fail-over will not result in query cancellation. Also make sure that query
             :     get cancelled if a backend actually went down."""
Is there an existing opposite test that shows a query being cancelled after the grace period due to a cluster membership difference?


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688:     # Now kill a backend, and make sure the query fails.
> Finishing this query takes about 120 seconds on my local machine. It seems
In that case, is it better to cancel through the query handle instead?

  client.cancel(handle)

In this test, is the query cancelled because cluster membership changed due to the stopped backend, or just because row transmission was broken?



-- 
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Fri, 14 Jun 2024 20:58:54 +0000
Gerrit-HasComments: Yes
