[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

Riza Suminto (Code Review) Fri, 14 Jun 2024 14:03:05 -0700

Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/21520 )


Change subject: IMPALA-13159: Fix query cancellation caused by statestore failover
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: y defining a post-recovery grace period. During
            : the grace period, don't update the current cluster membership so th
> Define a post-recovery grace period after statestore has been disconnected
Thanks!
Follow-up question: once the grace period is over, will the statestore send a
new cluster membership update to the coordinator, and will the coordinator then
cancel the query because cluster membership has changed since the query entered
the RUNNING state?


http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc@1095
PS1, Line 1095: bool has_disconnect_before = connection_failure_metric_->GetValue() > 0;
              :   bool in_disconnect_grace_period = MilliSecondsSinceLastRegistration()
              :       < FLAGS_statestore_subscriber_recovery_grace_period_ms;
> Renamed two variables.
Done
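
For readers following the thread, the check quoted above can be modelled roughly as below. This is only a toy Python sketch, not the Impala code: the class, its method names, and the injected clock are hypothetical, and combining the two conditions with a logical AND is an assumption, since the quoted lines do not show how the two booleans are used.

```python
import time


class SubscriberModel:
    """Toy stand-in for the subscriber state behind the check (hypothetical)."""

    def __init__(self, grace_period_ms, clock=time.monotonic):
        # grace_period_ms plays the role of the recovery grace period flag.
        self.grace_period_ms = grace_period_ms
        self._clock = clock
        self.connection_failures = 0
        self._last_registration = clock()

    def record_disconnect(self):
        # Mirrors the connection failure metric being incremented.
        self.connection_failures += 1

    def record_registration(self):
        # Called when the subscriber (re-)registers with a statestore.
        self._last_registration = self._clock()

    def ms_since_last_registration(self):
        return (self._clock() - self._last_registration) * 1000.0

    def in_post_recovery_grace_period(self):
        has_disconnected_before = self.connection_failures > 0
        within_grace_period = (
            self.ms_since_last_registration() < self.grace_period_ms)
        # Assumption: both conditions must hold for the grace period to apply.
        return has_disconnected_before and within_grace_period
```

Under this model, a subscriber that has never been disconnected is never in the grace period, no matter how recently it registered.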


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653: """Test that a momentary inconsistent cluster membership state after statestore
             :     service fail-over will not result in query cancellation. Also make sure that query
             :     get cancelled if a backend actually went down."""
Is there an existing test for the opposite case, showing that a query is
cancelled after the grace period because of a cluster membership difference?


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> Finishing this query takes about 120 seconds on my local machine. It seems
In that case, would it be better to cancel through the query handle instead?

client.cancel(handle)

In this test, is the query cancelled because cluster membership changed when
the backend was stopped, or just because row transmission was broken?
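
To make the suggestion concrete, here is a minimal sketch of cancelling through the handle once the test has checked what it needs, instead of waiting for the query to finish. FakeClient, execute_async and run_and_cancel are hypothetical names used only for illustration; only the cancel(handle) call shape comes from the comment above, and the real test client may differ.

```python
class FakeClient:
    """Hypothetical stand-in for a test client; not the Impala test API."""

    def __init__(self):
        self.cancelled = []

    def execute_async(self, query):
        # Return an opaque handle immediately instead of waiting for results.
        return "handle-for:" + query

    def cancel(self, handle):
        # Record the cancellation so the caller can verify it happened.
        self.cancelled.append(handle)


def run_and_cancel(client, query, check):
    """Start a query, run the test's checks, then always cancel the query."""
    handle = client.execute_async(query)
    try:
        check(handle)
    finally:
        # Cancel even if a check fails, so the query never runs to completion.
        client.cancel(handle)
    return handle
```

Putting the cancel in a finally block keeps the test from leaving a long-running query behind when an assertion fails.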



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 2
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Fri, 14 Jun 2024 20:58:54 +0000
Gerrit-HasComments: Yes
