[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

Wenzhe Zhou (Code Review) Fri, 14 Jun 2024 16:52:08 -0700

Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )


Change subject: IMPALA-13159: Fix query cancellation caused by statestore failover
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: y defining a post-recovery grace period. During
            : the grace period, don't update the current cluster membership so th
> Thanks!
The active statestored periodically sends cluster membership updates to all 
subscribers, including coordinators and executors. The sending interval is 
100 ms by default. After the grace period, the collected cluster membership 
should be consistent, so it is safe to use the current cluster membership to 
detect unhealthy nodes and cancel running queries that are assigned to them.
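
A rough sketch of the idea, written in Python for brevity rather than the 
actual C++ in statestore-subscriber.cc. The class name, the method names and 
the injected clock are hypothetical and not taken from the patch; the sketch 
also assumes the monotonic clock reading is far larger than the grace period, 
so a last-failover time of 0 never looks like a recent failover.

```python
import time


class FailoverGraceTracker:
    """Toy model of a post-failover grace period (hypothetical names)."""

    def __init__(self, grace_period_ms, clock=time.monotonic):
        self._grace_period_ms = grace_period_ms
        self._clock = clock  # returns seconds; injectable for tests
        # 0 means "no failover observed yet" (assumption of this sketch).
        self._last_failover_ms = 0

    def _now_ms(self):
        return int(self._clock() * 1000)

    def record_failover(self):
        """Remember when the statestore failover was observed."""
        self._last_failover_ms = self._now_ms()

    def ms_since_last_failover(self):
        elapsed = self._now_ms() - self._last_failover_ms
        # Mirrors the suggested sanity check: a monotonic clock never goes back.
        assert elapsed >= 0, "monotonic clock went backwards"
        return elapsed

    def in_grace_period(self):
        return self.ms_since_last_failover() < self._grace_period_ms

    def unhealthy_backends(self, assigned, membership):
        """Backends a query uses that are missing from the membership.

        During the grace period the membership may still be incomplete,
        so nothing is reported and no query would be cancelled.
        """
        if self.in_grace_period():
            return set()
        return set(assigned) - set(membership)
```

With this model, a backend missing from the membership right after failover 
is ignored, and the same missing backend is reported once the grace period has 
elapsed.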


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h
File be/src/statestore/statestore-subscriber.h:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h@345
PS2, Line 345:       int64_t time_ms = MonotonicMillis() - last_failover_time_.Load();
> nit: Might be nice to DCHECK that the result is not negative.
Added a DCHECK.


http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc@1098
PS2, Line 1098:   bool in_failover_grace_period = MilliSecondsSinceLastFailover()
> Do we really need to check this? If it's 0, then MilliSecondsSinceLastFailo
Right, removed has_failover_before.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653
PS1, Line 653: """Test that a momentary inconsistent cluster membership state after statestore
             :     service fail-over will not result in query cancellation. Also make sure that query
             :     get cancelled if a backend actually went down aft
> Is there an existing opposite test that shows query cancelled after grace p
The second part of this test case shows the query being cancelled after the 
grace period due to a cluster membership change.
Updated the comments.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> In that case, is it better to cancel through query handle instead?
The cancellation reason here is a cluster membership change after the grace period.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
> What will happen if, after client.execute_async(slow_query), you start 4th
The query will not be affected if the 4th impalad has no scheduled task.


http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py@692
PS2, Line 692:         assert False, "Query expected to fail"
> Can we verify that it had to wait longer than the CANCELLATION_GRACE_PERIOD
Yes, we need to add another test case for that.
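
One possible shape for such a check, as a sketch only: poll for the query 
failure and compare the elapsed time against the grace period. The helper name 
wait_until and its parameters are hypothetical and not part of the Impala test 
framework; the real test would pass a predicate that inspects the query state.

```python
import time


def wait_until(predicate, timeout_s, interval_s=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it returns True; return elapsed seconds.

    Raises TimeoutError if the predicate is still False after timeout_s.
    clock and sleep are injectable so the helper can be tested quickly.
    """
    start = clock()
    while True:
        if predicate():
            return clock() - start
        if clock() - start > timeout_s:
            raise TimeoutError("condition not met within %s s" % timeout_s)
        sleep(interval_s)
```

A test could start the timer when the backend is killed and then assert that 
the returned elapsed time is at least the configured grace period, for example 
`assert elapsed >= grace_period_s`, where grace_period_s is whatever value the 
test passed to the cluster.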



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 3
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Fri, 14 Jun 2024 23:51:47 +0000
Gerrit-HasComments: Yes
