Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/21520 )
Change subject: IMPALA-13159: Fix query cancellation caused by statestore failover ...................................................................... Patch Set 3: (7 comments) http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12 PS1, Line 12: y defining a post-recovery grace period. During : the grace period, don't update the current cluster membership so th > Thanks! Active statestored keeps sending cluster membership periodically to all subscribers, including coordinators/executors. Sending intervals are 100 ms by default. After grace period, the collected cluster membership should be consistent and it's safe to use the current cluster membership to detect if a node is unhealthy and cancel running queries which are assigned to unhealthy nodes. http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h File be/src/statestore/statestore-subscriber.h: http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.h@345 PS2, Line 345: int64_t time_ms = MonotonicMillis() - last_failover_time_.Load(); > nit: Might be nice to DCHECK that the result is not negative. Added DCHECK http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc File be/src/statestore/statestore-subscriber.cc: http://gerrit.cloudera.org:8080/#/c/21520/2/be/src/statestore/statestore-subscriber.cc@1098 PS2, Line 1098: bool in_failover_grace_period = MilliSecondsSinceLastFailover() > Do we really need to check this? If it's 0, then MilliSecondsSinceLastFailo Right, removed has_failover_before http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py File tests/custom_cluster/test_statestored_ha.py: http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@653 PS1, Line 653: """Test that a momentary inconsistent cluster membership state after statestore : service fail-over will not result in query cancellation. Also make sure that query : get cancelled if a backend actually went down aft > Is there an existing opposite test that shows query cancelled after grace p Second part of this test case shows query cancelled after grace period due to cluster membership change. Updated comments. http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688 PS1, Line 688: # Now kill a backend, and make sure the query fails. > In that case, is it better to cancel through query handle instead? Cancellation reason is cluster membership change after grace period. http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688 PS1, Line 688: # Now kill a backend, and make sure the query fails. > What will happen if, after client.execute_async(slow_query), you start 4th The query will not be affected if 4th impalad has no scheduled task. http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py File tests/custom_cluster/test_statestored_ha.py: http://gerrit.cloudera.org:8080/#/c/21520/2/tests/custom_cluster/test_statestored_ha.py@692 PS2, Line 692: assert False, "Query expected to fail" > Can we verify that it had to wait longer than the CANCELLATION_GRACE_PERIOD Yes, need to add another test case. -- To view, visit http://gerrit.cloudera.org:8080/21520 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b Gerrit-Change-Number: 21520 Gerrit-PatchSet: 3 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Fri, 14 Jun 2024 23:51:47 +0000 Gerrit-HasComments: Yes
