This is an automated email from the ASF dual-hosted git repository.
stigahuang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
The following commit(s) were added to refs/heads/master by this push:
new 9d6997b7c IMPALA-14280: (Addendum) Waits for updating active catalogd address
9d6997b7c is described below
commit 9d6997b7c00512295401f815630e8f02876ecb74
Author: stiga-huang <[email protected]>
AuthorDate: Tue Aug 5 22:03:46 2025 +0800
IMPALA-14280: (Addendum) Waits for updating active catalogd address
Some tests for catalogd HA failover have a lightweight verifier function
that finishes quickly before coordinator notices catalogd HA failover,
e.g. when the verifier function runs a statement that doesn't trigger
catalogd RPCs.
If the test finishes in such a state, the coordinator will use the stale
active catalogd address in cleanup, i.e. when dropping unique_database,
and the cleanup fails quickly since that catalogd is now passive.
Retrying the statement immediately usually won't help since the
coordinator hasn't updated the active catalogd address yet.
Note that we also retry the verifier function immediately when it fails
because the coordinator talked to the stale catalogd address. This works
since the previous active catalogd is not running, so the catalogd RPCs
fail and get retried. The retry interval is 3s (configured by
catalog_client_rpc_retry_interval_ms) and we retry at least 2 times
(customized by catalog_client_connection_num_retries in the tests).
This duration is usually enough for the coordinator to update the active
catalogd address, but relying on this timing is fragile.
This patch adds a wait before the verifier function to make sure the
coordinator has updated the active catalogd address. This also makes
sure the cleanup of unique_database won't fail due to a stale active
catalogd address.
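The wait described above is a poll-until-equal on a coordinator metric.
A minimal, self-contained sketch of that pattern follows. It is not the
Impala test framework's wait_for_metric_value implementation; the names
wait_for_value and FakeCoordinator, the addresses, and the timeout and
interval defaults are all hypothetical and chosen for illustration only.

```python
import time


def wait_for_value(get_value, expected_value, timeout_s=10.0, interval_s=0.1):
    """Poll get_value() until it returns expected_value or timeout_s elapses.

    Hypothetical helper: illustrates the polling pattern only.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        actual = get_value()
        if actual == expected_value:
            return actual
        if time.monotonic() >= deadline:
            raise AssertionError(
                "expected {!r}, last saw {!r}".format(expected_value, actual))
        time.sleep(interval_s)


class FakeCoordinator:
    """Hypothetical stand-in for a coordinator that keeps reporting the stale
    active catalogd address for a fixed number of polls after failover."""

    def __init__(self, stale, fresh, polls_until_updated):
        self._stale = stale
        self._fresh = fresh
        self._remaining = polls_until_updated

    def active_catalogd_address(self):
        if self._remaining > 0:
            self._remaining -= 1
            return self._stale
        return self._fresh


# Usage: block until the (fake) coordinator reports the new address, then
# it is safe to run the verifier and the cleanup.
coordinator = FakeCoordinator("host-1:26000", "host-2:26001", 3)
address = wait_for_value(coordinator.active_catalogd_address,
                         "host-2:26001", timeout_s=2.0, interval_s=0.01)
```

Waiting on an observable state instead of on retry timing removes the
dependency on the retry interval and retry count being long enough.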
Tests:
- Ran test_catalogd_ha.py
Change-Id: I45e4a20170fdcce8282f1762f81a290689777aed
Reviewed-on: http://gerrit.cloudera.org:8080/23252
Reviewed-by: Riza Suminto <[email protected]>
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Quanlong Huang <[email protected]>
---
tests/custom_cluster/test_catalogd_ha.py | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tests/custom_cluster/test_catalogd_ha.py b/tests/custom_cluster/test_catalogd_ha.py
index e5bda08f3..b149d2a44 100644
--- a/tests/custom_cluster/test_catalogd_ha.py
+++ b/tests/custom_cluster/test_catalogd_ha.py
@@ -769,6 +769,11 @@ class TestCatalogdHA(CustomClusterTestSuite):
assert catalogd_service_2.get_metric_value(
"catalog-server.ha-number-active-status-change") > 0
assert catalogd_service_2.get_metric_value("catalog-server.active-status")
+ # Make sure coordinator has updated the active catalogd address.
+ self.cluster.get_first_impalad().service.wait_for_metric_value(
+ "catalog.active-catalogd-address",
+ expected_value="{}:{}".format(catalogd_service_2.hostname,
+ catalogd_service_2.service_port))
for i in range(2):
try:
@@ -784,6 +789,7 @@ class TestCatalogdHA(CustomClusterTestSuite):
continue
assert False, str(e)
+ # Recover the cluster to have two catalogds
active_catalogd.start()
return standby_catalogd, active_catalogd