[ https://issues.apache.org/jira/browse/IMPALA-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011815#comment-18011815 ]

ASF subversion and git services commented on IMPALA-14228:
----------------------------------------------------------

Commit 73de6517a4a403edd569f1d79abda79332874fd4 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=73de6517a ]

IMPALA-14280: Deflake catalogd HA failover tests

Several tests on catalogd HA failover run a loop with the following
pattern:
 - Do some operations
 - Kill the active catalogd
 - Verify some results
 - Start the killed catalogd again
After restarting the killed catalogd, the test immediately gets the new
active and standby catalogds and checks their /healthz pages. This can
fail if the web pages are not registered yet. The cause is that when
starting a catalogd, we only wait for its
'statestore-subscriber.connected' metric to become True, which doesn't
guarantee that the web pages are initialized. This patch adds a wait
for that: when fetching a web page hits a 404 (Not Found) error, wait
and retry.
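
A minimal sketch of such a wait, assuming the Python 'requests'
library; the helper name, port, and polling parameters here are
illustrative, not the actual code in the patch:

{code:python}
import time
import requests

def wait_for_web_page(url, timeout_s=60, interval_s=0.5):
  """Polls 'url' until it stops returning 404, i.e. until the restarted
  catalogd has registered its web pages."""
  deadline = time.time() + timeout_s
  while time.time() < deadline:
    try:
      resp = requests.get(url)
      if resp.status_code != 404:
        return resp
    except requests.exceptions.ConnectionError:
      pass  # The webserver may not be listening yet; keep waiting.
    time.sleep(interval_s)
  raise TimeoutError("%s still unavailable after %ss" % (url, timeout_s))

# Example (25020 is catalogd's default webserver port):
# wait_for_web_page("http://localhost:25020/healthz")
{code}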

Another source of flakiness in these failover tests is that cleaning up
unique_database can fail because impalad keeps using the old active
catalogd address even when retrying failed RPCs (IMPALA-14228). This
patch adds a retry on the DROP DATABASE statement to work around that.
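
The workaround is in the spirit of the following hedged sketch; the
client interface and retry parameters are illustrative:

{code:python}
import time

def drop_database_with_retry(client, db_name, attempts=3, interval_s=2):
  """Retries DROP DATABASE: the first attempts can fail while the
  coordinator still points at the old active catalogd (IMPALA-14228)."""
  for i in range(attempts):
    try:
      client.execute("DROP DATABASE IF EXISTS %s CASCADE" % db_name)
      return
    except Exception:
      if i == attempts - 1:
        raise
      time.sleep(interval_s)  # Give the coordinator time to fail over.
{code}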

Sets disable_log_buffering to True so the killed catalogd has complete
logs.

Sets catalog_client_connection_num_retries to 2 to save time when the
coordinator retries RPCs to the killed catalogd. This reduces the
duration of test_warmed_up_metadata_failover_catchup from 100s to 50s.
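
In Impala's custom-cluster tests such knobs are typically applied via
CustomClusterTestSuite.with_args; a rough sketch under that assumption
(the exact decorator arguments in the patch may differ):

{code:python}
from tests.common.custom_cluster_test_suite import CustomClusterTestSuite

class TestCatalogdHA(CustomClusterTestSuite):

  @CustomClusterTestSuite.with_args(
      impalad_args="--catalog_client_connection_num_retries=2",
      disable_log_buffering=True)
  def test_warmed_up_metadata_failover_catchup(self, unique_database):
    # Test body elided; it runs the kill/verify/restart loop described
    # above against unique_database.
    pass
{code}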

Tests:
 - Ran all (15) failover tests in test_catalogd_ha.py 10 times (each
   round takes 450s).

Change-Id: Iad42a55ed7c357ed98d85c69e16ff705a8cae89d
Reviewed-on: http://gerrit.cloudera.org:8080/23235
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Quanlong Huang <[email protected]>


> Coordinator should retry requests on the new active catalogd after HA failover
> ------------------------------------------------------------------------------
>
>                 Key: IMPALA-14228
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14228
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> During catalogd HA failover, the active catalogd address changes. RPCs that 
> the coordinator sent to the previous active catalogd will fail (e.g. when it 
> crashes), causing query failures. They should be retried on the current 
> active catalogd.
> However, although the coordinator retries the RPC, the retries keep using 
> the previous active catalogd address, e.g. in PrioritizeLoad RPCs:
> {code:cpp}
> Status CatalogOpExecutor::PrioritizeLoad(const TPrioritizeLoadRequest& req,
>     TPrioritizeLoadResponse* result) {
>   int attempt = 0; // Used for debug action only.
>   CatalogServiceConnection::RpcStatus rpc_status =
>       CatalogServiceConnection::DoRpcWithRetry(
>           env_->catalogd_lightweight_req_client_cache(),
>           // A fixed address is used during retry:
>           *ExecEnv::GetInstance()->GetCatalogdAddress().get(),
> {code}
> Due to this, test_warmed_up_metadata_after_failover is still flaky after 
> fixing IMPALA-14227.



