[ https://issues.apache.org/jira/browse/IMPALA-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18011815#comment-18011815 ]
ASF subversion and git services commented on IMPALA-14228:
----------------------------------------------------------
Commit 73de6517a4a403edd569f1d79abda79332874fd4 in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=73de6517a ]
IMPALA-14280: Deflake catalogd HA failover tests
Several catalogd HA failover tests loop over the following pattern:
- Do some operations
- Kill the active catalogd
- Verify some results
- Start the killed catalogd
After starting the killed catalogd, the test gets the new active and
standby catalogds and checks their /healthz pages immediately. This can
fail if the web pages are not registered yet. The cause is that when
starting catalogd, we only wait for its 'statestore-subscriber.connected'
to be True, which doesn't guarantee that the web pages are initialized.
This patch adds a wait for this, i.e. when fetching a web page hits a 404
(Not Found) error, wait and retry.
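The wait-and-retry idea can be sketched roughly as follows. This is an
illustration only, not the code from the patch: wait_for_webpage and
fetch_status are hypothetical names, and fetch_status stands in for
whatever the test uses to request a page such as /healthz and return its
HTTP status code.

```python
import time


def wait_for_webpage(fetch_status, timeout_s=30.0, interval_s=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_status() stops returning 404 or the timeout expires.

    fetch_status is a hypothetical callable that returns the HTTP status
    code of a debug web page (e.g. /healthz). sleep and clock are
    injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status != 404:
            # The page is registered; the caller decides what to do with
            # the returned status.
            return status
        if clock() >= deadline:
            raise TimeoutError(
                "page still returns 404 after %.1fs" % timeout_s)
        sleep(interval_s)
```

Only 404 is treated as "not ready yet"; any other status is handed back
to the caller so that real failures are not hidden by the retry loop.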
Another source of flakiness in these failover tests is that the cleanup
of unique_database could fail because impalad keeps using the old active
catalogd address, even in RPC failure retries (IMPALA-14228). This patch
adds a retry on the DROP DATABASE statement to work around this.
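A retry around the cleanup statement might look roughly like the sketch
below. It is not the code from the patch: execute_with_retry and the
execute callable are hypothetical, and the attempt count and delay are
arbitrary assumptions.

```python
import time


def execute_with_retry(execute, statement, max_attempts=3, delay_s=1.0,
                       sleep=time.sleep):
    """Run execute(statement), retrying when it raises.

    execute is a hypothetical callable that submits a SQL statement and
    raises on failure, e.g. while impalad still targets the old active
    catalogd. The last failure is re-raised once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(statement)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(delay_s)
```

A test teardown could then call, for example,
execute_with_retry(client_execute, "DROP DATABASE IF EXISTS some_db CASCADE"),
where client_execute and some_db are placeholders.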
Sets disable_log_buffering to True so the killed catalogd has complete
logs.
Sets catalog_client_connection_num_retries to 2 to save the time the
coordinator spends retrying RPCs to the killed catalogd. This reduces the
duration of test_warmed_up_metadata_failover_catchup from 100s to 50s.
Tests:
- Ran all (15) failover tests in test_catalogd_ha.py 10 times (each
round takes 450s).
Change-Id: Iad42a55ed7c357ed98d85c69e16ff705a8cae89d
Reviewed-on: http://gerrit.cloudera.org:8080/23235
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Quanlong Huang <[email protected]>
> Coordinator should retry requests on the new active catalogd after HA failover
> ------------------------------------------------------------------------------
>
> Key: IMPALA-14228
> URL: https://issues.apache.org/jira/browse/IMPALA-14228
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
>
> During catalogd HA failover, the active catalogd address changes. RPCs
> that the coordinator sent to the previous active catalogd will fail (e.g.
> when it crashes), causing query failures. They should be retried on the
> current active catalogd.
> However, although the coordinator retries the RPC, it keeps using the
> previous active catalogd address. E.g. in PrioritizeLoad RPCs:
> {code:cpp}
> Status CatalogOpExecutor::PrioritizeLoad(const TPrioritizeLoadRequest& req,
>     TPrioritizeLoadResponse* result) {
>   int attempt = 0; // Used for debug action only.
>   CatalogServiceConnection::RpcStatus rpc_status =
>       CatalogServiceConnection::DoRpcWithRetry(
>           env_->catalogd_lightweight_req_client_cache(),
>           *ExecEnv::GetInstance()->GetCatalogdAddress().get(), // a fixed address is used during retry
> {code}
> Due to this, test_warmed_up_metadata_after_failover is still flaky after
> fixing IMPALA-14227.
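
The behavior the quoted issue asks for, looking up the active catalogd
address on every retry instead of once up front, can be illustrated with
a small language-neutral sketch in Python. This is not Impala code and
not a proposed fix: rpc_with_retry, send, and get_active_address are
hypothetical stand-ins for the C++ retry helper, the RPC itself, and the
address lookup.

```python
def rpc_with_retry(send, get_active_address, max_attempts=3):
    """Retry an RPC, resolving the active address before every attempt.

    send(address) is a hypothetical callable that performs the RPC and
    raises ConnectionError when the target is unreachable.
    get_active_address() returns the address currently believed active.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        # Re-resolved on each attempt, so a failover that happens between
        # attempts is picked up instead of retrying a dead address.
        address = get_active_address()
        try:
            return send(address)
        except ConnectionError as e:
            last_error = e
    raise last_error
```

Passing the lookup as a callable, rather than passing its result, is the
whole point of the sketch: with a pre-resolved address every retry would
go to the same catalogd, which is the problem described above.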