[ https://issues.apache.org/jira/browse/IMPALA-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18005548#comment-18005548 ]
Quanlong Huang commented on IMPALA-14228:
-----------------------------------------
The test failure is
{code:python}
tests/custom_cluster/test_catalogd_ha.py:540: in test_warmed_up_metadata_after_failover
    latest_catalogd = self._test_metadata_after_failover(unique_database, True)
tests/custom_cluster/test_catalogd_ha.py:584: in _test_metadata_after_failover
    self.execute_query_expect_success(self.client, "describe %s.tbl" % unique_database)
tests/common/impala_test_suite.py:1121: in wrapper
    return function(*args, **kwargs)
tests/common/impala_test_suite.py:1131: in execute_query_expect_success
    result = cls.__execute_query(impalad_client, query, query_options, user)
tests/common/impala_test_suite.py:1294: in __execute_query
    return impalad_client.execute(query, user=user)
tests/common/impala_connection.py:687: in execute
    cursor.execute(sql_stmt, configuration=self.__query_options)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:392: in execute
    configuration=configuration)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:443: in execute_async
    self._execute_async(op)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:462: in _execute_async
    operation_fn()
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:440: in op
    run_async=True)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1324: in execute
    return self._operation('ExecuteStatement', req, False)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1244: in _operation
    resp = self._rpc(kind, request, safe_to_retry)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1181: in _rpc
    err_if_rpc_not_ok(response)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:867: in err_if_rpc_not_ok
    raise HiveServer2Error(resp.status.errorMessage)
E   HiveServer2Error: Query 6245b1a9b1f8117f:5788198500000000 failed:
E   InternalException: Error requesting prioritized load: Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
E   Error making an RPC call to Catalog server.{code}
Coordinator logs:
{noformat}
I20250715 16:54:55.091974 2414650 FeSupport.java:340] 6245b1a9b1f8117f:5788198500000000] Requesting prioritized load of table(s): test_warmed_up_metadata_after_failover_452d93b4.tbl
I20250715 16:54:55.096027 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:55.096114 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:55.096169 2414650 thrift-client.cc:82] 6245b1a9b1f8117f:5788198500000000] Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
I20250715 16:54:55.096179 2414650 thrift-client.cc:98] 6245b1a9b1f8117f:5788198500000000] Unable to connect to quanlong-Precision-3680:26000
I20250715 16:54:55.607620 2414075 exec-env.cc:788] The address of Catalog service is changed from quanlong-Precision-3680:26000 to quanlong-Precision-3680:26001
...
I20250715 16:54:58.096710 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:58.096895 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:58.096943 2414650 thrift-client.cc:82] 6245b1a9b1f8117f:5788198500000000] Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
I20250715 16:54:58.096956 2414650 thrift-client.cc:98] 6245b1a9b1f8117f:5788198500000000] Unable to connect to quanlong-Precision-3680:26000
...
I20250715 16:55:22.107597 2414650 jni-util.cc:321] 6245b1a9b1f8117f:5788198500000000] org.apache.impala.common.InternalException: Error requesting prioritized load: Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
Error making an RPC call to Catalog server.
        at org.apache.impala.service.FeSupport.PrioritizeLoad(FeSupport.java:365)
        at org.apache.impala.catalog.ImpaladCatalog.prioritizeLoad(ImpaladCatalog.java:282)
        at org.apache.impala.analysis.StmtMetadataLoader.loadTables(StmtMetadataLoader.java:209)
        at org.apache.impala.analysis.StmtMetadataLoader.loadTables(StmtMetadataLoader.java:145)
        at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2904)
        at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2501)
        at org.apache.impala.service.Frontend.getTExecRequestWithFallback(Frontend.java:2370)
        at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:2059)
        at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:176){noformat}
The coordinator learned the new catalogd address at 16:54:55.607620, yet it kept retrying the RPC against the old address.
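The fix direction this suggests is to re-resolve the active catalogd address on every retry attempt instead of latching it once before the retry loop. A minimal sketch of that idea (PrioritizeLoadWithRefresh, TryPrioritizeLoadRpc, and the retry constants are hypothetical names for illustration, not Impala's actual API):
{code:cpp}
// Illustrative sketch only: re-read the active catalogd address from ExecEnv
// on each attempt, so that a retry issued after an HA failover targets the
// new active catalogd rather than the latched pre-failover address.
Status PrioritizeLoadWithRefresh(const TPrioritizeLoadRequest& req,
    TPrioritizeLoadResponse* result) {
  constexpr int kMaxAttempts = 3;  // hypothetical retry budget
  constexpr int kBackoffMs = 100;  // hypothetical base backoff
  Status status;
  for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
    // Re-resolve on every iteration; after failover this returns the new
    // address (cf. "The address of Catalog service is changed" in the logs).
    TNetworkAddress address = *ExecEnv::GetInstance()->GetCatalogdAddress();
    status = TryPrioritizeLoadRpc(address, req, result);  // hypothetical helper
    if (status.ok()) return status;
    SleepForMs(kBackoffMs << attempt);  // back off before the next attempt
  }
  return status;
}
{code}
Alternatively, DoRpcWithRetry could accept an address-resolving callback rather than a fixed TNetworkAddress, so existing call sites pick up the refreshed address without reimplementing the loop.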
> Coordinator should retry requests on the new active catalogd after HA failover
> ------------------------------------------------------------------------------
>
> Key: IMPALA-14228
> URL: https://issues.apache.org/jira/browse/IMPALA-14228
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
>
> During catalogd HA failover, the active catalogd address changes. RPCs that
> the coordinator sent to the previously active catalogd will fail (e.g. when
> it crashes), causing query failures. They should be retried on the current
> active catalogd.
> However, although the coordinator does retry the RPC, it keeps using the
> previous active catalogd's address, e.g. in PrioritizeLoad RPCs:
> {code:cpp}
> Status CatalogOpExecutor::PrioritizeLoad(const TPrioritizeLoadRequest& req,
>     TPrioritizeLoadResponse* result) {
>   int attempt = 0; // Used for debug action only.
>   CatalogServiceConnection::RpcStatus rpc_status =
>       CatalogServiceConnection::DoRpcWithRetry(
>           env_->catalogd_lightweight_req_client_cache(),
>           *ExecEnv::GetInstance()->GetCatalogdAddress().get(), // a fixed address is used during retry
> {code}
> Because of this, test_warmed_up_metadata_after_failover is still flaky even
> after the fix for IMPALA-14227.