[ https://issues.apache.org/jira/browse/IMPALA-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18005548#comment-18005548 ]
Quanlong Huang commented on IMPALA-14228:
-----------------------------------------
The test failure is
{code:python}
tests/custom_cluster/test_catalogd_ha.py:540: in test_warmed_up_metadata_after_failover
    latest_catalogd = self._test_metadata_after_failover(unique_database, True)
tests/custom_cluster/test_catalogd_ha.py:584: in _test_metadata_after_failover
    self.execute_query_expect_success(self.client, "describe %s.tbl" % unique_database)
tests/common/impala_test_suite.py:1121: in wrapper
    return function(*args, **kwargs)
tests/common/impala_test_suite.py:1131: in execute_query_expect_success
    result = cls.__execute_query(impalad_client, query, query_options, user)
tests/common/impala_test_suite.py:1294: in __execute_query
    return impalad_client.execute(query, user=user)
tests/common/impala_connection.py:687: in execute
    cursor.execute(sql_stmt, configuration=self.__query_options)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:392: in execute
    configuration=configuration)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:443: in execute_async
    self._execute_async(op)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:462: in _execute_async
    operation_fn()
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:440: in op
    run_async=True)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1324: in execute
    return self._operation('ExecuteStatement', req, False)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1244: in _operation
    resp = self._rpc(kind, request, safe_to_retry)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:1181: in _rpc
    err_if_rpc_not_ok(response)
infra/python/env-gcc10.4.0/lib/python2.7/site-packages/impala/hiveserver2.py:867: in err_if_rpc_not_ok
    raise HiveServer2Error(resp.status.errorMessage)
E   HiveServer2Error: Query 6245b1a9b1f8117f:5788198500000000 failed:
E   InternalException: Error requesting prioritized load: Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
E   Error making an RPC call to Catalog server.{code}
Coordinator logs:
{noformat}
I20250715 16:54:55.091974 2414650 FeSupport.java:340] 6245b1a9b1f8117f:5788198500000000] Requesting prioritized load of table(s): test_warmed_up_metadata_after_failover_452d93b4.tbl
I20250715 16:54:55.096027 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:55.096114 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:55.096169 2414650 thrift-client.cc:82] 6245b1a9b1f8117f:5788198500000000] Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
I20250715 16:54:55.096179 2414650 thrift-client.cc:98] 6245b1a9b1f8117f:5788198500000000] Unable to connect to quanlong-Precision-3680:26000
I20250715 16:54:55.607620 2414075 exec-env.cc:788] The address of Catalog service is changed from quanlong-Precision-3680:26000 to quanlong-Precision-3680:26001
...
I20250715 16:54:58.096710 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:58.096895 2414650 thrift-util.cc:238] 6245b1a9b1f8117f:5788198500000000] TSocket::open() connect() <Host: quanlong-Precision-3680 Port: 26000>: Connection refused
I20250715 16:54:58.096943 2414650 thrift-client.cc:82] 6245b1a9b1f8117f:5788198500000000] Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
I20250715 16:54:58.096956 2414650 thrift-client.cc:98] 6245b1a9b1f8117f:5788198500000000] Unable to connect to quanlong-Precision-3680:26000
...
I20250715 16:55:22.107597 2414650 jni-util.cc:321] 6245b1a9b1f8117f:5788198500000000] org.apache.impala.common.InternalException: Error requesting prioritized load: Couldn't open transport for quanlong-Precision-3680:26000 (connect() failed: Connection refused)
Error making an RPC call to Catalog server.
        at org.apache.impala.service.FeSupport.PrioritizeLoad(FeSupport.java:365)
        at org.apache.impala.catalog.ImpaladCatalog.prioritizeLoad(ImpaladCatalog.java:282)
        at org.apache.impala.analysis.StmtMetadataLoader.loadTables(StmtMetadataLoader.java:209)
        at org.apache.impala.analysis.StmtMetadataLoader.loadTables(StmtMetadataLoader.java:145)
        at org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:2904)
        at org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:2501)
        at org.apache.impala.service.Frontend.getTExecRequestWithFallback(Frontend.java:2370)
        at org.apache.impala.service.Frontend.createExecRequest(Frontend.java:2059)
        at org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:176){noformat}
The coordinator learned the new catalogd address at 16:54:55.607620, yet it kept retrying the RPC against the old address.
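The fix direction this suggests is to re-resolve the active catalogd address on every retry attempt instead of latching it once before the retry loop. A minimal sketch of that idea (PrioritizeLoadWithRefresh, TryPrioritizeLoadRpc, and the retry constants are hypothetical names for illustration, not Impala's actual API):
{code:cpp}
// Illustrative sketch only: re-read the active catalogd address from ExecEnv
// on each attempt, so that a retry issued after an HA failover targets the
// new active catalogd rather than the latched pre-failover address.
Status PrioritizeLoadWithRefresh(const TPrioritizeLoadRequest& req,
    TPrioritizeLoadResponse* result) {
  constexpr int kMaxAttempts = 3;  // hypothetical retry budget
  constexpr int kBackoffMs = 100;  // hypothetical base backoff
  Status status;
  for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
    // Re-resolve on every iteration; after failover this returns the new
    // address (cf. "The address of Catalog service is changed" in the logs).
    TNetworkAddress address = *ExecEnv::GetInstance()->GetCatalogdAddress();
    status = TryPrioritizeLoadRpc(address, req, result);  // hypothetical helper
    if (status.ok()) return status;
    SleepForMs(kBackoffMs << attempt);  // back off before the next attempt
  }
  return status;
}
{code}
Alternatively, DoRpcWithRetry could accept an address-resolving callback rather than a fixed TNetworkAddress, so existing call sites pick up the refreshed address without reimplementing the loop.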
> Coordinator should retry requests on the new active catalogd after HA failover
> ------------------------------------------------------------------------------
>
> Key: IMPALA-14228
> URL: https://issues.apache.org/jira/browse/IMPALA-14228
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
>
> During catalogd HA failover, the active catalogd address changes. RPCs that
> the coordinator sent to the previously active catalogd will fail (e.g. when
> it crashes), causing query failures. They should be retried on the current
> active catalogd.
> However, although the coordinator does retry the RPC, it keeps using the
> previous active catalogd's address, e.g. in PrioritizeLoad RPCs:
> {code:cpp}
> Status CatalogOpExecutor::PrioritizeLoad(const TPrioritizeLoadRequest& req,
>     TPrioritizeLoadResponse* result) {
>   int attempt = 0; // Used for debug action only.
>   CatalogServiceConnection::RpcStatus rpc_status =
>       CatalogServiceConnection::DoRpcWithRetry(
>           env_->catalogd_lightweight_req_client_cache(),
>           *ExecEnv::GetInstance()->GetCatalogdAddress().get(), // a fixed address is used during retry
> {code}
> Because of this, test_warmed_up_metadata_after_failover is still flaky even
> after the fix for IMPALA-14227.