[jira] [Commented] (IMPALA-13850) Catalogd should not start metadata operation until initialization is done if HA is enabled

ASF subversion and git services (Jira) Wed, 16 Apr 2025 20:32:57 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17945216#comment-17945216
 ]


ASF subversion and git services commented on IMPALA-13850:
----------------------------------------------------------

Commit 55feffb41b7f1d126efac76bfb269179a89f5f64 in impala's branch 
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=55feffb41 ]

IMPALA-13850 (part 1): Wait until CatalogD active before resetting

In HA mode, CatalogD initialization can fail to complete within
reasonable time. Log messages showed that CatalogD is blocked trying to
acquire "CatalogServer.catalog_lock_" when calling
CatalogServer::UpdateActiveCatalogd() during statestore subscriber
registration. catalog_lock_ was held by GatherCatalogUpdatesThread which
is calling GetCatalogDelta(), which waits for the java lock versionLock_
which is held by the thread doing CatalogServiceCatalog.reset().

This patch remove catalog reset in JniCatalog constructor. In turn,
catalogd-server.cc is now responsible to trigger the metadata
reset (Invaidate Metadata) only if:

1. It is the active CatalogD, and
2. Gathering thread has collect the first topic update or CatalogD is
   set with catalog_topic_mode other than "minimal".

The later prerequisite is to ensure that all coordinators are not
blocked waiting for full topic update in on-demand metadata mode. This
is all managed by a new thread method TriggerResetMetadata that monitor
and trigger the initial reset metadata.

Note that this is a behavior change in on-demand catalog
mode (catalog_topic_mode=minimal). Previously, on-demand catalog mode
will send full database list in its first catalog topic update. This
behavior change is OK since coordinator can request metadata on-demand.

After this patch, catalog-server.active-status and /healthz page can
turn into true and OK respectively even if the very first metadata reset
is still ongoing. Observer that cares about having fully populated
metadata should check other metrics such as catalog.num-db,
catalog.num-tables, or /catalog page content.

Updated start-impala-cluster.py readiness check to wait for at least 1
table to be seen by coordinators, except during create-load-data.sh
execution (there is no table yet) and when use_local_catalog=true (local
catalog cache does not start with any table). Modified startup flag
checking from reading the actual command line args to reading the
'/varz?json' page of the daemon. Cleanup impala_service.py to fix some
flake8 issues.

Slightly update TestLocalCatalogCompactUpdates::test_restart_catalogd so
that unique_database cleanup is successful.

Testing:
- Refactor test_catalogd_ha.py to reduce repeated code, use
  unique_database fixture, and additionally validate /healthz page of
  both active and standby catalogd. Changed it to test using hs2
  protocol by default.
- Run and pass test_catalogd_ha.py and test_concurrent_ddls.py.
- Pass core tests.

Change-Id: I58cc66dcccedb306ff11893f2916ee5ee6a3efc1
Reviewed-on: http://gerrit.cloudera.org:8080/22634
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Riza Suminto <[email protected]>


> Catalogd should not start metadata operation until initialization is done if 
> HA is enabled
> ------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13850
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13850
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Wenzhe Zhou
>            Assignee: Riza Suminto
>            Priority: Critical
>
> In a case reported by user, the catalogd initialization failed to complete. 
> Log messages showed that catalog HA was enabled. catalogd was blocked when 
> trying to acquire "CatalogServer.catalog_lock_" when calling 
> CatalogServer::UpdateActiveCatalogd() during statestore subscriber 
> registration.
> Log message showed that there was IM command issued before catalogd tried to 
> register to statestore.
> {code:java}
> I0310 12:21:34.093617     1 CatalogServiceCatalog.java:2188] Invalidated all 
> metadata.
> I0310 12:21:34.094341     1 thrift-server.cc:419] ThriftServer 
> 'StatestoreSubscriber' started on port: 23020
> I0310 12:21:34.094341  1816 TAcceptQueueServer.cpp:329] 
> connection_setup_thread_pool_size is set to 2
> I0310 12:21:34.094586     1 thrift-util.cc:198] TSocket::open() error on 
> socket (after THRIFT_POLL) <Host: localhost Port: 23020>: Connection refused
> I0310 12:21:34.094790     1 statestore-subscriber.cc:745] Starting statestore 
> subscriber
> {code}
> We should not allow any metadata operation until initialization is done. When 
> HA is enabled, catalog-server should not hold "CatalogServer.catalog_lock_" 
> for long time before active catalogd is assigned.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (IMPALA-13850) Catalogd should not start metadata operation until initialization is done if HA is enabled

Reply via email to