[
https://issues.apache.org/jira/browse/IMPALA-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang resolved IMPALA-14283.
-------------------------------------
Fix Version/s: Impala 5.0.0
Resolution: Fixed
> Coordinator could use stale metadata in catalogd warm failover
> --------------------------------------------------------------
>
> Key: IMPALA-14283
> URL: https://issues.apache.org/jira/browse/IMPALA-14283
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Fix For: Impala 5.0.0
>
> Attachments: catalogd.INFO, catalogd_node1.INFO, impalad.INFO,
> show_parts_profile.txt, statestored.INFO
>
>
> There is a corner case in local catalog mode that when catalogd warm failover
> finishes, coordinator fetches the new table object from the new active
> catalogd and it might coincidently "activate" the stale metadata in the
> cache. This happens when matching all the following conditions:
> * The new table object has a catalog version that has been used in the
> previous active catalogd and it was assigned to the same table name.
> * The coordinator has loaded table metadata associated to that version from
> the previous active catalogd, e.g. partition list, partitions or column stats
> are loaded and still in the cache (not aged out or evicted yet).
> * The initial catalog update of the new active catalogd hasn't arrived yet.
> There is some delay when statestore broadcast the updates. If the statestore
> update arrives before the query planning, we are safe since processing the
> update will clear the cache.
> This bug is found by *TestCatalogdHA.test_warmed_up_metadata_failover_catchup*
> {code:python}
> tests/custom_cluster/test_catalogd_ha.py:640: in
> test_warmed_up_metadata_failover_catchup
>
>
> db, self._add_new_partition, self._verify_new_partition)
>
>
> tests/custom_cluster/test_catalogd_ha.py:762: in
> _test_metadata_after_failover
>
>
> assert str(active_catalogd.service.service_port) in str(e)
>
>
> E assert '26001' in "assert 'year=2025/month=1' in
> 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'"
>
>
> E + where '26001' = str(26001)
>
>
> E + where 26001 = <tests.common.impala_service.CatalogdService object
> at 0x7f5d8c24fd10>.service_port
>
> E + where <tests.common.impala_service.CatalogdService object at
> 0x7f5d8c24fd10> = <CatalogdProcess PID: None
> (/root/Impala-upstream/be/build/latest/service/cata...te_store_subscriber_port=23021
> -webserver_
> port=25021 -enable_catalogd_ha=true)>.service
>
>
> E + and "assert 'year=2025/month=1' in
> 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'" = str(AssertionError(u"assert
> 'year=2025/month=1' in 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'",)){code}
> The test runs an AddPartition statement and kills the active catalogd, then
> runs ShowPartitions in the new active catalogd. It fails since coordinator
> returns an empty partition list. Test logs:
> {noformat}
> -- 2025-07-31 17:10:00,755 INFO MainThread: hs2: executing against Impala
> at quanlong-ha-test.vpc.cloudera.com:21050. session:
> ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None
>
>
>
>
> alter table warmup_test_db.tbl add partition(year=2025, month=1);
>
>
>
>
>
> -- 2025-07-31 17:10:01,119 INFO MainThread:
> b84a8a27014a3340:af77289200000000: query started
>
>
> -- 2025-07-31 17:10:01,120 INFO MainThread:
> b84a8a27014a3340:af77289200000000: getting log for operation
>
>
> -- 2025-07-31 17:10:01,120 INFO MainThread:
> b84a8a27014a3340:af77289200000000: getting runtime profile operation
>
>
> -- 2025-07-31 17:10:01,120 INFO MainThread:
> b84a8a27014a3340:af77289200000000: closing query for operation
>
>
> -- 2025-07-31 17:10:01,147 INFO MainThread: Found PID 102378 for
> /root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5 -v=1
> -max_log_files=10 -log_rotation_match_pid=true -log_filename=catalogd_n
> ode1 -log_dir=/tmp/ -kudu_master_hosts localhost --catalog_topic_mode=minimal
> --catalogd_ha_reset_metadata_on_failover=false
> --debug_actions=catalogd_event_processing_delay:SLEEP@1000
> --enable_reload_events=true --
> warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt
> -catalog_service_port=26001 -state_store_subscriber_port=23021
> -webserver_port=25021 -enable_catalogd_ha=true
> -- 2025-07-31 17:10:01,172 INFO MainThread: Killing <CatalogdProcess PID:
> 102378 (/root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5
> -v=1 -max_log_files=10 -log_rotation_match_pid=true -log_fi
> lename=catalogd_node1 -log_dir=/tmp/ -kudu_master_hosts localhost
> --catalog_topic_mode=minimal --catalogd_ha_reset_metadata_on_failover=false
> --debug_actions=catalogd_event_processing_delay:SLEEP@1000 --enable_relo
> ad_events=true
> --warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt
> -catalog_service_port=26001 -state_store_subscriber_port=23021
> -webserver_port=25021 -enable_catalogd_ha=
> true)> with signal 9
>
>
> -- 2025-07-31 17:10:01,177 INFO MainThread: Getting metric:
> catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020
>
> -- 2025-07-31 17:10:01,180 INFO MainThread: Waiting for metric value
> 'catalog-server.active-status'=True. Current value: False. total_wait: 0s
> -- 2025-07-31 17:10:01,180 INFO MainThread: Sleeping 1s before next
> retry.
>
> -- 2025-07-31 17:10:02,181 INFO MainThread: Getting metric:
> catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020
>
> -- 2025-07-31 17:10:02,184 INFO MainThread: Metric
> 'catalog-server.active-status' has reached desired value: True. total_wait:
> 1.00396800041s
> -- 2025-07-31 17:10:02,188 INFO MainThread: hs2: executing against Impala
> at quanlong-ha-test.vpc.cloudera.com:21050. session:
> ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None
>
>
>
> show partitions warmup_test_db.tbl;
>
>
>
> -- 2025-07-31 17:10:03,737 INFO MainThread:
> 664e4d97f4712dfb:e4d0877500000000: query started
> -- 2025-07-31 17:10:03,738 INFO MainThread:
> 664e4d97f4712dfb:e4d0877500000000: getting log for operation
> -- 2025-07-31 17:10:03,738 INFO MainThread:
> 664e4d97f4712dfb:e4d0877500000000: getting runtime profile operation
> -- 2025-07-31 17:10:03,738 INFO MainThread:
> 664e4d97f4712dfb:e4d0877500000000: closing query for operation
> -- 2025-07-31 17:10:03,739 INFO MainThread: partition result:
> ['Total\t\t-1\t0\t0B\t0B\t\t\t\t\t']{noformat}
> Profile of ShowPartitions indicates some stale metadata are used:
> {noformat}
> Original Table Versions: warmup_test_db.tbl, 314, 1754007002807, Thu
> Jul 31 17:10:02 PDT 2025
> - CatalogFetch.ColumnStats.Hits: 13 <--- This shouldn't hit
> - CatalogFetch.ColumnStats.Misses: 0
> - CatalogFetch.ColumnStats.Requests: 13
> - CatalogFetch.ColumnStats.Time: 0
> - CatalogFetch.DatabaseList.Hits: 1
> - CatalogFetch.DatabaseList.Misses: 0
> - CatalogFetch.DatabaseList.Requests: 1
> - CatalogFetch.DatabaseList.Time: 0
> - CatalogFetch.PartitionLists.Hits: 1 <--- This shouldn't hit
> - CatalogFetch.PartitionLists.Misses: 0
> - CatalogFetch.PartitionLists.Requests: 1
> - CatalogFetch.PartitionLists.Time: 0
> - CatalogFetch.RPCs.Bytes: 1.78 KB (1818)
> - CatalogFetch.RPCs.Requests: 2
> - CatalogFetch.RPCs.Time: 1s531ms
> - CatalogFetch.StorageLoad.Time: 22ms
> - CatalogFetch.TableList.Hits: 1
> - CatalogFetch.TableList.Misses: 1
> - CatalogFetch.TableList.Requests: 2
> - CatalogFetch.TableList.Time: 1s516ms
> - CatalogFetch.Tables.Hits: 0
> - CatalogFetch.Tables.Misses: 1
> - CatalogFetch.Tables.Requests: 1
> - CatalogFetch.Tables.Time: 15ms
> - ExecutorGroupsConsidered: 1 (1){noformat}
> In logs of the new active catalogd (catalogd.INFO), the catalog version of
> the table is 314:
> {noformat}
> I20250731 17:10:02.760269 103771 CatalogOpExecutor.java:4872] Found 1/1
> partitions to add in table warmup_test_db.tbl from event 888
> I20250731 17:10:02.807034 103771 HdfsTable.java:781] Loaded file and block
> metadata for warmup_test_db.tbl partitions: year=2025/month=1. Time taken:
> 27.856ms
> I20250731 17:10:02.807621 103771 MetastoreEvents.java:860] EventId: 888
> EventType: ADD_PARTITION Target: warmup_test_db.tbl. Successfully added 1
> partitions to table warmup_test_db.tbl
> I20250731 17:10:02.807871 103771 MetastoreEventsProcessor.java:1578] Time
> elapsed in processing event batch: 1s156ms
> I20250731 17:10:03.673661 103802 JniUtil.java:167] getEventProcessorSummary
> request: Getting event processor summary
> I20250731 17:10:03.674260 103802 JniUtil.java:178] Finished
> getEventProcessorSummary request: Getting event processor summary. Time
> spent: 1ms
> I20250731 17:10:03.674295 103802 catalog-server.cc:978] EventProcessor is
> synced up with HMS event id during failover: 888
> I20250731 17:10:03.674309 103802 catalog-server.cc:887] This catalogd
> instance is changed to active status
> I20250731 17:10:03.677537 103773 JniUtil.java:167] getCatalogDelta request:
> Getting catalog delta from version 0
> I20250731 17:10:03.687700 103773 catalog-server.cc:1615] Collected update:
> 2:DATABASE:default, version=2, original size=48, compressed size=49
> ...
> I20250731 17:10:03.720110 103773 CatalogServiceCatalog.java:1965] Skipped 0
> partitions of table warmup_test_db.tbl in the incremental update
> I20250731 17:10:03.720285 103773 CatalogServiceCatalog.java:1978] Collected 1
> partition update(s): 2:HDFS_PARTITION:warmup_test_db.tbl:year=2025/month=1,
> version=314, original size=100, compressed size=101
> I20250731 17:10:03.720324 103773 catalog-server.cc:1615] Collected update:
> 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed
> size=66{noformat}
> In the previous catalogd logs (catalogd_node1.INFO), the table version before
> AddPartition is also 314!
> {noformat}
> I20250731 17:09:58.553951 103647 HdfsTable.java:1256] Loading metadata for
> table definition and all partition(s) of warmup_test_db.tbl (needed by
> coordinator)
> I20250731 17:09:58.600651 103647 HdfsTable.java:1919] Loaded 13 columns from
> HMS. Actual columns: 13
> I20250731 17:09:58.623989 103647 HdfsTable.java:2987] Load Valid Write Id
> List Done. Time taken: 24.253us
> I20250731 17:09:58.624092 103647 HdfsTable.java:1303] Fetching partition
> metadata from the Metastore: warmup_test_db.tbl
> I20250731 17:09:58.628580 103647 MetaStoreUtil.java:191] Fetching 0
> partitions for: warmup_test_db.tbl using partition batch size: 1000
> I20250731 17:09:58.628685 103647 HdfsTable.java:1310] Fetched partition
> metadata from the Metastore: warmup_test_db.tbl
> I20250731 17:09:58.649619 103647 HdfsTable.java:781] Loaded file and block
> metadata for warmup_test_db.tbl partitions: . Time taken: 3.319ms
> I20250731 17:09:58.651077 103647 Table.java:1106] last refreshed event id for
> table: warmup_test_db.tbl set to: -1
> I20250731 17:09:58.651185 103647 TableLoader.java:187] Loaded metadata for:
> warmup_test_db.tbl (129ms)
> I20250731 17:09:58.685761 102954 JniUtil.java:167] getCatalogDelta request:
> Getting catalog delta from version 313
> I20250731 17:09:58.693717 102954 CatalogServiceCatalog.java:1965] Skipped 0
> partitions of table warmup_test_db.tbl in the incremental update
> I20250731 17:09:58.693792 102954 catalog-server.cc:1615] Collected update:
> 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed
> size=66{noformat}
> So coordinator has loaded the empty partition list of it in cache. It's
> associated with table name "warmup_test_db.tbl" and catalog version 314. Then
> coincidently reused after catalogd warm failover.
> We should add the catalog service id in the getPartialCatalogObject response
> and coordinator should clear the cache when the id changes, just like what we
> do in updateCatalogCache() ->
> [witnessCatalogServiceId()|https://github.com/apache/impala/blob/5414c3009890a736901b6ae3b56b9953b1992de7/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#L1614].
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]