Hello Riza Suminto, Zoltan Borok-Nagy, Michael Smith, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/22816

to look at the new patch set (#3).

Change subject: IMPALA-13829: Postpone catalog deleteLog GC for waitForHmsEvent requests
......................................................................

IMPALA-13829: Postpone catalog deleteLog GC for waitForHmsEvent requests

When a db/table is removed from the catalog cache, catalogd assigns it a new
catalog version and puts it into the deleteLog. The catalog update thread uses
the deleteLog to collect deletion updates. Once the catalog update thread has
collected a range of updates, it triggers a GC in the deleteLog to clear items
older than the last sent catalog version. The deletions are eventually
broadcast by the statestore to all coordinators.

However, waitForHmsEvent requests are also consumers of the deleteLog and can
be impacted by these GCs. waitForHmsEvent is a catalogd RPC used by
coordinators when a query wants to wait until the related metadata is in sync
with HMS. The waitForHmsEvent response returns the latest metadata, including
the deletions on related dbs/tables. If the related deletions in the deleteLog
are GCed just before the waitForHmsEvent request collects the results, they
will be missing from the response, and the coordinator might keep using stale
metadata for non-existent dbs/tables.

This is a quick fix for the issue: postpone deleteLog GC by a configurable
number of topic updates, similar to what we have done for the TopicUpdateLog.
A thorough fix might need to carefully choose the version to GC, or let
impalad wait for the deletions from the statestore to arrive.

A new flag, catalog_delete_log_gc_frequency, is added for this. The deleteLog
GC happens once every N+1 (N=catalog_delete_log_gc_frequency) topic updates,
and it only clears items removed before the last N-th topic update. The
default is 1000, so a deletion can survive for around 2000 rounds of topic
updates (at least 4000s). That should be safe enough, i.e. the GCed deletions
must already have arrived on the impalad side; otherwise that impalad is
abnormal and already has other, more severe issues, e.g. lots of stale tables
due to metadata being out of sync with catalogd.

Also removed some unused imports.

Tests:
 - Added an e2e test with a debug action to reproduce the issue. Ran the test
   100 times. Without the fix, it consistently fails within 2-3 runs.

Change-Id: I2441440bca2b928205dd514047ba742a5e8bf05e
---
M be/src/catalog/catalog-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogDeltaLog.java
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniCatalog.java
M fe/src/main/java/org/apache/impala/util/DebugUtils.java
M tests/metadata/test_event_processing.py
11 files changed, 126 insertions(+), 16 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/22816/3

--
To view, visit http://gerrit.cloudera.org:8080/22816
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2441440bca2b928205dd514047ba742a5e8bf05e
Gerrit-Change-Number: 22816
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
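
The postponed-GC scheme described in the commit message can be sketched as below. This is a minimal illustration, not Impala's actual CatalogDeltaLog code; the class and method names are hypothetical, and only the gc-frequency idea (GC every N+1 topic updates, with a boundary version recorded N updates earlier) comes from the change description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a deleteLog whose GC is postponed by N topic updates.
// N models the catalog_delete_log_gc_frequency flag from the change.
public class DeleteLogSketch {
  // Catalog version -> name of the deleted db/table.
  private final TreeMap<Long, String> deleteLog = new TreeMap<>();
  // "Last sent catalog version" recorded at each topic update.
  private final List<Long> sentVersions = new ArrayList<>();
  private final int gcFrequency;  // models catalog_delete_log_gc_frequency
  private int numTopicUpdates = 0;

  public DeleteLogSketch(int gcFrequency) { this.gcFrequency = gcFrequency; }

  public void addDeletion(long catalogVersion, String name) {
    deleteLog.put(catalogVersion, name);
  }

  // Called after the catalog update thread has collected and sent deletions up
  // to lastSentVersion. GC only runs once every N+1 topic updates, and it only
  // clears items up to the version sent N topic updates ago, so a deletion
  // stays visible to concurrent readers for many extra rounds.
  public void onTopicUpdateSent(long lastSentVersion) {
    sentVersions.add(lastSentVersion);
    numTopicUpdates++;
    if (numTopicUpdates % (gcFrequency + 1) != 0) return;  // GC every N+1 updates
    int boundaryIdx = sentVersions.size() - 1 - gcFrequency;
    long boundary = sentVersions.get(boundaryIdx);
    deleteLog.headMap(boundary, true).clear();     // drop versions <= boundary
    sentVersions.subList(0, boundaryIdx).clear();  // drop history no longer needed
  }

  public int size() { return deleteLog.size(); }
}
```

With gcFrequency=2, a deletion added at catalog version 1 survives the first two topic updates and is only cleared on the third, when the GC boundary is the version that was sent two updates earlier.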