Hello Riza Suminto, Zoltan Borok-Nagy, Michael Smith, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/22816

to look at the new patch set (#3).

Change subject: IMPALA-13829: Postpone catalog deleteLog GC for waitForHmsEvent requests
......................................................................

IMPALA-13829: Postpone catalog deleteLog GC for waitForHmsEvent requests

When a db/table is removed from the catalog cache, catalogd assigns it a new
catalog version and puts it into the deleteLog. The catalog update thread uses
the deleteLog to collect deletion updates. Once the catalog update thread has
collected a range of updates, it triggers a GC in the deleteLog to clear items
older than the last sent catalog version. The deletions are eventually
broadcast by the statestore to all coordinators.

However, waitForHmsEvent requests are also consumers of the deleteLog and can
be impacted by these GCs. waitForHmsEvent is a catalogd RPC used by
coordinators when a query wants to wait until the related metadata is in sync
with HMS. The waitForHmsEvent response returns the latest metadata, including
the deletions on related dbs/tables. If the related deletions in the deleteLog
are GCed just before the waitForHmsEvent request collects the results, they
will be missing from the response, and the coordinator might keep using stale
metadata for non-existent dbs/tables.

This is a quick fix for the issue: postpone deleteLog GC by a configurable
number of topic updates, similar to what we have done for the TopicUpdateLog.
A thorough fix might need to carefully choose the version to GC, or let
impalad wait for the deletions from the statestore to arrive.

A new flag, catalog_delete_log_gc_frequency, is added for this. The deleteLog
GC happens once every N+1 (N=catalog_delete_log_gc_frequency) topic updates,
and it only clears items removed before the last N-th topic update. The
default is 1000, so a deletion can survive for around 2000 rounds of topic
updates (at least 4000s). That should be safe enough, i.e. the GCed deletions
must already have arrived on the impalad side; otherwise that impalad is
abnormal and already has other, more severe issues, e.g. lots of stale tables
due to metadata being out of sync with catalogd.

Also removed some unused imports.

Tests:
 - Added an e2e test with a debug action to reproduce the issue. Ran the test
   100 times. Without the fix, it consistently fails within 2-3 runs.

Change-Id: I2441440bca2b928205dd514047ba742a5e8bf05e
---
M be/src/catalog/catalog-server.cc
M be/src/util/backend-gflag-util.cc
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogService.thrift
M fe/src/main/java/org/apache/impala/catalog/CatalogDeltaLog.java
M fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
M fe/src/main/java/org/apache/impala/service/FeSupport.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniCatalog.java
M fe/src/main/java/org/apache/impala/util/DebugUtils.java
M tests/metadata/test_event_processing.py
11 files changed, 126 insertions(+), 16 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/22816/3

--
To view, visit http://gerrit.cloudera.org:8080/22816
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2441440bca2b928205dd514047ba742a5e8bf05e
Gerrit-Change-Number: 22816
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Michael Smith <michael.sm...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
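
The postponed-GC scheme described in the commit message can be sketched as below. This is a minimal illustration, not Impala's actual CatalogDeltaLog code; the class and method names are hypothetical, and only the gc-frequency idea (GC every N+1 topic updates, with a boundary version recorded N updates earlier) comes from the change description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of a deleteLog whose GC is postponed by N topic updates.
// N models the catalog_delete_log_gc_frequency flag from the change.
public class DeleteLogSketch {
  // Catalog version -> name of the deleted db/table.
  private final TreeMap<Long, String> deleteLog = new TreeMap<>();
  // "Last sent catalog version" recorded at each topic update.
  private final List<Long> sentVersions = new ArrayList<>();
  private final int gcFrequency;  // models catalog_delete_log_gc_frequency
  private int numTopicUpdates = 0;

  public DeleteLogSketch(int gcFrequency) { this.gcFrequency = gcFrequency; }

  public void addDeletion(long catalogVersion, String name) {
    deleteLog.put(catalogVersion, name);
  }

  // Called after the catalog update thread has collected and sent deletions up
  // to lastSentVersion. GC only runs once every N+1 topic updates, and it only
  // clears items up to the version sent N topic updates ago, so a deletion
  // stays visible to concurrent readers for many extra rounds.
  public void onTopicUpdateSent(long lastSentVersion) {
    sentVersions.add(lastSentVersion);
    numTopicUpdates++;
    if (numTopicUpdates % (gcFrequency + 1) != 0) return;  // GC every N+1 updates
    int boundaryIdx = sentVersions.size() - 1 - gcFrequency;
    long boundary = sentVersions.get(boundaryIdx);
    deleteLog.headMap(boundary, true).clear();     // drop versions <= boundary
    sentVersions.subList(0, boundaryIdx).clear();  // drop history no longer needed
  }

  public int size() { return deleteLog.size(); }
}
```

With gcFrequency=2, a deletion added at catalog version 1 survives the first two topic updates and is only cleared on the third, when the GC boundary is the version that was sent two updates earlier.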