[ https://issues.apache.org/jira/browse/KUDU-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17934027#comment-17934027 ]
Alexey Serbin commented on KUDU-3649:
-------------------------------------

Thank you for taking a look at this.

{quote}
While launching kudu-master with Hive Metastore integration enabled, I don't see the following log:
{quote}
You wouldn't see it in the log before a particular kudu-master instance is elected leader of the system catalog tablet's Raft group. In fact, in a multi-master deployment you might never see it at all in the logs of a particular kudu-master process if that process is never elected Raft leader for the system catalog tablet.

{quote}
This means there is a race between initializing hms_catalog_ here and using it here.
{quote}
Regardless of whether there is a race or not, when HMS integration is enabled for a Kudu master that ran its very first time with HMS integration disabled, the master uses -1 as the last known HMS notification event ID. That's because nothing is stored in the system catalog yet for a previously seen HMS event ID. You can see that [in the code|https://github.com/apache/kudu/blob/7ce47309845ededefd839ffd665d274cf34a860c/src/kudu/master/sys_catalog.cc#L827-L843], where the 'processor' is fed with an empty input while running {{ProcessRows<SysNotificationLogEventIdPB, HMS_NOTIFICATION_LOG>(processor)}}:
{noformat}
Status SysCatalogTable::GetLatestNotificationLogEventId(int64_t* event_id) {
  TRACE_EVENT0("master", "SysCatalogTable::GetLatestNotificationLogEventId");
  *event_id = -1;
  auto processor = [&](const string& entry_id,
                       const SysNotificationLogEventIdPB& entry_data) {
    if (entry_id != kLatestNotificationLogEntryIdRowId) {
      // This is not the row we're looking for.
      return Status::OK();
    }
    DCHECK(entry_data.has_latest_notification_log_event_id());
    DCHECK(entry_data.latest_notification_log_event_id() >= 0);
    *event_id = entry_data.latest_notification_log_event_id();
    return Status::OK();
  };
  return ProcessRows<SysNotificationLogEventIdPB, HMS_NOTIFICATION_LOG>(processor);
}
{noformat}

{quote}
The hms-notification-log-listener thread in Kudu-master will start fetching all HMS notification events due to this.
{quote}
Yes, and that's what it should be doing in such a situation, as mentioned above. Since there hasn't been any prior interaction with the HMS, it uses '-1' as the prior event ID and starts fetching events from the HMS, effectively fetching the whole history. IIUC, that's by design. If that's not acceptable, what would be the alternative?

BTW, as of PS1 the patch posted at [https://gerrit.cloudera.org/c/22602/|https://gerrit.cloudera.org/c/22602/] doesn't change that behavior in any way, so it wouldn't help with the HMS OOM issue mentioned.

{quote}
In my local env, there are lots of HMS notification events that have a large message body causing HMS OOM to serve these requests
{quote}
In that case I'd consider filing a JIRA against the HMS/Hive project to make it more robust. Hitting an OOM condition when sending out information on just 100 events per batch is indeed worrisome, even if the whole history of events is being fetched eventually.

I'm going to take a closer look to check whether there is indeed a race there. Meanwhile, if you suspect the HMS OOM issue might be worked around by fetching HMS events in smaller batches, consider tweaking the {{\-\-hive_metastore_notification_log_batch_size}} [flag for Kudu master|https://kudu.apache.org/docs/configuration_reference.html#kudu-master_hive_metastore_notification_log_batch_size], setting it to a smaller value.
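For illustration only (the exact invocation depends on how your master is deployed and which flags it already runs with), the flag can be passed like any other kudu-master gflag; the value 25 below is an arbitrary example, not a recommendation:
{noformat}
# Illustrative sketch: fetch HMS notification events in smaller batches.
# 25 is an arbitrary example value; the logs in the report show a batch size of 100.
kudu-master \
  --hive_metastore_notification_log_batch_size=25 \
  <existing master flags, including the HMS integration flags>
{noformat}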
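Coming back to the suspected race: as a side note, the ordering hazard described in the report can be reduced to a small standalone C++ sketch. This is not Kudu code — {{HmsCatalogStub}}, the thread names, and the sleep are made up for the illustration — it just shows how a "check the pointer during leader startup" path can observe nullptr if another thread publishes the object slightly later:
{code:cpp}
// Standalone illustration (not Kudu code) of the initialization/use ordering
// hazard described in the report: if the "leader startup" path checks a
// pointer that a background thread has not assigned yet, it silently takes
// the "not configured" branch even though the feature is enabled.
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>

struct HmsCatalogStub {};  // stand-in for the real HmsCatalog

std::mutex mu;
std::unique_ptr<HmsCatalogStub> hms_catalog;  // stand-in for hms_catalog_

// Mimics the initialization path: the catalog is constructed a bit later.
void InitThread() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  std::lock_guard<std::mutex> l(mu);
  hms_catalog = std::make_unique<HmsCatalogStub>();
  std::cout << "init: hms_catalog is now set\n";
}

// Mimics the leader-startup path: it only loads the last processed event ID
// if the pointer is already set at the moment of the check.
void LeaderStartupThread() {
  std::lock_guard<std::mutex> l(mu);
  if (hms_catalog) {
    std::cout << "startup: loading latest processed notification event ID\n";
  } else {
    // With the timing above, this branch is taken, which is analogous to the
    // "hms_catalog_ is nullptr" debug log in the report.
    std::cout << "startup: hms_catalog is nullptr, skipping the load\n";
  }
}

int main() {
  std::thread init(InitThread);
  std::thread startup(LeaderStartupThread);
  init.join();
  startup.join();
  return 0;
}
{code}
Whether Kudu's actual leader-startup path can interleave with HmsCatalog initialization this way is exactly what needs to be verified in the code.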
> Last processed Hive Metastore notification event ID is not loaded correctly
> ---------------------------------------------------------------------------
>
>                 Key: KUDU-3649
>                 URL: https://issues.apache.org/jira/browse/KUDU-3649
>             Project: Kudu
>          Issue Type: Bug
>          Components: server
>            Reporter: Quanlong Huang
>            Priority: Critical
>         Attachments: kudu-debug.patch, kudu-master.quanlong-OptiPlex-BJ.quanlong.log.INFO.20250309-094427.22679
>
>
> While launching kudu-master with Hive Metastore integration enabled, I don't see the following log:
> {code:cpp}
>   if (hms_catalog_) {
>     static const char* const kNotificationLogEventIdDescription =
>         "Loading latest processed Hive Metastore notification log event ID";
>     LOG(INFO) << kNotificationLogEventIdDescription << "...";
> {code}
> https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1446-L1447
>
> The kudu version used in Impala's test env is commit e742f86f6. I tried patching kudu to add more debug logs via [^kudu-debug.patch]. Then I see the following logs:
> {noformat}
> I20250309 09:44:28.930799 22774 catalog_manager.cc:1434] Initializing in-progress tserver states...
> I20250309 09:44:28.930833 22774 catalog_manager.cc:1455] hms_catalog_ is nullptr
> I20250309 09:44:28.930859 22765 hms_catalog.cc:109] Initializing HmsCatalog
> I20250309 09:44:28.931007 22790 hms_notification_log_listener.cc:222] durable_event_id = -1, batch_size = 100
> I20250309 09:44:28.936439 22788 hms_client.cc:369] Fetching 100 HMS events from id -1
> {noformat}
> This means there is a race between initializing hms_catalog_ [here|https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1043] and using it [here|https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1444]. hms_notification_log_event_id_ is not loaded correctly and keeps its original value of -1. Because of this, the hms-notification-log-listener thread in kudu-master starts fetching all HMS notification events. In my local env, there are lots of HMS notification events with large message bodies, and the HMS hits OOM while serving these requests. So HmsNotificationLogListenerTask::Poll() never succeeds and keeps polling events from id -1. The master logs are attached: [^kudu-master.quanlong-OptiPlex-BJ.quanlong.log.INFO.20250309-094427.22679]
>
> Due to this, creating managed tables in Kudu fails with the following error. IMPALA-13846 is an example.
> {code}
> failed to wait for Hive Metastore notification log listener to catch up: failed to retrieve notification log events: failed to get Hive Metastore next notification: No more data to read.
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)