[ https://issues.apache.org/jira/browse/KUDU-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17934027#comment-17934027 ]
Alexey Serbin commented on KUDU-3649:
-------------------------------------

Thank you for taking a look at this.

{quote}
While launching kudu-master with Hive Metastore integration enabled, I don't see the following log:
{quote}
You wouldn't see it in the log before a particular kudu-master instance is elected leader of the system catalog tablet's Raft group. In fact, in a multi-master deployment you might never see it at all in the logs of a particular kudu-master process if that process is never elected Raft leader for the system catalog tablet.

{quote}
This means there is a race between initializing hms_catalog_ here and using it here.
{quote}
Regardless of whether there is a race or not, when HMS integration is enabled for a Kudu master that ran its very first time with HMS integration disabled, the master uses -1 as the last known HMS notification event ID. That's because nothing is stored in the system catalog yet for a previously seen HMS event ID. You can see that [in the code|https://github.com/apache/kudu/blob/7ce47309845ededefd839ffd665d274cf34a860c/src/kudu/master/sys_catalog.cc#L827-L843], where the 'processor' is fed with an empty input while running {{ProcessRows<SysNotificationLogEventIdPB, HMS_NOTIFICATION_LOG>(processor)}}:
{noformat}
Status SysCatalogTable::GetLatestNotificationLogEventId(int64_t* event_id) {
  TRACE_EVENT0("master", "SysCatalogTable::GetLatestNotificationLogEventId");
  *event_id = -1;
  auto processor = [&](const string& entry_id,
                       const SysNotificationLogEventIdPB& entry_data) {
    if (entry_id != kLatestNotificationLogEntryIdRowId) {
      // This is not the row we're looking for.
      return Status::OK();
    }
    DCHECK(entry_data.has_latest_notification_log_event_id());
    DCHECK(entry_data.latest_notification_log_event_id() >= 0);
    *event_id = entry_data.latest_notification_log_event_id();
    return Status::OK();
  };
  return ProcessRows<SysNotificationLogEventIdPB, HMS_NOTIFICATION_LOG>(processor);
}
{noformat}

{quote}
The hms-notification-log-listener thread in Kudu-master will start fetching all HMS notification events due to this.
{quote}
Yes, and that's what it should be doing in such a situation, as mentioned above. Since there hasn't been any prior interaction with the HMS, it uses '-1' as the prior event ID and starts fetching events from the HMS, effectively fetching the whole history. IIUC, that's by design. If that's not acceptable, what would be the alternative?

BTW, as of PS1 the patch posted at [https://gerrit.cloudera.org/c/22602/|https://gerrit.cloudera.org/c/22602/] doesn't change that behavior in any way, so it wouldn't help with the HMS OOM issue mentioned.

{quote}
In my local env, there are lots of HMS notification events that have a large message body causing HMS OOM to serve these requests
{quote}
In that case I'd consider filing a JIRA against the HMS/Hive project to make it more robust. Hitting an OOM condition when sending out information on just 100 events per batch is indeed worrisome, even if the whole history of events is being fetched eventually.

I'm going to take a closer look to check whether there is indeed a race there. Meanwhile, if you suspect the HMS OOM issue might be worked around by fetching HMS events in smaller batches, consider tweaking the {{\-\-hive_metastore_notification_log_batch_size}} [flag for Kudu master|https://kudu.apache.org/docs/configuration_reference.html#kudu-master_hive_metastore_notification_log_batch_size], setting it to a smaller value.
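For illustration only (the exact invocation depends on how your master is deployed and which flags it already runs with), the flag can be passed like any other kudu-master gflag; the value 25 below is an arbitrary example, not a recommendation:
{noformat}
# Illustrative sketch: fetch HMS notification events in smaller batches.
# 25 is an arbitrary example value; the logs in the report show a batch size of 100.
kudu-master \
  --hive_metastore_notification_log_batch_size=25 \
  <existing master flags, including the HMS integration flags>
{noformat}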
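Coming back to the suspected race: as a side note, the ordering hazard described in the report can be reduced to a small standalone C++ sketch. This is not Kudu code — {{HmsCatalogStub}}, the thread names, and the sleep are made up for the illustration — it just shows how a "check the pointer during leader startup" path can observe nullptr if another thread publishes the object slightly later:
{code:cpp}
// Standalone illustration (not Kudu code) of the initialization/use ordering
// hazard described in the report: if the "leader startup" path checks a
// pointer that a background thread has not assigned yet, it silently takes
// the "not configured" branch even though the feature is enabled.
#include <chrono>
#include <iostream>
#include <memory>
#include <mutex>
#include <thread>

struct HmsCatalogStub {};  // stand-in for the real HmsCatalog

std::mutex mu;
std::unique_ptr<HmsCatalogStub> hms_catalog;  // stand-in for hms_catalog_

// Mimics the initialization path: the catalog is constructed a bit later.
void InitThread() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  std::lock_guard<std::mutex> l(mu);
  hms_catalog = std::make_unique<HmsCatalogStub>();
  std::cout << "init: hms_catalog is now set\n";
}

// Mimics the leader-startup path: it only loads the last processed event ID
// if the pointer is already set at the moment of the check.
void LeaderStartupThread() {
  std::lock_guard<std::mutex> l(mu);
  if (hms_catalog) {
    std::cout << "startup: loading latest processed notification event ID\n";
  } else {
    // With the timing above, this branch is taken, which is analogous to the
    // "hms_catalog_ is nullptr" debug log in the report.
    std::cout << "startup: hms_catalog is nullptr, skipping the load\n";
  }
}

int main() {
  std::thread init(InitThread);
  std::thread startup(LeaderStartupThread);
  init.join();
  startup.join();
  return 0;
}
{code}
Whether Kudu's actual leader-startup path can interleave with HmsCatalog initialization this way is exactly what needs to be verified in the code.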
> Last processed Hive Metastore notification event ID is not loaded correctly
> ---------------------------------------------------------------------------
>
>                 Key: KUDU-3649
>                 URL: https://issues.apache.org/jira/browse/KUDU-3649
>             Project: Kudu
>          Issue Type: Bug
>          Components: server
>            Reporter: Quanlong Huang
>            Priority: Critical
>         Attachments: kudu-debug.patch, kudu-master.quanlong-OptiPlex-BJ.quanlong.log.INFO.20250309-094427.22679
>
>
> While launching kudu-master with Hive Metastore integration enabled, I don't see the following log:
> {code:cpp}
>   if (hms_catalog_) {
>     static const char* const kNotificationLogEventIdDescription =
>         "Loading latest processed Hive Metastore notification log event ID";
>     LOG(INFO) << kNotificationLogEventIdDescription << "...";
> {code}
> https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1446-L1447
>
> The kudu version used in Impala's test env is commit e742f86f6. I tried patching kudu to add more debug logs via [^kudu-debug.patch]. Then I see the following logs:
> {noformat}
> I20250309 09:44:28.930799 22774 catalog_manager.cc:1434] Initializing in-progress tserver states...
> I20250309 09:44:28.930833 22774 catalog_manager.cc:1455] hms_catalog_ is nullptr
> I20250309 09:44:28.930859 22765 hms_catalog.cc:109] Initializing HmsCatalog
> I20250309 09:44:28.931007 22790 hms_notification_log_listener.cc:222] durable_event_id = -1, batch_size = 100
> I20250309 09:44:28.936439 22788 hms_client.cc:369] Fetching 100 HMS events from id -1
> {noformat}
> This means there is a race between initializing hms_catalog_ [here|https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1043] and using it [here|https://github.com/apache/kudu/blob/e742f86f6d8e687dd02d9891f33e068477163016/src/kudu/master/catalog_manager.cc#L1444]. hms_notification_log_event_id_ is not loaded correctly and keeps its original value of -1. Because of this, the hms-notification-log-listener thread in kudu-master starts fetching all HMS notification events. In my local env, there are lots of HMS notification events with large message bodies, and the HMS hits OOM while serving these requests. So HmsNotificationLogListenerTask::Poll() never succeeds and keeps polling events from id -1. The master logs are attached: [^kudu-master.quanlong-OptiPlex-BJ.quanlong.log.INFO.20250309-094427.22679]
>
> Due to this, creating managed tables in Kudu fails with the following error. IMPALA-13846 is an example.
> {code}
> failed to wait for Hive Metastore notification log listener to catch up: failed to retrieve notification log events: failed to get Hive Metastore next notification: No more data to read.
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)