Yes, I think so. I also have a small suggestion: if the default table format in Hive changes to a data lake format (Iceberg, Paimon, etc.), should we also work to reduce the complexity of deploying HMS? For example, when HMS manages a data lake, it could rely on the file system alone.
Why am I considering this? Currently, HMS (Hive Metastore) often relies on the NOTIFICATION_LOG table to synchronize metadata. With a data lake format such as Iceberg or Paimon, however, the metadata is already stored within the tables themselves, which eliminates the need for that synchronization mechanism. If the data lake already handles metadata management, does HMS still need to maintain such a complex metadata management model? In real-world production environments, users often connect HMS to heavily modified or unconventional database versions, leading to all sorts of strange and unpredictable issues.

So, if HMS is used for data lake management, could we design it so that the file system alone handles basic table management, with HMS providing only supplementary capabilities for advanced governance? For example: HMS + FileSystemCatalogSDK (which works on most file systems). Assume Hive provides such an SDK, implementing data lake management that relies only on the file system. The benefits would be:

1. Message queues are currently being integrated with the data lake. Rather than integrating all kinds of exotic third-party catalogs, a message queue's maintenance team would certainly prefer the FileSystemCatalogSDK, because it has the lowest cost and introduces no third-party dependencies that could cause conflicts.

2. Users could switch catalogs quickly. Suppose a user previously used the combination of REST + FileSystemCatalogSDK and now wants HMS to manage the data lake. With some simple configuration in HMS, HMS could quickly take over the original data lake tables, because the metadata is just files that any of the SDKs can recognize.

3. We might redefine the metadata management specification standard of the data lake.
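To make the idea concrete, here is a minimal, hypothetical sketch of what such a FileSystemCatalogSDK could look like. Everything here is illustrative: the class name, method names, and JSON layout are invented for this email, and the directory structure and version-hint.text pointer file are borrowed loosely from Iceberg's HadoopCatalog convention, not from any existing Hive API.

```python
import json
import os
import tempfile
import uuid


class FileSystemCatalog:
    """Hypothetical sketch: every table lives under
    <warehouse>/<db>/<table>/metadata/ and is fully described by files,
    so any engine that can read the file system can discover it."""

    def __init__(self, warehouse):
        self.warehouse = warehouse

    def _meta_dir(self, db, table):
        return os.path.join(self.warehouse, db, table, "metadata")

    def create_table(self, db, table, schema):
        meta_dir = self._meta_dir(db, table)
        os.makedirs(meta_dir, exist_ok=True)
        self._commit(meta_dir, expected_version=0,
                     metadata={"schema": schema, "snapshots": []})

    def _commit(self, meta_dir, expected_version, metadata):
        # Write the new metadata file, then swing the pointer.
        new_version = expected_version + 1
        meta_file = os.path.join(meta_dir, f"v{new_version}.metadata.json")
        with open(meta_file, "w") as f:
            json.dump(metadata, f)
        # An atomic rename stands in for a compare-and-swap on the pointer:
        # no client-side distributed lock is needed as long as the file
        # system offers atomic rename (HDFS does; plain S3 needs extra care).
        tmp = os.path.join(meta_dir, f".tmp-{uuid.uuid4()}")
        with open(tmp, "w") as f:
            f.write(str(new_version))
        os.replace(tmp, os.path.join(meta_dir, "version-hint.text"))

    def load_table(self, db, table):
        # Resolve the current version from the pointer file, then read it.
        meta_dir = self._meta_dir(db, table)
        with open(os.path.join(meta_dir, "version-hint.text")) as f:
            version = int(f.read().strip())
        with open(os.path.join(meta_dir, f"v{version}.metadata.json")) as f:
            return json.load(f)


# Demo: create and reload a table in a temporary warehouse.
warehouse = tempfile.mkdtemp()
catalog = FileSystemCatalog(warehouse)
catalog.create_table("db1", "events", {"id": "long", "ts": "timestamp"})
meta = catalog.load_table("db1", "events")
```

The point of the sketch is the commit step: because the pointer swing is an atomic file system operation, a catalog built this way needs no client-side distributed lock of the kind GlueCatalog adds, and any other SDK or engine can take over the table simply by reading the same files.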
After all, the REST catalog can only let data lake tables exist in the computing engine as second-class citizens (with limited high-level functions, and unable to benefit from some of the optimizations the engine applies to internal tables). By extending and staying compatible with the FileSystemCatalogSDK, a computing engine could make data lake tables first-class citizens.

It would also help us avoid poor catalog implementations. For example, Iceberg's GlueCatalog adds a distributed lock on the client side, which is a very poor design. If the FileSystemCatalogSDK provided by Hive can handle catalog management, GlueCatalog itself wouldn't need to do anything.

There are already several attempts in the industry moving in this direction:
https://olympiaformat.org/
https://docs.google.com/presentation/d/1Y5tlHcB_ViqDDE-656ciFy75UJkOrgps
https://slatedb.io

We'd like to hear the Hive community's thoughts on this research direction. That's all. Thanks.
Lisoda.

At 2025-04-14 20:07:05, "Shohei Okumiya" <oku...@apache.org> wrote:
>Hi,
>
>I'm thrilled to see various opinions in this thread! I respect Ayush
>for initiating the discussion with the brave proposal and am proud of
>all the community members here.
>
>I am also aware of one interesting point of this thread: we believe in
>the potential coverage of Apache Hive. Although the original proposal
>is very simple, some people mentioned the new features of Hive
>Metastore, some were concerned about the lack of some maintenance
>features, some wanted to know the performance of Iceberg tables, and
>some pushed integration with external catalogs. The sequence of
>comments here sounds unique to Hive.
>
>As the discussion inspired me, I also tried to draw one vision. My
>question: Can Hive be an Operating System or DBMS for the Open Table
>Format or Data Lakehouse?
>https://docs.google.com/document/d/1tKFmsjYeGlMQjvJ7QQDNcS5wvqrcHRvsDjbJlcHb7Gk/edit?usp=sharing
>
>The above document also summarizes diverse topics in this thread.
>Please read it if you're interested, and please feel free to comment
>on it.
>
>Lastly, I apologize for throwing in one more divergent reply.
>
>Regards,
>Okumin
>
>On Sun, Apr 13, 2025 at 1:32 AM Stamatis Zampetakis <zabe...@gmail.com> wrote:
>>
>> Iceberg gets a lot of traction and the integration with Hive becomes more
>> and more mature so it makes sense to start the discussion about making it as
>> the default choice.
>>
>> However, I feel that it may be a bit too soon to do the switch right now.
>> Apart from performance numbers our Iceberg test coverage is rather limited
>> currently in Hive. The vast majority of tests are running using other
>> formats so before making it the default maybe we should first try to migrate
>> the tests to use that.
>>
>> Moreover, the choice of a default format is tricky and varies from one use
>> case to the other so I am not sure if there exists one that overpowers the
>> rest in every aspect. For instance many people believed that adopting ACID
>> tables for everything was a good idea but soon after users started migrating
>> their workloads to ACID we started hitting many performance and scalability
>> challenges. The same reasoning applies for choices/debates between ORC,
>> Parquet, etc.
>>
>> All in all, I wouldn't push very hard for one particular format and I would
>> prefer to leave the choice to the end-user who knows best their use case.
>> Having said that I am willing to follow and support the decision of the
>> community especially those people who contributed significantly to the
>> Iceberg integration.
>>
>> Best,
>> Stamatis
>>
>>
>> On Wed, Apr 9, 2025, 4:08 PM Denys Kuzmenko <dkuzme...@apache.org> wrote:
>>>
>>> Hi,
>>>
>>> I'm a bit hesitant switching to Iceberg as the default atm. I lean more
>>> toward setting the default table format at the database level instead.
>>>
>>> Hive Iceberg currently lacks automatic table maintenance, comprehensive
>>> support for partition-level statistics, and various partition-aware
>>> optimizations (see HIVE-28410)
>>>
>>> Moreover, we haven't conducted any performance testing so far. It would be
>>> helpful to first assess where we currently stand before making a final
>>> decision.
>>>
>>> Regards,
>>> Denys
>>>