Previous summary: [HIVE-28373] Iceberg: Refactor the code of HadoopTableOperations (ASF JIRA); [HIVE-28621] Re-implement a catalog based on file system for metadata management (ASF JIRA)
Background Story:

With the launch of Hive 4, we were eager to upgrade our existing Hive deployment to gain new features and community support. Unfortunately, Hive 3 not only manages all of our foundational data; a number of peripheral systems, such as Spark, MPP engines, and Trino, also depend on it, and these systems are currently compatible essentially only with Hive 3. Upgrading Hive in the conventional way is therefore extremely difficult, and we also had to consider the potential data-corruption issues an upgrade might introduce. Faced with this series of problems, upgrading Hive seemed to be an impossible task.

Analyzing the situation, we drew some conclusions:
1. Traditionally, user data is bound to the engine, so significant adjustments to the compute engine can lead to unintended data corruption.
2. Because data is forcibly bound to the engine, switching engines is very difficult when the engine can no longer accommodate the user's workload.
3. If the current engine also has peripheral systems, adjusting it becomes all but impossible: a minor change can cause the entire system to malfunction.

All of these contradictions point to a single issue: data is forcibly bound to the engine. If data existed independently of the engine, we could completely disregard engine upgrades and modifications. We could stand up a new compute engine at any time and use it to read the data, and this whole class of annoying problems would vanish. We could deploy any number of compute engines of different types, whatever users need, and with the cooperation of scheduling systems like YARN/K8S we could essentially eliminate the original engine-maintenance costs. Everything would look great.

After some research, we found that the Iceberg HadoopCatalog basically meets these requirements, so we began migrating all of our foundational data to it. After three months of work, the migration was complete. We then deployed a brand-new Hive 4 and used external tables to read the Iceberg tables. Our services migrated smoothly, and after four months we had completely resolved the upgrade of the historical system to Hive 4 and fully decoupled the engine from the data.

However, something unexpected happened in the Iceberg community: when I tried to submit a patch to fix the HadoopCatalog, I was told that the community does not plan to continue maintaining it and intends to remove it in Iceberg 2.0. This proposal sparked widespread discussion within the community. Reviewing the community's mailing list, we found that many users actually have the same demand: they all want to manage Iceberg without relying on any superfluous infrastructure. As a compromise, iceberg-rust retained the static table (equivalent to a read-only HadoopCatalog, which does not allow table updates), and proposals related to FileSystemCatalog features, such as enhancing multi-engine interoperability, keep appearing in the community.
Examples: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0 (Apache Mail Archives); Storing catalog directly on object store (Apache Mail Archives); [Proposal] Replicating version-hint onto the file system (Apache Mail Archives); trinitylake-io/trinitylake: Open LakeHouse Format for Big Data Analytics, ML & AI.

However, the dominant direction of the Iceberg community seems to be different: they prefer the rest-catalog. But the rest-catalog still requires actual storage behind it; far from reducing users' infrastructure-maintenance costs, it actually increases the complexity of the system. The rest-catalog does have its advantages, but I find it hard to convince myself to choose it without necessity. Against this backdrop, we want to make the FileSystemCatalog better and more aligned with the real demands of users. Since Hive makes extensive use of HadoopCatalog-related features to implement something similar to external tables, we believe it is at least necessary to introduce a better FileSystemCatalog in Hive.

The focus of the issue:
1. The traditional fs-catalog uses rename operations to ensure the reliability of commits, but object storage and POSIX-protocol file systems do not support exclusive rename operations. (This is the main reason the Iceberg community believes the fs-catalog cannot be implemented well.)
2. All catalogs actually face the issue described in HIVE-28366. (I don't believe the current implementations of the rest-catalog/jdbc-catalog or other catalogs solve this problem well.) I believe Denys Kuzmenko has a lot of experience with this issue.
3. Multi-table transactions are not well supported at present.

Solutions to the issues:

For issue 1, we have developed a prototype that uses appendFile + a limited-range list + a two-phase commit to achieve reliable commits on all common file systems. The file system only needs to guarantee that files become visible to all clients immediately after writing, and to support listing files. Although many users are not keen on list operations in object storage, in reality, if the result of the list operation only has a few dozen entries, the cost is not as high as imagined. (A rough sketch of what such a protocol could look like is attached at the end of this message.)

For issue 2, this is a painful problem, and we believe this part requires discussion. We are currently inclined to restrict the version numbers that clients can commit; for example, a client may not commit versions greater than 5 before 8 a.m. tomorrow. (We might also need to introduce a scheme similar to read-write locks? This part requires detailed discussion. A tiny sketch of the rule is also attached below.)

For issue 3, perhaps we need to consolidate all table metadata into the same file, so that all of our commit operations target the same metadata.json. In that case it would be very easy to support multi-table transactions. (Again, see the sketch below.)

I would like to know how the Hive community views this issue. Looking forward to your reply!

-Lisoda
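P.S. To make the discussion more concrete, here is a minimal sketch of how the appendFile + limited-range-list + two-phase-commit idea could look on top of the Hadoop FileSystem API. Everything here is hypothetical: the class name, the file layout (v<N>-<writerId>.metadata.json candidates plus a shared commit-log file), and the conflict-resolution rule are illustrative assumptions, not the actual prototype. It assumes the file system serializes appends (e.g. via a single-writer lease, as HDFS does) and makes written data immediately visible to all clients.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoPhaseFsCommitSketch {
  private final FileSystem fs;
  private final Path metadataDir;
  private final Path commitLog;

  public TwoPhaseFsCommitSketch(FileSystem fs, Path metadataDir) throws IOException {
    this.fs = fs;
    this.metadataDir = metadataDir;
    this.commitLog = new Path(metadataDir, "commit-log");
    fs.createNewFile(commitLog);  // no-op if the log already exists
  }

  /** Phase 1: stage candidate metadata under a unique, conflict-free name. */
  public Path prepare(long version, String writerId, byte[] metadataJson) throws IOException {
    Path candidate = new Path(metadataDir,
        String.format("v%d-%s.metadata.json", version, writerId));
    try (FSDataOutputStream out = fs.create(candidate, false /* never overwrite */)) {
      out.write(metadataJson);
    }
    return candidate;
  }

  /**
   * Phase 2: append one commit record. Assuming appends are serialized by the
   * file system, records land in a total order, so the earliest record for a
   * given version decides the winner; losers must retry at the next version.
   */
  public boolean commit(long version, String writerId) throws IOException {
    try (FSDataOutputStream out = fs.append(commitLog)) {
      out.write((version + "," + writerId + "\n").getBytes(StandardCharsets.UTF_8));
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(commitLog), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] rec = line.split(",", 2);
        if (Long.parseLong(rec[0]) == version) {
          return rec[1].equals(writerId);  // first record for this version wins
        }
      }
    }
    return false;  // our record was not found: treat as a failed commit and retry
  }

  /** "Limited-range list": scan only candidate names at or above a version hint. */
  public long latestPreparedVersion(long hint) throws IOException {
    FileStatus[] candidates = fs.listStatus(metadataDir,
        p -> p.getName().matches("v\\d+-.*\\.metadata\\.json")
            && versionOf(p.getName()) >= hint);  // a few dozen entries at most
    long latest = hint;
    for (FileStatus s : candidates) {
      latest = Math.max(latest, versionOf(s.getPath().getName()));
    }
    return latest;
  }

  private static long versionOf(String name) {
    return Long.parseLong(name.substring(1, name.indexOf('-')));
  }
}
{code}

The key design point is that nothing relies on exclusive rename: candidates are written under unique names, so creation never conflicts, and the append of a single commit record is the only commit point.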
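For the issue-2 idea, here is an equally rough sketch of the "commit window" rule; again, all names are made up for illustration. The idea is that some coordinator grants a client a version ceiling together with an expiry time, and the client checks the window before attempting a commit; after expiry, a new window must be obtained.

{code:java}
import java.time.Instant;

/**
 * Hypothetical "commit window": a client may only commit versions up to a
 * granted ceiling, and only until the window expires (e.g. "no versions
 * greater than 5 before 8 a.m. tomorrow").
 */
public final class CommitWindow {
  private final long versionCeiling;  // e.g. 5
  private final Instant expiresAt;    // e.g. tomorrow, 8 a.m.

  public CommitWindow(long versionCeiling, Instant expiresAt) {
    this.versionCeiling = versionCeiling;
    this.expiresAt = expiresAt;
  }

  /** True if committing `version` at time `now` is allowed under this grant. */
  public boolean permits(long version, Instant now) {
    return version <= versionCeiling && now.isBefore(expiresAt);
  }
}
{code}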
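And for issue 3, the consolidation idea could be as simple as a single catalog-level document that records the current metadata location of every table; the record below is purely illustrative.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical consolidated catalog document: one file holds the current
 * metadata location of every table, so a single commit of this file can
 * update several tables at once (i.e. a multi-table transaction).
 */
record CatalogSnapshot(long version, Map<String, String> tableMetadataLocations) {

  /** Next snapshot with several tables updated in one step. */
  CatalogSnapshot commitAll(Map<String, String> updatedLocations) {
    Map<String, String> next = new HashMap<>(tableMetadataLocations);
    next.putAll(updatedLocations);  // every change lands in the same document
    return new CatalogSnapshot(version + 1, Map.copyOf(next));
  }
}
{code}

Serializing such a snapshot and committing it through the two-phase protocol sketched above would give multi-table atomicity essentially for free, at the cost of turning the single catalog file into a point of contention.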