Previous summary: [HIVE-28373] Iceberg: Refactor the code of HadoopTableOperations (ASF JIRA); [HIVE-28621] Re-implement a catalog based on file system for metadata management (ASF JIRA)
Background Story:

With the launch of Hive 4, we were eager to upgrade our existing Hive deployment to gain new features and community support. Unfortunately, Hive 3 not only manages all of our foundational data; a number of peripheral systems, such as Spark, MPP engines, and Trino, also depend on it, and these systems are currently compatible essentially only with Hive 3. Upgrading Hive in the conventional way is therefore extremely difficult, and we also had to consider the potential data-corruption issues an upgrade might introduce. Faced with this series of problems, upgrading Hive seemed to be an impossible task.

Analyzing the situation, we drew some conclusions:
1. Traditionally, user data is bound to the engine, so significant adjustments to the compute engine can lead to unintended data corruption.
2. Because data is forcibly bound to the engine, switching engines is very difficult when the engine can no longer accommodate the user's workload.
3. If the current engine also has peripheral systems, adjusting it becomes all but impossible: a minor change can cause the entire system to malfunction.

All of these contradictions point to a single issue: data is forcibly bound to the engine. If data existed independently of the engine, we could completely disregard engine upgrades and modifications. We could stand up a new compute engine at any time and use it to read the data, and this whole class of annoying problems would vanish. We could deploy any number of compute engines of different types, whatever users need, and with the cooperation of scheduling systems like YARN/K8S we could essentially eliminate the original engine-maintenance costs. Everything would look great.

After some research, we found that the Iceberg HadoopCatalog basically meets these requirements, so we began migrating all of our foundational data to it. After three months of work, the migration was complete. We then deployed a brand-new Hive 4 and used external tables to read the Iceberg tables. Our services migrated smoothly, and after four months we had completely resolved the upgrade of the historical system to Hive 4 and fully decoupled the engine from the data.

However, something unexpected happened in the Iceberg community: when I tried to submit a patch to fix the HadoopCatalog, I was told that the community does not plan to continue maintaining it and intends to remove it in Iceberg 2.0. This proposal sparked widespread discussion within the community. Reviewing the community's mailing list, we found that many users actually have the same demand: they all want to manage Iceberg without relying on any superfluous infrastructure. As a compromise, iceberg-rust retained the static table (equivalent to a read-only HadoopCatalog, which does not allow table updates), and proposals related to FileSystemCatalog features, such as enhancing multi-engine interoperability, keep appearing in the community.
Examples: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0 (Apache Mail Archives); Storing catalog directly on object store (Apache Mail Archives); [Proposal] Replicating version-hint onto the file system (Apache Mail Archives); trinitylake-io/trinitylake: Open LakeHouse Format for Big Data Analytics, ML & AI.

However, the dominant direction of the Iceberg community seems to be different: they prefer the rest-catalog. But the rest-catalog still requires actual storage behind it; far from reducing users' infrastructure-maintenance costs, it actually increases the complexity of the system. The rest-catalog does have its advantages, but I find it hard to convince myself to choose it without necessity. Against this backdrop, we want to make the FileSystemCatalog better and more aligned with the real demands of users. Since Hive makes extensive use of HadoopCatalog-related features to implement something similar to external tables, we believe it is at least necessary to introduce a better FileSystemCatalog in Hive.

The focus of the issue:
1. The traditional fs-catalog uses rename operations to ensure the reliability of commits, but object storage and POSIX-protocol file systems do not support exclusive rename operations. (This is the main reason the Iceberg community believes the fs-catalog cannot be implemented well.)
2. All catalogs actually face the issue described in HIVE-28366. (I don't believe the current implementations of the rest-catalog/jdbc-catalog or other catalogs solve this problem well.) I believe Denys Kuzmenko has a lot of experience with this issue.
3. Multi-table transactions are not well supported at present.

Solutions to the issues:

For issue 1, we have developed a prototype that uses appendFile + a limited-range list + a two-phase commit to achieve reliable commits on all common file systems. The file system only needs to guarantee that files become visible to all clients immediately after writing, and to support listing files. Although many users are not keen on list operations in object storage, in reality, if the result of the list operation only has a few dozen entries, the cost is not as high as imagined. (A rough sketch of what such a protocol could look like is attached at the end of this message.)

For issue 2, this is a painful problem, and we believe this part requires discussion. We are currently inclined to restrict the version numbers that clients can commit; for example, a client may not commit versions greater than 5 before 8 a.m. tomorrow. (We might also need to introduce a scheme similar to read-write locks? This part requires detailed discussion. A tiny sketch of the rule is also attached below.)

For issue 3, perhaps we need to consolidate all table metadata into the same file, so that all of our commit operations target the same metadata.json. In that case it would be very easy to support multi-table transactions. (Again, see the sketch below.)

I would like to know how the Hive community views this issue. Looking forward to your reply!

-Lisoda
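P.S. To make the discussion more concrete, here is a minimal sketch of how the appendFile + limited-range-list + two-phase-commit idea could look on top of the Hadoop FileSystem API. Everything here is hypothetical: the class name, the file layout (v<N>-<writerId>.metadata.json candidates plus a shared commit-log file), and the conflict-resolution rule are illustrative assumptions, not the actual prototype. It assumes the file system serializes appends (e.g. via a single-writer lease, as HDFS does) and makes written data immediately visible to all clients.

{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoPhaseFsCommitSketch {
  private final FileSystem fs;
  private final Path metadataDir;
  private final Path commitLog;

  public TwoPhaseFsCommitSketch(FileSystem fs, Path metadataDir) throws IOException {
    this.fs = fs;
    this.metadataDir = metadataDir;
    this.commitLog = new Path(metadataDir, "commit-log");
    fs.createNewFile(commitLog);  // no-op if the log already exists
  }

  /** Phase 1: stage candidate metadata under a unique, conflict-free name. */
  public Path prepare(long version, String writerId, byte[] metadataJson) throws IOException {
    Path candidate = new Path(metadataDir,
        String.format("v%d-%s.metadata.json", version, writerId));
    try (FSDataOutputStream out = fs.create(candidate, false /* never overwrite */)) {
      out.write(metadataJson);
    }
    return candidate;
  }

  /**
   * Phase 2: append one commit record. Assuming appends are serialized by the
   * file system, records land in a total order, so the earliest record for a
   * given version decides the winner; losers must retry at the next version.
   */
  public boolean commit(long version, String writerId) throws IOException {
    try (FSDataOutputStream out = fs.append(commitLog)) {
      out.write((version + "," + writerId + "\n").getBytes(StandardCharsets.UTF_8));
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(commitLog), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] rec = line.split(",", 2);
        if (Long.parseLong(rec[0]) == version) {
          return rec[1].equals(writerId);  // first record for this version wins
        }
      }
    }
    return false;  // our record was not found: treat as a failed commit and retry
  }

  /** "Limited-range list": scan only candidate names at or above a version hint. */
  public long latestPreparedVersion(long hint) throws IOException {
    FileStatus[] candidates = fs.listStatus(metadataDir,
        p -> p.getName().matches("v\\d+-.*\\.metadata\\.json")
            && versionOf(p.getName()) >= hint);  // a few dozen entries at most
    long latest = hint;
    for (FileStatus s : candidates) {
      latest = Math.max(latest, versionOf(s.getPath().getName()));
    }
    return latest;
  }

  private static long versionOf(String name) {
    return Long.parseLong(name.substring(1, name.indexOf('-')));
  }
}
{code}

The key design point is that nothing relies on exclusive rename: candidates are written under unique names, so creation never conflicts, and the append of a single commit record is the only commit point.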
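For the issue-2 idea, here is an equally rough sketch of the "commit window" rule; again, all names are made up for illustration. The idea is that some coordinator grants a client a version ceiling together with an expiry time, and the client checks the window before attempting a commit; after expiry, a new window must be obtained.

{code:java}
import java.time.Instant;

/**
 * Hypothetical "commit window": a client may only commit versions up to a
 * granted ceiling, and only until the window expires (e.g. "no versions
 * greater than 5 before 8 a.m. tomorrow").
 */
public final class CommitWindow {
  private final long versionCeiling;  // e.g. 5
  private final Instant expiresAt;    // e.g. tomorrow, 8 a.m.

  public CommitWindow(long versionCeiling, Instant expiresAt) {
    this.versionCeiling = versionCeiling;
    this.expiresAt = expiresAt;
  }

  /** True if committing `version` at time `now` is allowed under this grant. */
  public boolean permits(long version, Instant now) {
    return version <= versionCeiling && now.isBefore(expiresAt);
  }
}
{code}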
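And for issue 3, the consolidation idea could be as simple as a single catalog-level document that records the current metadata location of every table; the record below is purely illustrative.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical consolidated catalog document: one file holds the current
 * metadata location of every table, so a single commit of this file can
 * update several tables at once (i.e. a multi-table transaction).
 */
record CatalogSnapshot(long version, Map<String, String> tableMetadataLocations) {

  /** Next snapshot with several tables updated in one step. */
  CatalogSnapshot commitAll(Map<String, String> updatedLocations) {
    Map<String, String> next = new HashMap<>(tableMetadataLocations);
    next.putAll(updatedLocations);  // every change lands in the same document
    return new CatalogSnapshot(version + 1, Map.copyOf(next));
  }
}
{code}

Serializing such a snapshot and committing it through the two-phase protocol sketched above would give multi-table atomicity essentially for free, at the cost of turning the single catalog file into a point of contention.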