I also agree with moving the Hadoop-related parts into a separate module. Incidentally, if the filesystem supports concurrency control and atomic operations, wouldn't it be nice to implement an abstract filesystem-based catalog? That way we could quickly build a production-ready filesystem-based catalog with minimal dependencies, one that supports data sharing and large metadata tables very well. Batch-processing scenarios also work well with it.
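For readers of the archive, a minimal sketch of the commit primitive such a catalog would rely on. This illustrates the general atomic-publish pattern only, not Iceberg's actual HadoopTableOperations code; the class and method names are hypothetical. Creating a hard link is used here because it atomically fails if the destination exists, giving compare-and-swap semantics even on POSIX filesystems (where a plain rename() silently replaces the target); HDFS rename, which fails on an existing destination, provides the same guarantee.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: publish a new metadata version by writing a temp
// file and atomically linking it into place. Exactly one of two racing
// committers can create v{N}.metadata.json; the loser must retry on v{N+1}.
public final class AtomicCommitSketch {

  // Returns true if this writer won the race to create version `nextVersion`.
  static boolean commitVersion(Path tableDir, int nextVersion, byte[] metadataJson)
      throws IOException {
    Path tmp = Files.createTempFile(tableDir, "metadata-", ".json.tmp");
    Files.write(tmp, metadataJson);
    Path target = tableDir.resolve("v" + nextVersion + ".metadata.json");
    try {
      // Atomic: throws FileAlreadyExistsException if `target` already exists.
      Files.createLink(target, tmp);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false; // another committer published this version first
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```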
We've been working hard on the design of this, and our implementation already handles some of the atomicity and concurrency problems of filesystem catalogs. Refactoring HadoopTableOperations is part of our feedback to the community. So far, feedback from our users has been very positive; they think managing catalogs directly on the filesystem is a great feature. It has almost no additional dependencies, it handles large metadata tables very well, and it is heavily used by our users. Given how well these implementations have worked for them, a filesystem-based catalog is actually a very attractive feature. I wonder why the community doesn't want to continue developing it, and instead always wants to delete it?

On 2024-07-13 02:07:42, "Ryan Blue" <b...@databricks.com.INVALID> wrote:

I think one of the main questions is whether we want to support locking strategies moving forward. These were needed in early catalogs that didn't have support for atomic operations (HadoopCatalog and GlueCatalog). Now Glue supports atomic commits, and we have been discouraging the use of HadoopCatalog, which is a purely filesystem-based implementation, for a long time.

One thing to consider is that external locking does solve a few of the challenges of the filesystem-based approach, but it doesn't help with many of the shortcomings of the HadoopCatalog, like being able to atomically delete or rename a table (operations that are very commonly used in data engineering!).

Maybe we should consider moving the Hadoop* classes into a separate iceberg-hadoop module, along with the LockManager, to make it work somewhat better. Personally, I'd prefer deprecating HadoopCatalog and HadoopTableOperations because of their serious limitations, but moving them into a separate module seems like a good compromise. That would also avoid needing to add dependencies to core, like Redis for lock implementations.

Ryan

On Thu, Jul 11, 2024 at 10:42 PM lisoda <lis...@yeah.net> wrote:

Currently, the only LockManager implementation in iceberg-core is InMemoryLockManager. This PR adds two LockManager implementations: one based on Redis, and another based on a REST API. In general, the RedisLockManager is sufficient for most scenarios; where Redis cannot meet a user's requirements, the user can provide a REST API service to implement the same function. I believe that, for a long time, these two lock managers will satisfy most customers' needs. If someone could review this PR, that would be great.

PR: https://github.com/apache/iceberg/pull/10688
SLACK: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720761992982729

--
Ryan Blue
Databricks
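As a footnote for readers of the archive: the locking approach the quoted PR describes is, at its core, the classic Redis SET NX PX lock. Below is a minimal sketch of that technique using the Jedis client; it is an illustration under assumptions, not the code from PR #10688, and the class name, key prefix, and TTL handling are made up for the example.

```java
import java.util.Collections;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Illustrative Redis-backed lock (not the actual PR implementation).
// acquire(): SET key owner NX PX ttl -- succeeds only if the key is absent.
// release(): delete the key only if we still own it, via a Lua script, so a
// lock that expired and was taken by someone else is never released by us.
public class RedisLockSketch {
  private static final String RELEASE_SCRIPT =
      "if redis.call('get', KEYS[1]) == ARGV[1] then "
          + "return redis.call('del', KEYS[1]) else return 0 end";

  private final Jedis jedis;
  private final long ttlMs;

  public RedisLockSketch(Jedis jedis, long ttlMs) {
    this.jedis = jedis;
    this.ttlMs = ttlMs;
  }

  /** Try to lock entityId for ownerId; returns true if the lock was acquired. */
  public boolean acquire(String entityId, String ownerId) {
    String result = jedis.set("lock:" + entityId, ownerId,
        SetParams.setParams().nx().px(ttlMs));
    return "OK".equals(result);
  }

  /** Release the lock only if ownerId still holds it. */
  public boolean release(String entityId, String ownerId) {
    Object result = jedis.eval(RELEASE_SCRIPT,
        Collections.singletonList("lock:" + entityId),
        Collections.singletonList(ownerId));
    return Long.valueOf(1L).equals(result);
  }
}
```

A production lock manager would typically also renew (heartbeat) the TTL while the lock is held, so a long-running commit doesn't lose its lock mid-operation.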