I also agree with moving the Hadoop-related parts to a separate 
module. Incidentally, if the filesystem supports concurrency control and atomic 
operations, wouldn't it be nice to implement an abstract filesystem-based 
catalog? We could then quickly build a production-ready filesystem-based 
catalog with minimal dependencies, one that handles data sharing and large 
metadata tables very well. Batch processing scenarios also work well.
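
As a rough illustration, the whole commit primitive such a catalog needs is a 
check-and-swap on the metadata pointer. A minimal sketch using the Hadoop 
FileSystem API (the vN.metadata.json layout mirrors HadoopTableOperations, but 
this is illustrative, not the actual implementation):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCommitSketch {
  // Write the new metadata to a temp path first, then rename it into place.
  // On HDFS the rename is atomic and fails if the destination already exists,
  // so exactly one concurrent committer can win the race.
  public static boolean commit(FileSystem fs, Path tableDir, int nextVersion, Path tmpMetadata)
      throws IOException {
    Path finalMetadata = new Path(tableDir, "v" + nextVersion + ".metadata.json");
    return fs.rename(tmpMetadata, finalMetadata);
  }
}

On object stores without an atomic rename (e.g. S3), this primitive does not 
exist, which is exactly where the atomicity problems come from.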


We've been working hard on the design of this, and we currently handle some of 
the atomicity and concurrency problems in filesystem catalogs. Refactoring 
HadoopTableOperations is part of our feedback to the community. So far, the 
feedback from our users has been very positive: they think managing catalogs 
directly through the filesystem is a great feature. It has almost no additional 
dependencies, it works very well for manipulating large metadata tables, and it 
is heavily used by our users.
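
For context, here is roughly how a custom LockManager plugs into HadoopCatalog 
today through the standard lock-impl catalog property. A minimal sketch, with a 
placeholder warehouse path and a hypothetical lock class name:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.hadoop.HadoopCatalog;

public class CatalogSetupSketch {
  public static HadoopCatalog build() {
    HadoopCatalog catalog = new HadoopCatalog();
    catalog.setConf(new Configuration());
    catalog.initialize("fs_catalog", Map.of(
        CatalogProperties.WAREHOUSE_LOCATION, "hdfs://namenode:8020/warehouse",
        // the catalog loads this class dynamically; the table operations then
        // guard their commit-time rename with acquire()/release() on it
        CatalogProperties.LOCK_IMPL, "com.example.RedisLockManager"));
    return catalog;
  }
}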


Based on the success of these implementations with our users, a filesystem-based 
catalog implementation is actually a very attractive feature. I wonder why the 
community doesn't want to continue developing it, and instead always wants to 
delete it.


On 2024-07-13 02:07:42, "Ryan Blue" <b...@databricks.com.INVALID> wrote:

I think one of the main questions is whether we want to support locking 
strategies moving forward. These were needed in early catalogs that didn't have 
support for atomic operations (HadoopCatalog and GlueCatalog). Now, Glue 
supports atomic commits, and we have been discouraging the use of HadoopCatalog, 
a purely filesystem-based implementation, for a long time.
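
To make the trade-off concrete: external locking only serializes committers 
around a read-modify-write sequence that the store itself cannot perform 
atomically. A minimal sketch against the LockManager interface (the 
BooleanSupplier stands in for the actual metadata swap; names are illustrative):

import java.util.UUID;
import java.util.function.BooleanSupplier;
import org.apache.iceberg.LockManager;

public class LockedCommitSketch {
  public static boolean commitWithLock(LockManager locks, String tableId, BooleanSupplier doCommit) {
    String ownerId = UUID.randomUUID().toString(); // unique id per committer
    if (!locks.acquire(tableId, ownerId)) {
      return false; // another writer holds the lock; the caller can retry
    }
    try {
      // the lock protects a sequence the store cannot do atomically:
      // read the current version, write new metadata, swing the pointer
      return doCommit.getAsBoolean();
    } finally {
      locks.release(tableId, ownerId);
    }
  }
}

A catalog whose backing store supports an atomic check-and-swap doesn't need 
this wrapper at all.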


One thing to consider is that external locking does solve a few of the 
challenges of the filesystem-based approach, but it doesn't help with many of 
the shortcomings of the HadoopCatalog, like not being able to atomically delete 
or rename a table. (Operations that are very commonly used in data engineering!)


Maybe we should consider moving Hadoop* classes into a separate iceberg-hadoop 
module, along with the LockManager to make it work somewhat better. Personally, 
I'd prefer deprecating HadoopCatalog and HadoopTableOperations because of their 
serious limitations. But moving these into a separate module seems like a good 
compromise. That would also avoid needing to add dependencies to core, like 
Redis for lock implementations.


Ryan


On Thu, Jul 11, 2024 at 10:42 PM lisoda <lis...@yeah.net> wrote:

Currently, the only LockManager implementation in iceberg-core is 
InMemoryLockManager. This PR adds two more LockManager implementations: one 
based on Redis, and another based on a REST API.
In general, the Redis-based LockManager is sufficient for most scenarios; where 
Redis cannot meet a user's requirements, the user can provide a REST API service 
to implement the same function. I believe that, for a long time to come, these 
two lock managers will satisfy most customers' needs.
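
For anyone who hasn't opened the PR yet, the general shape of a Redis-based 
lock manager is the classic SET NX PX pattern with a Lua compare-and-delete on 
release. A rough sketch only, assuming the Jedis client and a hypothetical 
lock.redis.uri property (this is not the PR's actual code):

import java.util.Collections;
import java.util.Map;
import org.apache.iceberg.LockManager;
import redis.clients.jedis.JedisPooled;
import redis.clients.jedis.params.SetParams;

public class RedisLockManagerSketch implements LockManager {
  // delete the key only if we still own it, atomically, server-side
  private static final String RELEASE_SCRIPT =
      "if redis.call('get', KEYS[1]) == ARGV[1] then "
          + "return redis.call('del', KEYS[1]) else return 0 end";

  private JedisPooled redis;
  private final long ttlMs = 60_000; // expiry guards against crashed owners

  @Override
  public void initialize(Map<String, String> properties) {
    // "lock.redis.uri" is a hypothetical property name for this sketch
    this.redis = new JedisPooled(
        properties.getOrDefault("lock.redis.uri", "redis://localhost:6379"));
  }

  @Override
  public boolean acquire(String entityId, String ownerId) {
    // SET key owner NX PX ttl: succeeds only if nobody else holds the lock
    return "OK".equals(redis.set(entityId, ownerId, SetParams.setParams().nx().px(ttlMs)));
  }

  @Override
  public boolean release(String entityId, String ownerId) {
    Object result = redis.eval(RELEASE_SCRIPT,
        Collections.singletonList(entityId), Collections.singletonList(ownerId));
    return Long.valueOf(1L).equals(result);
  }

  @Override
  public void close() {
    redis.close();
  }
}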


If someone could review this PR, that would be great.


PR: https://github.com/apache/iceberg/pull/10688
SLACK: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720761992982729




--

Ryan Blue
Databricks
