Currently, two Catalog implementations use LockManager in their code: HadoopCatalog and GlueCatalog.
For GlueCatalog, since I do not use it much, I have not evaluated it.
For HadoopCatalog, it works fine without LockManager and still guarantees atomicity of commits. LockManager is currently in that code simply because a previous author added it. We can remove HadoopCatalog's use of LockManager if needed.

At 2024-07-13 04:52:34, "Amogh Jahagirdar" <2am...@gmail.com> wrote:

I have pretty similar concerns as Ryan. I don't think we should be adding any 
new locking implementations to the project since at the moment the only catalog 
which requires it for atomic operations is HadoopCatalog (and it doesn't even 
completely address all the cases). Every locking implementation that gets added 
to the project pulls in more dependencies and is another dimension of 
complexity. There were previous discussions in community syncs to deprecate the 
HadoopCatalog, but if there's a strong desire for retaining it in the project, 
then a separate module is another approach to isolate that complexity.


Thanks,


Amogh Jahagirdar


On Fri, Jul 12, 2024 at 12:08 PM Ryan Blue <b...@databricks.com.invalid> wrote:

I think one of the main questions is whether we want to support locking 
strategies moving forward. These were needed in early catalogs that didn't have 
support for atomic operations (HadoopCatalog and GlueCatalog). Now, Glue 
supports atomic commits, and we have been discouraging the use of HadoopCatalog, 
a purely filesystem-based implementation, for a long time.


One thing to consider is that external locking does solve a few of the 
challenges of the filesystem-based approach, but it doesn't help with many of the 
shortcomings of the HadoopCatalog, such as the inability to atomically delete or 
rename a table. (Operations that are very commonly used in data engineering!)


Maybe we should consider moving Hadoop* classes into a separate iceberg-hadoop 
module, along with the LockManager to make it work somewhat better. Personally, 
I'd prefer deprecating HadoopCatalog and HadoopTableOperations because of their 
serious limitations. But moving these into a separate module seems like a good 
compromise. That would also avoid needing to add dependencies to core, like 
Redis for lock implementations.


Ryan


On Thu, Jul 11, 2024 at 10:42 PM lisoda <lis...@yeah.net> wrote:

Currently, the only LockManager implementation in iceberg-core is 
InMemoryLockManager. This PR adds two LockManager implementations: one based 
on Redis, and another based on a REST API.
In general, a Redis-based LockManager is sufficient for most scenarios; where 
Redis cannot meet a user's requirements, the user can instead provide a REST API 
service that implements the locking. I believe that, for a long time, these two 
lock managers will satisfy most users' needs.
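The contract such implementations must satisfy boils down to conditional acquire and release on a lock key. Below is a minimal in-memory sketch of that contract (the class and method names are illustrative, not the PR's actual classes); a Redis-backed version would replace the map with `SET <entityId> <ownerId> NX PX <ttl>` and a compare-and-delete on release:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: maps a lock key (e.g. a table identifier) to the
// owner currently holding it. A Redis implementation would get the same
// semantics from SET ... NX PX (acquire) and a compare-and-delete (release).
class InMemoryLockSketch {
    private final Map<String, String> locks = new ConcurrentHashMap<>();

    // Succeeds only if the lock is free, or already held by this owner.
    boolean acquire(String entityId, String ownerId) {
        return ownerId.equals(locks.computeIfAbsent(entityId, k -> ownerId));
    }

    // Releases only if the caller is the current owner (compare-and-delete).
    boolean release(String entityId, String ownerId) {
        return locks.remove(entityId, ownerId);
    }

    public static void main(String[] args) {
        InMemoryLockSketch lm = new InMemoryLockSketch();
        System.out.println(lm.acquire("db.table", "writer-1")); // true
        System.out.println(lm.acquire("db.table", "writer-2")); // false: held by writer-1
        System.out.println(lm.release("db.table", "writer-2")); // false: not the owner
        System.out.println(lm.release("db.table", "writer-1")); // true
    }
}
```

The release-only-if-owner check matters in practice: with a TTL-based lock, a slow writer whose lock expired must not be able to delete a lock that a different writer has since acquired.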


If someone could review this PR, that would be great.


PR: https://github.com/apache/iceberg/pull/10688
SLACK: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720761992982729




--

Ryan Blue
Databricks
