Sir, following this PR we can modify HadoopTableOperations to support atomic commits for filesystem-based catalogs without distributed locks: Core: Refactor the code of HadoopTableOperations by BsoBird · Pull Request #10623 · apache/iceberg (github.com)
On 2024-07-15 00:08:26, "Daniel Weeks" <dwe...@apache.org> wrote:

Lisoda,

Unfortunately, I don't agree with your assessment. The problems with filesystem-based catalog implementations are inherent, and the steps taken to address them are not adequate to give confidence in the implementation.

Commit atomicity is not solved, because it relies on locking, which has a number of fundamental issues that will continue to plague these implementations. Locking distributes the complexity to clients, who all need to participate properly, and issues like lost locks, lock timeouts, clock skew, and lock sequencing/deadlocks (for more complicated commit scenarios) are all introduced as new problems. These are examples of problems that exist in Hive locking today. I don't think we can call this solved.

Beyond that, there is a whole set of operations like CTAS, RTAS, IF [NOT] EXISTS, DROP, RENAME, etc. that are not atomic with this model. These are all important functions in data warehouses, and the more you try to solve them, the more you end up relying on external systems to track table state and metadata, which ends up not being a filesystem-based catalog.

Adding complexity to filesystem catalog implementations by introducing more dependencies to try to address these issues just confuses users by giving the impression that this is a legitimate alternative to a real catalog implementation. I don't think we should be adding functionality here; we should probably deprecate the filesystem catalog, or relocate it to the test codebase like the in-memory catalog.

-Dan

On Sun, Jul 14, 2024 at 4:12 AM lisoda <lis...@yeah.net> wrote:

At present, filesystem-based catalogs have the following problems (this is what I can think of at the moment; it is perhaps not comprehensive):
1. Rename operations are not supported.
2. Commits are not atomic.
3. Atomic delete is not supported (sorry, I don't understand why we need this; what scenarios require it?).
4. Some features are missing, such as support for views.
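Daniel's "lost locks / lock timeouts" point can be made concrete with a toy simulation (Python, purely illustrative; none of these names are Iceberg or Hive APIs): a TTL-based lock lease expires while its holder is paused (say, a long GC pause), a second client acquires the lock, and both clients then proceed as if they hold it.

```python
class TtlLock:
    """Toy TTL lock: owner and expiry tracked in one process (stands in for Redis/Hive locks)."""
    def __init__(self):
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner, ttl, now):
        # The lock is free if it was never taken or the previous lease expired.
        if self.owner is None or now >= self.expires_at:
            self.owner = owner
            self.expires_at = now + ttl
            return True
        return False

    def held_by(self, owner, now):
        return self.owner == owner and now < self.expires_at

lock = TtlLock()
assert lock.acquire("client-A", ttl=5.0, now=0.0)   # A takes a 5-second lease
# A stalls (GC pause, network partition); its lease silently expires.
assert lock.acquire("client-B", ttl=5.0, now=6.0)   # B takes over at t=6
# A resumes and commits without re-checking its lease: A never observed the
# expiry, B legitimately holds the lock, and both writers commit.
print(lock.held_by("client-B", now=6.0))            # True
print(lock.held_by("client-A", now=6.0))            # False, but A doesn't check
```

Making every client re-verify its lease at exactly the right moment (or adding fencing tokens) is precisely the "complexity distributed to clients" the email describes.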
We have solved issue 2, and issue 1 is tolerated by users for the time being. Regarding issue 3, we don't know what specific problem it relates to. From what I've seen so far, I think most of the problems can actually be solved. Filesystem-based catalogs can achieve at least read-committed isolation, which is sufficient for most OLAP scenarios. I'm not sure whether I'm being overly optimistic about this, so please let me know if there's anything wrong with my opinion. Thank you.

On 2024-07-13 02:07:42, "Ryan Blue" <b...@databricks.com.INVALID> wrote:

I think one of the main questions is whether we want to support locking strategies moving forward. These were needed in early catalogs that didn't have support for atomic operations (HadoopCatalog and GlueCatalog). Now Glue supports atomic commits, and we have long been discouraging the use of HadoopCatalog, which is a purely filesystem-based implementation.

One thing to consider is that external locking does solve a few of the challenges of the filesystem-based approach, but it doesn't help with many of the shortcomings of the HadoopCatalog, like being able to atomically delete or rename a table. (Operations that are very commonly used in data engineering!)

Maybe we should consider moving the Hadoop* classes into a separate iceberg-hadoop module, along with the LockManager, to make it work somewhat better. Personally, I'd prefer deprecating HadoopCatalog and HadoopTableOperations because of their serious limitations. But moving these into a separate module seems like a good compromise. That would also avoid needing to add dependencies to core, like Redis for lock implementations.

Ryan

On Thu, Jul 11, 2024 at 10:42 PM lisoda <lis...@yeah.net> wrote:

Currently, the only LockManager implementation in iceberg-core is InMemoryLockManager. This PR adds two LockManager implementations: one based on Redis, and another based on a REST API.
In general, a Redis-based LockManager is sufficient for most users' scenarios; where Redis cannot meet a user's requirements, the user can provide a REST API service to implement this function. I believe that, for a long time to come, these two LockManager implementations will satisfy most customers' needs. If someone could review this PR, that would be great.

PR: https://github.com/apache/iceberg/pull/10688
SLACK: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720761992982729

--
Ryan Blue
Databricks
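The LockManager shape being discussed (roughly acquire(entityId, ownerId) / release(entityId, ownerId); paraphrased here, not copied from iceberg-core) maps naturally onto the Redis SET ... NX PX pattern. A minimal in-memory sketch of that shape, with the Redis calls replaced by a dict so it runs standalone (class and field names are my own, not from the PR):

```python
import time

class DictLockManager:
    """In-memory stand-in for a Redis-backed lock manager.
    Redis equivalent: SET <entity_id> <owner_id> NX PX <ttl_ms> to acquire,
    and a check-owner-then-DEL script to release."""
    def __init__(self, ttl_s=30.0):
        self.ttl_s = ttl_s
        self.locks = {}  # entity_id -> (owner_id, expires_at)

    def acquire(self, entity_id, owner_id):
        now = time.monotonic()
        held = self.locks.get(entity_id)
        # Grant if free, expired, or re-entrant (same owner refreshing its lease).
        if held is None or held[1] <= now or held[0] == owner_id:
            self.locks[entity_id] = (owner_id, now + self.ttl_s)
            return True
        return False

    def release(self, entity_id, owner_id):
        held = self.locks.get(entity_id)
        if held is not None and held[0] == owner_id:
            del self.locks[entity_id]
            return True
        return False

lm = DictLockManager()
print(lm.acquire("db.table", "writer-1"))  # True
print(lm.acquire("db.table", "writer-2"))  # False: held by writer-1
print(lm.release("db.table", "writer-1"))  # True
print(lm.acquire("db.table", "writer-2"))  # True: lock is free again
```

A REST-API-backed implementation would follow the same shape, with acquire/release becoming HTTP calls to the user-provided service; the TTL is what reintroduces the lost-lock issues raised earlier in the thread.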