Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Ryan Blue
I think “is it worth it” is the right question. We *could* use putIfAbsent in some back-ends, exclusive file creation in others, atomic renames in some more. We *could* come up with an API that abstracts those things. In the end we would end up with a situation in which: - These operations need
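Editor's illustration of the "exclusive file creation" option mentioned above: a minimal sketch on a local file system using only java.nio, not the Hadoop or Iceberg APIs. The class name, the tryCommit helper, and the v<N>.metadata.json naming are assumptions made for the example. A real commit path would also have to ensure readers never observe a partially written file, which this sketch does not address.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; this is not Iceberg's HadoopTableOperations.
class ExclusiveCreateCommit {

  // Tries to publish the metadata for `version` by creating the file exclusively.
  // Returns true if this writer created the file, false if the file already existed
  // (that is, another writer committed this version first).
  static boolean tryCommit(Path tableDir, int version, String metadataJson) throws IOException {
    Path target = tableDir.resolve("v" + version + ".metadata.json");
    try {
      // CREATE_NEW makes the call fail if the target already exists.
      Files.write(
          target,
          metadataJson.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE_NEW,
          StandardOpenOption.WRITE);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("exclusive-create-demo");
    System.out.println(tryCommit(dir, 1, "{\"writer\":\"A\"}")); // first writer wins
    System.out.println(tryCommit(dir, 1, "{\"writer\":\"B\"}")); // second writer is rejected
  }
}
```

The point of the sketch is only that the loser learns it lost; whether a given object store offers an equivalent primitive is exactly what the thread is debating.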

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Jack Ye
I guess the problem with an OVERWRITE flag for rename is that, with this flag, file mutual exclusion seems to be more difficult to enforce, and the difference among file systems becomes really nuanced. If 2 writers both have OVERWRITE flag on, then it seems like the file system should just let one

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Jack Ye
Oh, I remember now: I think it was because the HDFS rename fails when the destination file already exists. However, I think in the latest HDFS with the FileContext API, an OVERWRITE flag can be passed to the context to make the rename succeed [1]: > If OVERWRITE option is not passed as an argument, rename f
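Editor's illustration of the rename semantics discussed here and in the reply above: a sketch that uses java.nio on a local file system as a stand-in, not the Hadoop FileContext API. StandardCopyOption.REPLACE_EXISTING plays the role of the OVERWRITE flag, and the class, helper names, and file names are assumptions made for the example. The two renames run one after the other, so this shows the outcome each writer observes, not a guarantee about concurrent behavior.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; local-file-system analogy, not HDFS.
class RenameOverwriteDemo {

  // Renames src to dst. Returns false if dst already exists and overwrite is off.
  static boolean rename(Path src, Path dst, boolean overwrite) throws IOException {
    try {
      if (overwrite) {
        Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
      } else {
        Files.move(src, dst);
      }
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }

  // Two writers rename their temp files onto the same target, one after the other.
  // Returns "<writer A succeeded>,<writer B succeeded>,<final content of the target>".
  static String run(boolean overwrite) throws IOException {
    Path dir = Files.createTempDirectory("rename-demo");
    Path a = Files.write(dir.resolve("writer-a.tmp"), "A".getBytes(StandardCharsets.UTF_8));
    Path b = Files.write(dir.resolve("writer-b.tmp"), "B".getBytes(StandardCharsets.UTF_8));
    Path target = dir.resolve("v2.metadata.json");
    boolean aWon = rename(a, target, overwrite);
    boolean bWon = rename(b, target, overwrite);
    String content = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
    return aWon + "," + bWon + "," + content;
  }

  public static void main(String[] args) throws IOException {
    System.out.println(run(false)); // true,false,A : the second rename is rejected
    System.out.println(run(true));  // true,true,B  : both "succeed", the last writer wins
  }
}
```

With overwrite enabled, neither writer can tell from the return value that it clobbered or was clobbered by the other, which is the mutual-exclusion concern raised in the thread.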

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Russell Spitzer
My guess would be to avoid complications with multiple committers attempting to swap at the same time. On Wed, Jul 31, 2024 at 9:50 AM Jack Ye wrote: > I see, thank you Fokko, this is a very helpful context. > > Looking at the discussion in the PR and discussions in it, it seems like > the versi

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Jack Ye
I see, thank you Fokko, this is very helpful context. Looking at the PR and the discussions in it, it seems like the version hint file is the key problem here. The file system table spec [1] is technically correct and only uses a single rename operation to perform the atomic commit

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Fokko Driesprong
Jack, no atomic drop table support: this seems pretty fixable, as you can change > the semantics of dropping a table to be deleting the latest table version > hint file, instead of having to delete everything in the folder. I feel > that actually also fits the semantics of purge/no-purge better.

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Jack Ye
Atomicity is just one requirement; it also needs to be efficient, ideally a metadata-only operation. According to the GCS documentation [1], the rename operation is still a COPY + DELETE behind the scenes unless it is a hierarchical namespace bucket. The Azure documentation [2] also sugges

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Steve Loughran
On Thu, 18 Jul 2024 at 00:02, Ryan Blue wrote: > Hey everyone, > > There has been some recent discussion about improving > HadoopTableOperations and the catalog based on those tables, but we've > discouraged using file system only table (or "hadoop" tables) for years now > because of major proble

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Jack Ye
long as the file system supports atomic single-file writing, where the >>>>> file >>>>> becomes immediately visible upon the client's successful write operation, >>>>> that is sufficient. We can do without the rename operation as long as the >>>

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Jean-Baptiste Onofré
n do without the rename operation as long as the >>>> file system guarantees this feature. Of course, if the object storage >>>> system supports mutex operations, we can also uniformly use the rename >>>> operation for committing. We can theoretically avoi

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Jean-Baptiste Onofré
n also uniformly use the rename >>> operation for committing. We can theoretically avoid the situation of >>> providing a large number of commit strategies for different file systems. >>> Replied Message >>> From Jack Ye >>> Date 07/24/2024 02:52

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Fokko Driesprong
. We can theoretically avoid the situation of >>> providing a large number of commit strategies for different file systems. >>> Replied Message >>> From Jack Ye >>> Date 07/24/2024 02:52 >>> To dev@iceberg.apache.org >>> Cc >>

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-30 Thread Gabor Kaszab
ion for committing. We can theoretically avoid the situation of >> providing a large number of commit strategies for different file systems. >> Replied Message ---- >> From Jack Ye >> Date 07/24/2024 02:52 >> To dev@iceberg.apache.org >> Cc >>

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-25 Thread Ryan Blue
e 07/24/2024 02:52 > To dev@iceberg.apache.org > Cc > Subject Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests > in 2.0 > If we come up with a new storage-only catalog implementation that could > solve those limitations and also leverage the new features being developed

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-24 Thread Jean-Baptiste Onofré
Hi guys, Sorry for the late reply in this thread. I think Ryan's proposal is reasonable as we clearly have "questions/limitations" about HadoopTableOperations. However, it seems we have production-level systems using it and also potential improvements ahead. @Jack, I'm not sure moving to a sepa

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-23 Thread lisoda
plied Message | From | Jack Ye | | Date | 07/24/2024 02:52 | | To | dev@iceberg.apache.org | | Cc | | | Subject | Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0 | If we come up with a new storage-only catalog implementation that could solve those limitations and

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-23 Thread Jack Ye
If we come up with a new storage-only catalog implementation that could solve those limitations and also leverage the new features being developed in object storage, would that be a potential alternative strategy? So HadoopCatalog users have a way to move forward with still a storage-only catalog th

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-23 Thread Ryan Blue
I don't think we would want to put this in a module with other catalog implementations. It has serious limitations and is actively discouraged, while the other catalog implementations still have value as either REST back-end catalogs or as regular catalogs for many users. On Tue, Jul 23, 2024 at 9

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-23 Thread Jack Ye
For some additional information, we also have some Iceberg HDFS users on EMR. Those are mainly users that have long-running Hadoop and HBase installations. They typically refresh their installation every 1-2 years. From my understanding, they use S3 for data storage, but metadata is kept in the lo

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-23 Thread Ryan Blue
Thanks for the context, lisoda. I agree that it's good to understand the issues you're facing with the HadoopCatalog. One follow up question that I have is what the underlying storage is. Are you using HDFS for those 30,000 customers? I think you're right that there is a challenge to migrating. Be

Re: Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread lisoda
Hi team. I am not a PMC member, just a regular user. Instead of discussing whether HadoopCatalog needs to continue to exist, I'd like to share a more practical issue. We currently serve over 30,000 customers, all of whom use Iceberg to store their foundational data, and all business

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Ryan Blue
I am not fully sold that object storage issues have been solved. S3 directory bucket is not a general purpose bucket and lives in a single zone. The data durability guarantee may not work for many use cases. We don’t know when S3 will add the atomic renaming support. I agree with Steven here that

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Steven Wu
Thanks Jack for the thoughtful comments. I am not fully sold that object storage issues have been solved. S3 directory bucket is not a general purpose bucket and lives in a single zone. The data durability guarantee may not work for many use cases. We don't know when S3 will add the atomic renamin

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread John Zhuge
Appreciate the thoughtful comments! On Thu, Jul 18, 2024 at 10:29 AM Jack Ye wrote: > Thank you for bringing this up Ryan. I have been also in the camp of > saying HadoopCatalog is not recommended, but after thinking about this more > deeply last night, I now have mixed feelings about this to

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Jack Ye
Thank you for bringing this up, Ryan. I have also been in the camp of saying HadoopCatalog is not recommended, but after thinking about this more deeply last night, I now have mixed feelings about this topic. Just to comment on the reasons you listed first: * For reason 1 & 2, it looks like the roo

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Eduard Tudenhöfner
+1 on deprecating now and removing them from the codebase with Iceberg 2.0 On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat wrote: > +1 on deprecating the `File System Tables` from spec and `HadoopCatalog`, > `HadoopTableOperations` in code for now > and removing them permanently during 2.0 release

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Ajantha Bhat
+1 on deprecating the `File System Tables` from spec and `HadoopCatalog`, `HadoopTableOperations` in code for now and removing them permanently during 2.0 release. For testing we can use `InMemoryCatalog` as others mentioned. I am not sure about moving to test or keeping them only for HDFS. Becau

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-18 Thread Fokko Driesprong
Hey Ryan and others, Thanks for bringing this up. I would be in favor of removing the HadoopTableOperations, mostly because of the reasons that you already mentioned, but also because it is not fully in line with the first principles of Iceberg (being object store native) as it uses fi

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread NOTME ZE
According to our requirements, this function is for some users who want to read Iceberg tables without relying on any catalogs. I think the StaticTable may be more flexible and clearer in semantics. For StaticTable, it's the user's responsibility to decide which metadata of the table to read. But for

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Daniel Weeks
In Java, I think you're looking for the 'Tables' interface and 'HadoopTables' implementation for just directly loading a table from a location. On Wed, Jul 17, 2024 at 8:48 PM Renjie Liu wrote: > I think there are two ways to do this: > 1. As Xuanwo said, we refactor HadoopCatalog to be read on

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Renjie Liu
I think there are two ways to do this: 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and throw an unsupported operation exception for other operations that manipulate tables. 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in pyiceberg or iceberg-rust. On Thu, Jul 18

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Xuanwo
Hi, Renjie. Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog to enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? This catalog would be read-only and would not support write operations. On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote: > Hi, Ryan:

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Manu Zhang
Is it possible to move HadoopCatalog to a separate repo and let it evolve on its own? On Thu, Jul 18, 2024 at 10:31 AM Renjie Liu wrote: > Hi, Ryan: > > Thanks for raising this. I agree that HadoopCatalog is dangerous in > manipulating tables/catalogs given limitations of different file systems.

Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Renjie Liu
Hi, Ryan: Thanks for raising this. I agree that HadoopCatalog is dangerous in manipulating tables/catalogs given limitations of different file systems. But I see that there are some users who want to read Iceberg tables without relying on any catalogs; this is also the motivational use case of Sta

[DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-17 Thread Ryan Blue
Hey everyone, There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file system only tables (or "hadoop" tables) for years now because of major problems: * It is only safe to use hadoop tables with HDFS; most l