If we come up with a new storage-only catalog implementation that could solve those limitations and also leverage the new features being developed in object storage, would that be a potential alternative strategy? That way, HadoopCatalog users have a way to move forward with a storage-only catalog that can still run on HDFS, and we can fully deprecate HadoopCatalog.
-Jack

On Tue, Jul 23, 2024 at 10:00 AM Ryan Blue <b...@databricks.com.invalid> wrote:

> I don't think we would want to put this in a module with other catalog implementations. It has serious limitations and is actively discouraged, while the other catalog implementations still have value, either as REST back-end catalogs or as regular catalogs for many users.
>
> On Tue, Jul 23, 2024 at 9:11 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> For some additional information, we also have some Iceberg HDFS users on EMR. Those are mainly users with long-running Hadoop and HBase installations, who typically refresh their installation every 1-2 years. From my understanding, they use S3 for data storage, but metadata is kept in the local HDFS cluster, so HadoopCatalog works well for them.
>>
>> I remember we discussed moving all catalog implementations currently in the main repo to a separate iceberg-catalogs repo. Could we do this move as a part of that effort?
>>
>> -Jack
>>
>> On Tue, Jul 23, 2024 at 8:46 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>
>>> Thanks for the context, lisoda. I agree that it's good to understand the issues you're facing with the HadoopCatalog. One follow-up question I have is what the underlying storage is. Are you using HDFS for those 30,000 customers?
>>>
>>> I think you're right that there is a challenge in migrating. Because there is no catalog requirement, it's hard to make sure you have all of the writers migrated. I think that means we do need to have a plan or recommendation for people currently using this catalog in production, but it also puts more pressure on us to deprecate this catalog and avoid more people having this problem.
>>>
>>> I think it's a good idea to make the spec change, which we have agreement on, and to ensure that the FS catalog and table operations are properly deprecated to show that they should not be used. I'm not sure whether there is support in the community for moving the implementation into a new iceberg-hadoop module, but at a minimum we can't just remove this right away. I think that a separate iceberg-hadoop module would make the most sense.
>>>
>>> On Thu, Jul 18, 2024 at 11:09 PM lisoda <lis...@yeah.net> wrote:
>>>
>>>> Hi team.
>>>> I am not a PMC member, just a regular user. Instead of discussing whether HadoopCatalog needs to continue to exist, I'd like to share a more practical issue.
>>>>
>>>> We currently serve over 30,000 customers, all of whom use Iceberg to store their foundational data, and all business analyses are conducted based on Iceberg. However, all the Iceberg tables are hadoop_catalog tables. At least, this has been the case since I started working with our production environment.
>>>>
>>>> In recent days, I've attempted to migrate hadoop_catalog to jdbc-catalog, but I failed. We store 2PB of data, and replacing the current catalogs has become an almost impossible task. Users not only create hadoop_catalog tables through Spark; they also continuously use third-party OLAP systems, Flink, and other means to write data into Iceberg in the form of hadoop_catalog. Given this situation, we can only continue to fix hadoop_catalog and provide services to customers.
>>>>
>>>> I understand that the community wants to make a big push into the REST catalog, and I agree with the direction the community is going. But considering that there might be a significant number of users facing similar issues, can we at least retain a module similar to iceberg-hadoop to extend hadoop_catalog? If it is removed, we won't be able to continue providing services to customers. So, if possible, please consider this option.
>>>>
>>>> Thank you all.
>>>>
>>>> Kind regards,
>>>> lisoda
>>>>
>>>> At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:
>>>>
>>>> Thank you for bringing this up, Ryan. I have also been in the camp of saying HadoopCatalog is not recommended, but after thinking about this more deeply last night, I now have mixed feelings about this topic. Just to comment on the reasons you listed first:
>>>>
>>>> * For reasons 1 & 2, it looks like the root cause is that people try to use HadoopCatalog outside native HDFS because there are HDFS connectors to other storages, like S3AFileSystem. However, the norm for such usage has been that those connectors do not strictly follow HDFS semantics, and it is assumed that people acknowledge the implications of such usage and accept the risk. For example, S3AFileSystem was there even before S3 was strongly consistent, but people have been using it to write files.
>>>>
>>>> * For reason 3, there are multiple catalogs that do not support all operations (e.g. Glue for atomic table rename) and people still widely use them.
>>>>
>>>> * For reason 4, I see that more as a missing feature. More features could definitely be developed in that catalog implementation.
>>>>
>>>> So the key question to me is: how can we prevent people from using HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because it is a storage-only solution. For object storages specifically, HadoopCatalog is not suitable for two reasons:
>>>>
>>>> (1) file writes do not enforce mutual exclusion, so the catalog cannot enforce Iceberg's optimistic concurrency requirement (i.e., it cannot do an atomic compare-and-swap)
>>>>
>>>> (2) the directory-based design is not preferred in object storage and will result in bad performance.
>>>>
>>>> However, as I look at these two issues now, they are getting outdated.
>>>>
>>>> (1) Object storage is starting to enforce file mutual exclusion. GCS supports a file generation number [1] that increments monotonically, and a writer can use x-goog-if-generation-match [2] to perform an atomic swap (see the sketch at the end of this message). A similar feature [3] exists in Azure Blob Storage. I cannot speak to the S3 team's roadmap, but Amazon S3 is clearly falling behind in this domain, and with market competition it is very likely that similar features will come in the reasonably near future.
>>>>
>>>> (2) Directory buckets are becoming the norm. Amazon S3 announced directory buckets at re:Invent 2023 [4]; they do not have the same performance limitation even with deeply nested folders and many objects in a folder. GCS has a similar feature in preview [5] right now. Azure has already had this feature since 2021 [6].
>>>>
>>>> With these new developments in the industry, a storage-only Iceberg catalog becomes very attractive. It is simple, with only one service dependency. It can safely perform an atomic compare-and-swap.
>>>> It is performant, without the need to worry about folder and file organization. If you want to add additional features for things like access control, there are also integrations like Access Grants [7] that can provide that in a very scalable way.
>>>>
>>>> I know the direction in the community so far is to go with the REST catalog, and I am personally a big advocate for that. However, that requires either building a full REST catalog or choosing a catalog vendor that supports REST. There are many capabilities that REST would unlock, but those are visions that I expect will take the community many years of driving consensus and building features to realize. If I am the CTO of a small company and I just want an Iceberg data lake(house) right now, do I choose REST, or do I choose (or even just build) a storage-only Iceberg catalog? I feel I would actually choose the latter.
>>>>
>>>> Going back to the discussion points, my current take on this topic is:
>>>>
>>>> (1) +1 for clarifying in the spec that HadoopCatalog should only work with HDFS.
>>>>
>>>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default (e.g. fail if using S3A), but we should allow a feature flag to unblock the usage so that people can use it after understanding the implications and risks, just like how people use S3A today.
>>>>
>>>> (3) +0 for removing HadoopCatalog from the core library. It could live in a different module like iceberg-hdfs if that is more suitable.
>>>>
>>>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid use case for Iceberg. After measures 1-3 above, people who actually have an HDFS use case should be able to continue to innovate on and optimize the HadoopCatalog implementation. Although "HDFS is becoming much less common", looking at GitHub issues and discussion forums, it still has a pretty big user base.
>>>>
>>>> (5) In general, I propose we separate the discussion of HadoopCatalog from that of a "storage-only catalog" that also deals with other object stores. With these latest industry developments, we should evaluate the direction of building a storage-only Iceberg catalog and see if the community has an interest in that. I could help raise a thread about it after this discussion is closed.
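>>>>
>>>> As a concrete illustration of the generation-match swap mentioned above, here is a rough sketch of what a commit could look like against GCS, assuming the google-cloud-storage Java client. The bucket, key, and pointer layout are hypothetical; this is an illustration, not a worked-out design:
>>>>
>>>> import com.google.cloud.storage.*;
>>>> import java.nio.charset.StandardCharsets;
>>>>
>>>> // Hypothetical sketch, not production code. Atomically replace a small
>>>> // "pointer" object that records the current metadata location. The
>>>> // generation precondition is sent as x-goog-if-generation-match, so the
>>>> // write succeeds only if nobody swapped the pointer since we read
>>>> // generation `expected`.
>>>> class GcsPointerSwap {
>>>>   static boolean swap(Storage storage, String bucket, String key,
>>>>                       long expected, String newMetadataLocation) {
>>>>     BlobInfo pointer =
>>>>         BlobInfo.newBuilder(BlobId.of(bucket, key, expected)).build();
>>>>     try {
>>>>       storage.create(pointer,
>>>>           newMetadataLocation.getBytes(StandardCharsets.UTF_8),
>>>>           Storage.BlobTargetOption.generationMatch());
>>>>       return true;
>>>>     } catch (StorageException e) {
>>>>       if (e.getCode() == 412) {
>>>>         return false; // precondition failed: lost the race, refresh and retry
>>>>       }
>>>>       throw e;
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> A failed swap maps naturally onto a commit retry, which is exactly the optimistic concurrency behavior an Iceberg catalog needs.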
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> [1] https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>>>> [2] https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>>>> [3] https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>>>> [4] https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>>>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>>>> [6] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>>>> [7] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>>>
>>>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> +1 on deprecating now and removing them from the codebase with Iceberg 2.0
>>>>>
>>>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>
>>>>>> +1 on deprecating `File System Tables` in the spec and `HadoopCatalog` and `HadoopTableOperations` in code for now, and removing them permanently in the 2.0 release.
>>>>>>
>>>>>> For testing we can use `InMemoryCatalog`, as others mentioned.
>>>>>>
>>>>>> I am not sure about moving them to tests or keeping them only for HDFS, because that leads to confusion for existing users of the Hadoop catalog.
>>>>>>
>>>>>> I wanted to have it deprecated two years ago <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>, and I remember that we discussed it in a sync at that time and left it as it is. Also, when a user recently brought up the LockManager and refactoring HadoopTableOperations in Slack <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>, I asked them to open this discussion on the mailing list so that we can conclude it once and for all.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Ryan and others,
>>>>>>>
>>>>>>> Thanks for bringing this up. I would be in favor of removing the HadoopTableOperations, mostly because of the reasons that you already mentioned, but also because it is not fully in line with the first principles of Iceberg (being object-store native), as it uses file listing.
>>>>>>>
>>>>>>> I think we should deprecate the HadoopTables to get the attention of their users. I would be reluctant to move it to tests just to use it for testing purposes; I'd rather remove it and replace its use in tests with the InMemoryCatalog.
>>>>>>>
>>>>>>> Regarding the StaticTable, this is an easy way to have a read-only table by directly pointing to the metadata. This also lives in Java under StaticTableOperations <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>. It isn't a full-blown catalog where you can list {tables,schemas}, update tables, etc. As ZENOTME pointed out already, it is all up to the user; for example, there is no listing of directories to determine which tables are in the catalog.
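>>>>>>>
>>>>>>> For example, something like the following sketch gives you a read-only table pinned to a single metadata file (the metadata path here is hypothetical):
>>>>>>>
>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>> import org.apache.iceberg.BaseTable;
>>>>>>> import org.apache.iceberg.StaticTableOperations;
>>>>>>> import org.apache.iceberg.Table;
>>>>>>> import org.apache.iceberg.hadoop.HadoopFileIO;
>>>>>>>
>>>>>>> // Point directly at a metadata file; no catalog is involved.
>>>>>>> Table table = new BaseTable(
>>>>>>>     new StaticTableOperations(
>>>>>>>         "s3://bucket/warehouse/db/t/metadata/v42.metadata.json",
>>>>>>>         new HadoopFileIO(new Configuration())),
>>>>>>>     "db.t");
>>>>>>>
>>>>>>> // Scans work as usual; commits fail because the operations are static.
>>>>>>> table.newScan().planFiles();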
>>>>>>>
>>>>>>>> is there a probability that the strategy used by HadoopCatalog is not compatible with tables managed by other catalogs?
>>>>>>>
>>>>>>> Yes, they are different. You can see in the spec the section on File System Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>, which is used by the HadoopTables implementation, whereas the other catalogs follow the Metastore Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables> section.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>>>>>>
>>>>>>>> According to our requirements, this function is for users who want to read Iceberg tables without relying on any catalog. I think StaticTable may be more flexible and clearer in semantics. For StaticTable, it is the user's responsibility to decide which metadata of the table to read; but for a read-only HadoopCatalog, the metadata may be decided by the catalog. Is there a probability that the strategy used by HadoopCatalog is not compatible with tables managed by other catalogs?
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think there are two ways to do this:
>>>>>>>>> 1. As Xuanwo said, refactor HadoopCatalog to be read-only and throw an unsupported-operation exception for the operations that manipulate tables.
>>>>>>>>> 2. Totally deprecate HadoopCatalog and add StaticTable as we did in pyiceberg and iceberg-rust.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, Renjie
>>>>>>>>>>
>>>>>>>>>> Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog to enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? This catalog would be read-only and would not support write operations.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, Ryan:
>>>>>>>>>>
>>>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for manipulating tables/catalogs given the limitations of different file systems. But I see that there are some users who want to read Iceberg tables without relying on any catalog; this is also the motivating use case of StaticTable in pyiceberg and iceberg-rust. Is there a similar thing in the Java implementation?
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file-system-only tables (or "hadoop" tables) for years now because of major problems:
>>>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>>>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>>>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
>>>>>>>>>> * Alternative file names (for instance, when using metadata file compression) also break guarantees
>>>>>>>>>>
>>>>>>>>>> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>>>>>>>>>>
>>>>>>>>>> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common, and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
>>>>>>>>>>
>>>>>>>>>> What do other people think?
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>>
>>>>>>>>>> Xuanwo
>>>>>>>>>>
>>>>>>>>>> https://xuanwo.io/
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>
>
> --
> Ryan Blue
> Databricks