If we come up with a new storage-only catalog implementation that could solve those limitations and also leverage the new features being developed in object storage, would that be a potential alternative strategy? That way, HadoopCatalog users have a way to move forward with a storage-only catalog that can still run on HDFS, and we can fully deprecate HadoopCatalog.
-Jack

On Tue, Jul 23, 2024 at 10:00 AM Ryan Blue <b...@databricks.com.invalid> wrote:

> I don't think we would want to put this in a module with other catalog implementations. It has serious limitations and is actively discouraged, while the other catalog implementations still have value, either as REST back-end catalogs or as regular catalogs for many users.
>
> On Tue, Jul 23, 2024 at 9:11 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> For some additional information, we also have some Iceberg HDFS users on EMR. Those are mainly users with long-running Hadoop and HBase installations, who typically refresh their installation every 1-2 years. From my understanding, they use S3 for data storage, but metadata is kept in the local HDFS cluster, so HadoopCatalog works well for them.
>>
>> I remember we discussed moving all catalog implementations currently in the main repo to a separate iceberg-catalogs repo. Could we do this move as a part of that effort?
>>
>> -Jack
>>
>> On Tue, Jul 23, 2024 at 8:46 AM Ryan Blue <b...@databricks.com.invalid> wrote:
>>
>>> Thanks for the context, lisoda. I agree that it's good to understand the issues you're facing with the HadoopCatalog. One follow-up question I have is what the underlying storage is. Are you using HDFS for those 30,000 customers?
>>>
>>> I think you're right that there is a challenge in migrating. Because there is no catalog requirement, it's hard to make sure you have all of the writers migrated. I think that means we do need to have a plan or recommendation for people currently using this catalog in production, but it also puts more pressure on us to deprecate this catalog and avoid more people having this problem.
>>>
>>> I think it's a good idea to make the spec change, which we have agreement on, and to ensure that the FS catalog and table operations are properly deprecated to show that they should not be used. I'm not sure whether there is support in the community for moving the implementation into a new iceberg-hadoop module, but at a minimum we can't just remove this right away. I think that a separate iceberg-hadoop module would make the most sense.
>>>
>>> On Thu, Jul 18, 2024 at 11:09 PM lisoda <lis...@yeah.net> wrote:
>>>
>>>> Hi team.
>>>> I am not a PMC member, just a regular user. Instead of discussing whether HadoopCatalog needs to continue to exist, I'd like to share a more practical issue.
>>>>
>>>> We currently serve over 30,000 customers, all of whom use Iceberg to store their foundational data, and all business analyses are conducted based on Iceberg. However, all the Iceberg tables are hadoop_catalog tables. At least, this has been the case since I started working with our production environment.
>>>>
>>>> In recent days, I've attempted to migrate hadoop_catalog to jdbc-catalog, but I failed. We store 2PB of data, and replacing the current catalogs has become an almost impossible task. Users not only create hadoop_catalog tables through Spark; they also continuously use third-party OLAP systems, Flink, and other means to write data into Iceberg in the form of hadoop_catalog. Given this situation, we can only continue to fix hadoop_catalog and provide services to customers.
>>>>
>>>> I understand that the community wants to make a big push into the REST catalog, and I agree with the direction the community is going. But considering that there might be a significant number of users facing similar issues, can we at least retain a module similar to iceberg-hadoop to extend hadoop_catalog? If it is removed, we won't be able to continue providing services to customers. So, if possible, please consider this option.
>>>>
>>>> Thank you all.
>>>>
>>>> Kind regards,
>>>> lisoda
>>>>
>>>> At 2024-07-19 01:28:18, "Jack Ye" <yezhao...@gmail.com> wrote:
>>>>
>>>> Thank you for bringing this up, Ryan. I have also been in the camp of saying HadoopCatalog is not recommended, but after thinking about this more deeply last night, I now have mixed feelings about this topic. Just to comment on the reasons you listed first:
>>>>
>>>> * For reasons 1 & 2, it looks like the root cause is that people try to use HadoopCatalog outside native HDFS because there are HDFS connectors to other storages, like S3AFileSystem. However, the norm for such usage has been that those connectors do not strictly follow HDFS semantics, and it is assumed that people acknowledge the implications of such usage and accept the risk. For example, S3AFileSystem was there even before S3 was strongly consistent, but people have been using it to write files.
>>>>
>>>> * For reason 3, there are multiple catalogs that do not support all operations (e.g. Glue for atomic table rename) and people still widely use them.
>>>>
>>>> * For reason 4, I see that more as a missing feature. More features could definitely be developed in that catalog implementation.
>>>>
>>>> So the key question to me is: how can we prevent people from using HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because it is a storage-only solution. For object storages specifically, HadoopCatalog is not suitable for two reasons:
>>>>
>>>> (1) file writes do not enforce mutual exclusion, so the catalog cannot enforce Iceberg's optimistic concurrency requirement (i.e., it cannot do an atomic compare-and-swap)
>>>>
>>>> (2) the directory-based design is not preferred in object storage and will result in bad performance.
>>>>
>>>> However, as I look at these two issues now, they are getting outdated.
>>>>
>>>> (1) Object storage is starting to enforce file mutual exclusion. GCS supports a file generation number [1] that increments monotonically, and a writer can use x-goog-if-generation-match [2] to perform an atomic swap (see the sketch at the end of this message). A similar feature [3] exists in Azure Blob Storage. I cannot speak to the S3 team's roadmap, but Amazon S3 is clearly falling behind in this domain, and with market competition it is very likely that similar features will come in the reasonably near future.
>>>>
>>>> (2) Directory buckets are becoming the norm. Amazon S3 announced directory buckets at re:Invent 2023 [4]; they do not have the same performance limitation even with deeply nested folders and many objects in a folder. GCS has a similar feature in preview [5] right now. Azure has already had this feature since 2021 [6].
>>>>
>>>> With these new developments in the industry, a storage-only Iceberg catalog becomes very attractive. It is simple, with only one service dependency. It can safely perform an atomic compare-and-swap.
>>>> It is performant, without the need to worry about folder and file organization. If you want to add additional features for things like access control, there are also integrations like Access Grants [7] that can provide that in a very scalable way.
>>>>
>>>> I know the direction in the community so far is to go with the REST catalog, and I am personally a big advocate for that. However, that requires either building a full REST catalog or choosing a catalog vendor that supports REST. There are many capabilities that REST would unlock, but those are visions that I expect will take the community many years of driving consensus and building features to realize. If I am the CTO of a small company and I just want an Iceberg data lake(house) right now, do I choose REST, or do I choose (or even just build) a storage-only Iceberg catalog? I feel I would actually choose the latter.
>>>>
>>>> Going back to the discussion points, my current take on this topic is:
>>>>
>>>> (1) +1 for clarifying in the spec that HadoopCatalog should only work with HDFS.
>>>>
>>>> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default (e.g. fail if using S3A), but we should allow a feature flag to unblock the usage so that people can use it after understanding the implications and risks, just like how people use S3A today.
>>>>
>>>> (3) +0 for removing HadoopCatalog from the core library. It could live in a different module like iceberg-hdfs if that is more suitable.
>>>>
>>>> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid use case for Iceberg. After measures 1-3 above, people who actually have an HDFS use case should be able to continue to innovate on and optimize the HadoopCatalog implementation. Although "HDFS is becoming much less common", looking at GitHub issues and discussion forums, it still has a pretty big user base.
>>>>
>>>> (5) In general, I propose we separate the discussion of HadoopCatalog from that of a "storage-only catalog" that also deals with other object stores. With these latest industry developments, we should evaluate the direction of building a storage-only Iceberg catalog and see if the community has an interest in that. I could help raise a thread about it after this discussion is closed.
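>>>>
>>>> As a concrete illustration of the generation-match swap mentioned above, here is a rough sketch of what a commit could look like against GCS, assuming the google-cloud-storage Java client. The bucket, key, and pointer layout are hypothetical; this is an illustration, not a worked-out design:
>>>>
>>>> import com.google.cloud.storage.*;
>>>> import java.nio.charset.StandardCharsets;
>>>>
>>>> // Hypothetical sketch, not production code. Atomically replace a small
>>>> // "pointer" object that records the current metadata location. The
>>>> // generation precondition is sent as x-goog-if-generation-match, so the
>>>> // write succeeds only if nobody swapped the pointer since we read
>>>> // generation `expected`.
>>>> class GcsPointerSwap {
>>>>   static boolean swap(Storage storage, String bucket, String key,
>>>>                       long expected, String newMetadataLocation) {
>>>>     BlobInfo pointer =
>>>>         BlobInfo.newBuilder(BlobId.of(bucket, key, expected)).build();
>>>>     try {
>>>>       storage.create(pointer,
>>>>           newMetadataLocation.getBytes(StandardCharsets.UTF_8),
>>>>           Storage.BlobTargetOption.generationMatch());
>>>>       return true;
>>>>     } catch (StorageException e) {
>>>>       if (e.getCode() == 412) {
>>>>         return false; // precondition failed: lost the race, refresh and retry
>>>>       }
>>>>       throw e;
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> A failed swap maps naturally onto a commit retry, which is exactly the optimistic concurrency behavior an Iceberg catalog needs.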
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> [1] https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
>>>> [2] https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
>>>> [3] https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
>>>> [4] https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
>>>> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
>>>> [6] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
>>>> [7] https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>>>>
>>>> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>>>
>>>>> +1 on deprecating now and removing them from the codebase with Iceberg 2.0
>>>>>
>>>>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>
>>>>>> +1 on deprecating `File System Tables` in the spec and `HadoopCatalog` and `HadoopTableOperations` in code for now, and removing them permanently in the 2.0 release.
>>>>>>
>>>>>> For testing we can use `InMemoryCatalog`, as others mentioned.
>>>>>>
>>>>>> I am not sure about moving them to tests or keeping them only for HDFS, because that leads to confusion for existing users of the Hadoop catalog.
>>>>>>
>>>>>> I wanted to have it deprecated two years ago <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>, and I remember that we discussed it in a sync at that time and left it as it is. Also, when a user recently brought up the LockManager and refactoring HadoopTableOperations in Slack <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>, I asked them to open this discussion on the mailing list so that we can conclude it once and for all.
>>>>>>
>>>>>> - Ajantha
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>
>>>>>>> Hey Ryan and others,
>>>>>>>
>>>>>>> Thanks for bringing this up. I would be in favor of removing the HadoopTableOperations, mostly because of the reasons that you already mentioned, but also because it is not fully in line with the first principles of Iceberg (being object-store native), as it uses file listing.
>>>>>>>
>>>>>>> I think we should deprecate the HadoopTables to get the attention of their users. I would be reluctant to move it to tests just to use it for testing purposes; I'd rather remove it and replace its use in tests with the InMemoryCatalog.
>>>>>>>
>>>>>>> Regarding the StaticTable, this is an easy way to have a read-only table by directly pointing to the metadata. This also lives in Java under StaticTableOperations <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>. It isn't a full-blown catalog where you can list {tables,schemas}, update tables, etc. As ZENOTME pointed out already, it is all up to the user; for example, there is no listing of directories to determine which tables are in the catalog.
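>>>>>>>
>>>>>>> For example, something like the following sketch gives you a read-only table pinned to a single metadata file (the metadata path here is hypothetical):
>>>>>>>
>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>> import org.apache.iceberg.BaseTable;
>>>>>>> import org.apache.iceberg.StaticTableOperations;
>>>>>>> import org.apache.iceberg.Table;
>>>>>>> import org.apache.iceberg.hadoop.HadoopFileIO;
>>>>>>>
>>>>>>> // Point directly at a metadata file; no catalog is involved.
>>>>>>> Table table = new BaseTable(
>>>>>>>     new StaticTableOperations(
>>>>>>>         "s3://bucket/warehouse/db/t/metadata/v42.metadata.json",
>>>>>>>         new HadoopFileIO(new Configuration())),
>>>>>>>     "db.t");
>>>>>>>
>>>>>>> // Scans work as usual; commits fail because the operations are static.
>>>>>>> table.newScan().planFiles();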
>>>>>>>
>>>>>>>> is there a probability that the strategy used by HadoopCatalog is not compatible with tables managed by other catalogs?
>>>>>>>
>>>>>>> Yes, they are different. You can see in the spec the section on File System Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>, which is used by the HadoopTables implementation, whereas the other catalogs follow the Metastore Tables <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables> section.
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Fokko
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>>>>>>
>>>>>>>> According to our requirements, this function is for users who want to read Iceberg tables without relying on any catalog. I think StaticTable may be more flexible and clearer in semantics. For StaticTable, it is the user's responsibility to decide which metadata of the table to read; but for a read-only HadoopCatalog, the metadata may be decided by the catalog. Is there a probability that the strategy used by HadoopCatalog is not compatible with tables managed by other catalogs?
>>>>>>>>
>>>>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think there are two ways to do this:
>>>>>>>>> 1. As Xuanwo said, refactor HadoopCatalog to be read-only and throw an unsupported-operation exception for the operations that manipulate tables.
>>>>>>>>> 2. Totally deprecate HadoopCatalog and add StaticTable as we did in pyiceberg and iceberg-rust.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, Renjie
>>>>>>>>>>
>>>>>>>>>> Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog to enable direct reading from file systems like HDFS, S3, and Azure Blob Storage? This catalog would be read-only and would not support write operations.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, Ryan:
>>>>>>>>>>
>>>>>>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for manipulating tables/catalogs given the limitations of different file systems. But I see that there are some users who want to read Iceberg tables without relying on any catalog; this is also the motivating use case of StaticTable in pyiceberg and iceberg-rust. Is there a similar thing in the Java implementation?
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hey everyone,
>>>>>>>>>>
>>>>>>>>>> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file-system-only tables (or "hadoop" tables) for years now because of major problems:
>>>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>>>>>>>>>> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>>>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
>>>>>>>>>> * Alternative file names (for instance, when using metadata file compression) also break guarantees
>>>>>>>>>>
>>>>>>>>>> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>>>>>>>>>>
>>>>>>>>>> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common, and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
>>>>>>>>>>
>>>>>>>>>> What do other people think?
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>>
>>>>>>>>>> Xuanwo
>>>>>>>>>>
>>>>>>>>>> https://xuanwo.io/
>>>
>>> --
>>> Ryan Blue
>>> Databricks
>>
>
> --
> Ryan Blue
> Databricks