Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

John Zhuge Thu, 18 Jul 2024 11:47:59 -0700

Appreciate the thoughtful comments!




On Thu, Jul 18, 2024 at 10:29 AM Jack Ye <yezhao...@gmail.com> wrote:

> Thank you for bringing this up, Ryan. I have also been in the camp of
> saying HadoopCatalog is not recommended, but after thinking about this more
> deeply last night, I now have mixed feelings about this topic. Just to
> comment on the reasons you listed first:
>
> * For reason 1 & 2, it looks like the root cause is that people try to use
> HadoopCatalog outside native HDFS because there are HDFS connectors to
> other storages like S3AFileSystem. However, the norm for such usage has
> been that those connectors do not strictly follow HDFS semantics, and it is
> assumed that people acknowledge the implication of such usage and accept
> the risk. For example, S3AFileSystem was there even before S3 was strongly
> consistent, but people have been using that to write files.
>
> * For reason 3, there are multiple catalogs that do not support all
> operations (e.g. Glue for atomic table rename) and people still use them
> widely.
>
> * For reason 4, I see that more as a missing feature. More features could
> definitely be developed in that catalog implementation.
>
> So the key question to me is: how can we prevent people from using
> HadoopCatalog outside native HDFS? We know HadoopCatalog is popular because
> it is a storage-only solution. For object storage specifically,
> HadoopCatalog is not suitable for 2 reasons:
>
> (1) file writes do not enforce mutual exclusion, and thus cannot satisfy
> Iceberg's optimistic concurrency requirement (i.e. there is no atomic
> compare-and-swap)
>
> (2) directory-based design is not preferred in object storage and will
> result in bad performance.
>
> However, now that I look at these 2 issues, they are becoming outdated.
>
> (1) object storage is starting to enforce file mutual exclusion. GCS
> supports object generation numbers [1] that increase monotonically, and
> clients can use x-goog-if-generation-match [2] to perform an atomic swap.
> A similar feature [3] exists in Azure Blob Storage. I cannot speak for the
> S3 team's roadmap, but Amazon S3 is clearly falling behind in this domain,
> and with market competition, it is very clear that similar features will
> come in the reasonably near future.
>
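
As a rough illustration of the generation-match idea in point (1) above, here is a minimal Python sketch of an optimistic commit built on a conditional write. The InMemoryObjectStore class, its put_if_generation_match method, and the commit helper are hypothetical stand-ins written for this example; they are not the GCS or Azure client APIs. The sketch assumes that generation 0 means "the object does not exist yet".

```python
import threading


class PreconditionFailed(Exception):
    """Raised when the stored generation does not match the expected one."""


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with generation preconditions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # key -> (generation, data)

    def get(self, key):
        # Returns (generation, data); generation 0 means the object is absent.
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_generation_match(self, key, data, expected_generation):
        # Write only if the current generation equals expected_generation.
        with self._lock:
            current, _ = self._objects.get(key, (0, None))
            if current != expected_generation:
                raise PreconditionFailed(
                    f"expected generation {expected_generation}, found {current}"
                )
            self._objects[key] = (current + 1, data)
            return current + 1


def commit(store, pointer_key, new_metadata_location, base_generation):
    """Swap the table pointer; return True on success, False on conflict."""
    try:
        store.put_if_generation_match(
            pointer_key, new_metadata_location, base_generation
        )
        return True
    except PreconditionFailed:
        return False
```

If two writers both read the same base generation, only the first conditional write succeeds; the second sees a conflict and would have to re-read the pointer and retry its commit.
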
> (2) directory buckets are becoming the norm. Amazon S3 announced directory
> buckets at re:Invent 2023 [4], which do not have the same performance
> limitation even if you have very nested folders and many objects in a
> folder. GCS has a similar feature in preview [5] right now.
> Azure has had this feature since 2021 [6].
>
> With these new developments in the industry, a storage-only Iceberg
> catalog becomes very attractive. It is simple, with only one service
> dependency. It can safely perform atomic compare-and-swap. It is performant
> without the need to worry about folder and file organization. If you want
> to add features for things like access control, there are also
> integrations like access grants [7] that can do it in a
> very scalable way.
>
> I know the direction in the community so far is to go with the REST
> catalog, and I am personally a big advocate for that. However, that
> requires either building a full REST catalog, or choosing a catalog vendor
> that supports REST. There are many capabilities that REST would unlock, but
> those are visions which I expect will take many years down the road for the
> community to continue to drive consensus and build those features. If I am
> the CTO of a small company and I just want an Iceberg data lake(house)
> right now, do I choose REST, or do I choose (or even just build) a
> storage-only Iceberg catalog? I feel I would actually choose the latter.
>
> Going back to the discussion points, my current take of this topic is that:
>
> (1) +1 for clarifying that HadoopCatalog should only work with HDFS in the
> spec.
>
> (2) +1 if we want to block non-HDFS use cases in HadoopCatalog by default
> (e.g. fail if using S3A), but we should allow a feature flag to unblock the
> usage so that people can use it after understanding the implications and
> risks, just like how people use S3A today.
>
> (3) +0 for removing HadoopCatalog from the core library. It could be in a
> different module like iceberg-hdfs if that is more suitable.
>
> (4) -1 for moving HadoopCatalog to tests, because HDFS is still a valid
> use case for Iceberg. After measures 1-3 above, people who actually have
> an HDFS use case should be able to continue to innovate and optimize the
> HadoopCatalog implementation. Although "HDFS is becoming much less common",
> looking at GitHub issues and discussion forums, it still has a pretty big
> user base.
>
> (5) In general, I propose we separate the discussion of HadoopCatalog from
> a "storage only catalog" that also deals with other object storages when
> evaluating it. With these latest industry developments, we should evaluate
> the direction for building a storage only Iceberg catalog and see if the
> community has an interest in that. I could help raise a thread about it
> after this discussion is closed.
>
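
To make point (2) above more concrete, here is a minimal Python sketch of a "block non-HDFS by default, allow an opt-in flag" check. The property name hadoop-catalog.allow-non-hdfs and the function validate_warehouse_location are made up for this illustration and do not exist in Iceberg; the sketch also assumes that only the hdfs scheme is treated as safe.

```python
from urllib.parse import urlparse

# Hypothetical property name, used only for illustration.
ALLOW_NON_HDFS_PROPERTY = "hadoop-catalog.allow-non-hdfs"

# Assumption: only native HDFS locations are considered safe by default.
SAFE_SCHEMES = {"hdfs"}


def validate_warehouse_location(location, properties):
    """Raise ValueError for non-HDFS locations unless the opt-in flag is set."""
    scheme = urlparse(location).scheme.lower()
    if scheme in SAFE_SCHEMES:
        return
    # The flag lets users proceed once they accept the risks of weaker semantics.
    if properties.get(ALLOW_NON_HDFS_PROPERTY, "false").lower() == "true":
        return
    raise ValueError(
        f"Location '{location}' is not on HDFS; set "
        f"{ALLOW_NON_HDFS_PROPERTY}=true to use it at your own risk"
    )
```

With this shape, an s3a:// warehouse fails fast by default, and the error message itself tells users which switch to flip once they understand the implications.
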
> Best,
> Jack Ye
>
> [1]
> https://cloud.google.com/storage/docs/object-versioning#file_restoration_behavior
> [2]
> https://cloud.google.com/storage/docs/xml-api/reference-headers#xgoogifgenerationmatch
> [3]
> https://learn.microsoft.com/en-us/rest/api/storageservices/specifying-conditional-headers-for-blob-service-operations
> [4]
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html
> [5] https://cloud.google.com/storage/docs/buckets#enable-hns
> [6]
> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
> [7]
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants.html
>
>
>
>
>
>
> On Thu, Jul 18, 2024 at 7:16 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> +1 on deprecating now and removing them from the codebase with Iceberg 2.0
>>
>> On Thu, Jul 18, 2024 at 10:40 AM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> +1 on deprecating the `File System Tables` from spec and
>>> `HadoopCatalog`, `HadoopTableOperations` in code for now
>>> and removing them permanently during 2.0 release.
>>>
>>> For testing we can use `InMemoryCatalog` as others mentioned.
>>>
>>> I am not sure about moving them to tests or keeping them only for HDFS,
>>> because that would confuse existing users of the Hadoop catalog.
>>>
>>> I wanted to have it deprecated 2 years ago
>>> <https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1647950504955309>
>>> and I remember that we discussed it in the sync at that time and left it as is.
>>> Also, when a user recently brought this up in Slack
>>> <https://apache-iceberg.slack.com/archives/C03LG1D563F/p1720075009593789?thread_ts=1719993403.208859&cid=C03LG1D563F>
>>> regarding the lock manager and refactoring HadoopTableOperations,
>>> I asked that this discussion be opened on the mailing list so that we
>>> can settle it once and for all.
>>>
>>> - Ajantha
>>>
>>> On Thu, Jul 18, 2024 at 12:49 PM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hey Ryan and others,
>>>>
>>>> Thanks for bringing this up. I would be in favor of removing
>>>> HadoopTableOperations, mostly for the reasons that you already
>>>> mentioned, but also because it is not fully in line with the
>>>> first principles of Iceberg (being object store native), as it relies on
>>>> file listing.
>>>>
>>>> I think we should deprecate HadoopTables to get the attention of
>>>> its users. I would be reluctant to move it to tests just to use it for
>>>> testing purposes; I'd rather remove it and replace its use in tests with
>>>> the InMemoryCatalog.
>>>>
>>>> Regarding the StaticTable, this is an easy way to have a read-only
>>>> table by directly pointing to the metadata. This also lives in Java under
>>>> StaticTableOperations
>>>> <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/StaticTableOperations.java>.
>>>> It isn't a full-blown catalog where you can list {tables,schemas},
>>>> update tables, etc. As ZENOTME pointed out already, it is all up to the
>>>> user; for example, there is no listing of directories to determine which
>>>> tables are in the catalog.
>>>>
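
To make the StaticTable idea a bit more concrete, here is a minimal Python sketch of a read-only lookup that starts from a metadata file the user has picked. It is not the pyiceberg or Java StaticTableOperations API; the function name current_manifest_list is made up for this example, and only the metadata field names (current-snapshot-id, snapshots, snapshot-id, manifest-list) come from the table spec. It assumes the metadata JSON has already been read into a string.

```python
import json


def current_manifest_list(metadata_json):
    """Return the manifest list location of the current snapshot, or None."""
    metadata = json.loads(metadata_json)
    current_id = metadata.get("current-snapshot-id")
    # Treat a missing id, null, or -1 as "no current snapshot".
    if current_id is None or current_id == -1:
        return None
    for snapshot in metadata.get("snapshots", []):
        if snapshot["snapshot-id"] == current_id:
            return snapshot.get("manifest-list")
    return None
```

Nothing here lists directories or guesses the latest version: the caller chooses the metadata file, which is the main difference from a catalog.
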
>>>>> is there a probability that the strategy used by HadoopCatalog is not
>>>>> compatible with the table managed by other catalogs?
>>>>
>>>>
>>>> Yes, they are different. The File System Tables section of the spec
>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#file-system-tables>
>>>> describes the strategy used by the HadoopTable implementation, whereas the
>>>> other catalogs follow the Metastore Tables section
>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#metastore-tables>.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Thu, 18 Jul 2024 at 07:19, NOTME ZE <st810918...@gmail.com> wrote:
>>>>
>>>>> According to our requirements, this function is for users who
>>>>> want to read Iceberg tables without relying on any catalog. I think
>>>>> StaticTable may be more flexible and clearer in semantics. With StaticTable,
>>>>> it's the user's responsibility to decide which metadata of the table to
>>>>> read. But with a read-only HadoopCatalog, the metadata may be decided by
>>>>> the catalog. Is there a probability that the strategy used by HadoopCatalog
>>>>> is not compatible with the table managed by other catalogs?
>>>>>
>>>>> On Thu, Jul 18, 2024 at 11:39, Renjie Liu <liurenjie2...@gmail.com> wrote:
>>>>>
>>>>>> I think there are two ways to do this:
>>>>>> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and
>>>>>> throw an unsupported operation exception for other operations that
>>>>>> manipulate tables.
>>>>>> 2. Deprecate HadoopCatalog entirely, and add StaticTable as we did in
>>>>>> pyiceberg or iceberg-rust.
>>>>>>
>>>>>> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>>>>>>
>>>>>>> Hi, Renjie
>>>>>>>
>>>>>>> Are you suggesting that we refactor HadoopCatalog as a
>>>>>>> FileSystemCatalog to enable direct reading from file systems like HDFS,
>>>>>>> S3, and Azure Blob Storage? This catalog would be read-only and would not
>>>>>>> support write operations.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>>>>>>
>>>>>>> Hi, Ryan:
>>>>>>>
>>>>>>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>>>>>>> manipulating tables/catalogs given the limitations of different file
>>>>>>> systems. But I see that there are some users who want to read Iceberg
>>>>>>> tables without relying on any catalog. This is also the motivating use
>>>>>>> case for StaticTable in pyiceberg and iceberg-rust. Is there something
>>>>>>> similar in the Java implementation?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> There has been some recent discussion about improving
>>>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>>>> discouraged using file-system-only tables (or "hadoop" tables) for years
>>>>>>> now because of major problems:
>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>>>> systems, S3, and other common object stores are unsafe
>>>>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>>>>> use the tables in unsafe situations
>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and
>>>>>>> drop table, which are commonly used in data engineering
>>>>>>> * Alternative file names (for instance when using metadata file
>>>>>>> compression) also break guarantees
>>>>>>>
>>>>>>> While these tables are useful for testing in non-production
>>>>>>> scenarios, I think it's misleading to have them in the core module
>>>>>>> because there's an appearance that they are a reasonable choice. I propose
>>>>>>> we deprecate the HadoopTableOperations and HadoopCatalog implementations
>>>>>>> and move them to tests the next time we can make breaking API changes
>>>>>>> (2.0).
>>>>>>>
>>>>>>> I think we should also consider similar fixes to the table spec. It
>>>>>>> currently describes how HadoopTableOperations works, which does not
>>>>>>> work in object stores or local file systems. HDFS is becoming much less
>>>>>>> common and I propose that we note that the strategy in the spec should
>>>>>>> ONLY be used with HDFS.
>>>>>>>
>>>>>>> What do other people think?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>>
>>>>>>> Xuanwo
>>>>>>>
>>>>>>> https://xuanwo.io/
>>>>>>>
>>>>>>>
