According to our requirements, this feature is for users who want to read Iceberg tables without relying on any catalog, so I think StaticTable may be more flexible and clearer in semantics. With StaticTable, it is the user's responsibility to decide which table metadata file to read. With a read-only HadoopCatalog, the metadata file is resolved by the catalog instead; is there a chance that the resolution strategy HadoopCatalog uses is incompatible with tables managed by other catalogs?
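To make the semantics concrete, this is roughly how StaticTable is used in pyiceberg today: the caller names one specific metadata file and gets a read-only table back, with no catalog consulted. A minimal sketch (the bucket path is made up for illustration):

    from pyiceberg.table import StaticTable

    # The user picks the exact metadata file to read; nothing resolves it
    # on their behalf, so the choice of metadata version is explicit.
    table = StaticTable.from_metadata(
        "s3://bucket/warehouse/db/events/metadata/v3.metadata.json"  # illustrative path
    )
    print(table.schema())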
Renjie Liu <liurenjie2...@gmail.com> wrote on Thu, Jul 18, 2024 at 11:39:

> I think there are two ways to do this:
> 1. As Xuanwo said, we refactor HadoopCatalog to be read-only, and throw
> UnsupportedOperationException for the operations that manipulate tables.
> 2. Totally deprecate HadoopCatalog, and add StaticTable as we did in
> pyiceberg or iceberg-rust.
>
> On Thu, Jul 18, 2024 at 11:26 AM Xuanwo <xua...@apache.org> wrote:
>
>> Hi, Renjie
>>
>> Are you suggesting that we refactor HadoopCatalog as a FileSystemCatalog
>> to enable direct reading from file systems like HDFS, S3, and Azure Blob
>> Storage? This catalog would be read-only and would not support write
>> operations.
>>
>> On Thu, Jul 18, 2024, at 10:23, Renjie Liu wrote:
>>
>> Hi, Ryan:
>>
>> Thanks for raising this. I agree that HadoopCatalog is dangerous for
>> manipulating tables/catalogs given the limitations of different file
>> systems. But I see that there are some users who want to read Iceberg
>> tables without relying on any catalogs; this is also the motivating use
>> case for StaticTable in pyiceberg and iceberg-rust. Is there anything
>> similar in the Java implementation?
>>
>> On Thu, Jul 18, 2024 at 7:01 AM Ryan Blue <b...@apache.org> wrote:
>>
>> Hey everyone,
>>
>> There has been some recent discussion about improving
>> HadoopTableOperations and the catalog based on those tables, but we've
>> discouraged using file-system-only tables (or "hadoop" tables) for years
>> now because of major problems:
>> * It is only safe to use hadoop tables with HDFS; most local file
>> systems, S3, and other common object stores are unsafe
>> * Despite not providing atomicity guarantees outside of HDFS, people use
>> the tables in unsafe situations
>> * HadoopCatalog cannot implement atomic operations for rename and drop
>> table, which are commonly used in data engineering
>> * Alternative file names (for instance when using metadata file
>> compression) also break guarantees
>>
>> While these tables are useful for testing in non-production scenarios, I
>> think it's misleading to have them in the core module because there's an
>> appearance that they are a reasonable choice. I propose we deprecate the
>> HadoopTableOperations and HadoopCatalog implementations and move them to
>> tests the next time we can make breaking API changes (2.0).
>>
>> I think we should also consider similar fixes to the table spec. It
>> currently describes how HadoopTableOperations works, which does not work
>> in object stores or local file systems. HDFS is becoming much less common,
>> and I propose that we note that the strategy in the spec should ONLY be
>> used with HDFS.
>>
>> What do other people think?
>>
>> Ryan
>>
>> --
>> Ryan Blue
>>
>> Xuanwo
>>
>> https://xuanwo.io/
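For reference, option 1 from Renjie's list could look roughly like the sketch below: serve loads through StaticTable and reject every mutating call. The class name, method set, and the hard-coded v1 metadata path are all hypothetical, not an existing pyiceberg or iceberg-java API:

    from pyiceberg.table import StaticTable

    class ReadOnlyFileSystemCatalog:
        """Hypothetical sketch: resolve tables under a warehouse path, reads only."""

        def __init__(self, warehouse: str):
            self.warehouse = warehouse

        def load_table(self, namespace: str, name: str) -> StaticTable:
            # A real implementation would list the metadata directory and pick
            # the latest vN.metadata.json; "v1" is hard-coded to keep this short.
            location = f"{self.warehouse}/{namespace}/{name}/metadata/v1.metadata.json"
            return StaticTable.from_metadata(location)

        def create_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

        def drop_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

        def rename_table(self, *args, **kwargs):
            raise NotImplementedError("this catalog is read-only")

This is the Python analog of throwing UnsupportedOperationException on the Java side; it sidesteps the atomic rename/drop problems Ryan listed because no write path exists at all.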