Oh I remember now, I think it was because of the HDFS rename semantics: rename fails when the destination already exists. However, I think in recent HDFS, with the FileContext API, an OVERWRITE flag can be passed to make the rename succeed [1]:

> If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
> If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
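For illustration, a minimal sketch of what that could look like (untested, assuming a recent Hadoop 3.x client; the paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Options;
    import org.apache.hadoop.fs.Path;

    public class RenameOverwrite {
      public static void main(String[] args) throws Exception {
        FileContext fc = FileContext.getFileContext(new Configuration());
        // Staged metadata file and its final name (example paths only).
        Path src = new Path("/warehouse/db/t/metadata/v3.metadata.json.tmp");
        Path dst = new Path("/warehouse/db/t/metadata/v3.metadata.json");
        // Without OVERWRITE this fails if dst already exists; with it,
        // an existing destination file is replaced as part of the rename.
        fc.rename(src, dst, Options.Rename.OVERWRITE);
      }
    }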
-Jack

[1] https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileContext.html#rename-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Options.Rename...-

On Wed, Jul 31, 2024 at 8:04 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> My guess would be to avoid complications with multiple committers
> attempting to swap at the same time.
>
> On Wed, Jul 31, 2024 at 9:50 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> I see, thank you Fokko, this is very helpful context.
>>
>> Looking at the PR and the discussions in it, it seems like the version
>> hint file is the key problem here. The file system table spec [1] is
>> technically correct: it uses only a single rename operation to perform
>> the atomic commit, and defines v<version>.metadata.json as the latest
>> file. However, the additional write of a version hint file seems
>> problematic, since it can fail independently and cause all sorts of
>> edge-case behaviors, and it does not strictly follow the spec.
>>
>> I agree that if we want to properly follow the current file system
>> table spec, then the right way is to stop the commit process after
>> renaming to v<version>.metadata.json, and the reader should perform a
>> listing to discover the latest metadata file. If we go with that, this
>> interestingly becomes highly similar to the Delta Lake protocol, where
>> the zero-padded log files [2] are, I believe, discovered using this
>> mechanism [3]. And they have implementations for different storage
>> systems including HDFS, S3, Azure, and GCS, with a pluggable extension
>> point.
>>
>> One question I have now: what is the motivation in the file system
>> table spec to rename the latest table metadata to
>> v<version>.metadata.json, rather than using a fixed name like
>> latest.metadata.json? Why is the version number in the file name
>> important?
>>
>> -Jack
>>
>> [1] https://iceberg.apache.org/spec/#file-system-tables
>> [2] https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
>> [3] https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
>>
>> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong <fo...@apache.org> wrote:
>>
>>> Jack,
>>>
>>>> no atomic drop table support: this seems pretty fixable, as you can
>>>> change the semantics of dropping a table to be deleting the latest
>>>> table version hint file, instead of having to delete everything in
>>>> the folder. I feel that actually also fits the semantics of
>>>> purge/no-purge better.
>>>
>>> I would invite you to check out lisoda's PR
>>> <https://github.com/apache/iceberg/pulls/BsoBird> (#9546
>>> <https://github.com/apache/iceberg/pull/9546> is an earlier version
>>> with more discussion) that works towards removing the version hint
>>> file, to avoid discrepancies between the latest committed metadata
>>> and the metadata that's referenced in the hint file. These can go out
>>> of sync since the operation there is not atomic. Removing the hint
>>> file introduces other problems: you have to determine the latest
>>> version of the metadata using prefix-listing, which is neither
>>> efficient nor desirable on an object store, as you already mentioned.
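>>>
>>> To make the cost concrete, the discovery would amount to something
>>> like the sketch below (written against the plain Hadoop FileSystem
>>> API, not what the PR actually does); every lookup is at least one
>>> LIST call against the store:
>>>
>>>     import java.util.regex.Matcher;
>>>     import java.util.regex.Pattern;
>>>     import org.apache.hadoop.fs.FileStatus;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>
>>>     public class FindLatestMetadata {
>>>       // Scan the metadata directory for vN.metadata.json files and
>>>       // return the highest N, or -1 if none exist.
>>>       static int findLatestVersion(FileSystem fs, Path metadataDir)
>>>           throws java.io.IOException {
>>>         Pattern p = Pattern.compile("v(\\d+)\\.metadata\\.json");
>>>         int latest = -1;
>>>         for (FileStatus st : fs.listStatus(metadataDir)) {
>>>           Matcher m = p.matcher(st.getPath().getName());
>>>           if (m.matches()) {
>>>             latest = Math.max(latest, Integer.parseInt(m.group(1)));
>>>           }
>>>         }
>>>         return latest;
>>>       }
>>>     }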
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Jul 31, 2024 at 04:39, Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Atomicity is just one requirement; it also needs to be efficient,
>>>> ideally a metadata-only operation.
>>>>
>>>> Looking at some documentation for GCS [1], the rename operation is
>>>> still a COPY + DELETE behind the scenes unless it is a hierarchical
>>>> namespace bucket. The Azure documentation [2] also suggests that the
>>>> fast rename feature is only available with a hierarchical namespace,
>>>> which is a Gen2 feature. I found little documentation about the
>>>> exact rename guarantees and semantics of ADLS, though. But it seems
>>>> clear that at least GCS and Azure should be able to work with
>>>> HadoopCatalog pretty well with their latest offerings.
>>>>
>>>> Steve, if you could share more insight into this and any related
>>>> documentation, that would be really appreciated.
>>>>
>>>> -Jack
>>>>
>>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>>> [2] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>>
>>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran
>>>> <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> There has been some recent discussion about improving
>>>>>> HadoopTableOperations and the catalog based on those tables, but
>>>>>> we've discouraged using file-system-only tables (or "hadoop"
>>>>>> tables) for years now because of major problems:
>>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>>> systems, S3, and other common object stores are unsafe
>>>>>
>>>>> Azure storage and linux local filesystems all support atomic file
>>>>> and dir rename and delete; google gcs does it for files and dirs
>>>>> only. Windows, well, anybody who claims to understand the semantics
>>>>> of MoveFile is probably wrong
>>>>> (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw)
>>>>>
>>>>>> * Despite not providing atomicity guarantees outside of HDFS,
>>>>>> people use the tables in unsafe situations
>>>>>
>>>>> which means "s3", unless something needs directory rename
>>>>>
>>>>>> * HadoopCatalog cannot implement atomic operations for rename and
>>>>>> drop table, which are commonly used in data engineering
>>>>>> * Alternative file names (for instance when using metadata file
>>>>>> compression) also break guarantees
>>>>>>
>>>>>> While these tables are useful for testing in non-production
>>>>>> scenarios, I think it's misleading to have them in the core module
>>>>>> because there's an appearance that they are a reasonable choice. I
>>>>>> propose we deprecate the HadoopTableOperations and HadoopCatalog
>>>>>> implementations and move them to tests the next time we can make
>>>>>> breaking API changes (2.0).
>>>>>>
>>>>>> I think we should also consider similar fixes to the table spec.
>>>>>> It currently describes how HadoopTableOperations works, which does
>>>>>> not work in object stores or local file systems. HDFS is becoming
>>>>>> much less common and I propose that we note that the strategy in
>>>>>> the spec should ONLY be used with HDFS.
>>>>>>
>>>>>> What do other people think?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
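>>>>>
>>>>> To make the first bullet concrete: the commit being discussed is
>>>>> essentially one rename of a staged metadata file to its final
>>>>> versioned name, roughly the sketch below (not the actual iceberg
>>>>> code). It is only correct where that rename is atomic and refuses
>>>>> to clobber an existing destination, i.e. HDFS:
>>>>>
>>>>>     import java.io.IOException;
>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>     import org.apache.hadoop.fs.Path;
>>>>>
>>>>>     public class RenameCommit {
>>>>>       // On HDFS, renaming onto an existing path fails, so the
>>>>>       // loser of a race gets "false" back. On stores that emulate
>>>>>       // rename with COPY + DELETE, two committers can both
>>>>>       // "succeed" and silently overwrite each other.
>>>>>       static boolean commit(FileSystem fs, Path tmp, Path dst)
>>>>>           throws IOException {
>>>>>         return fs.rename(tmp, dst);
>>>>>       }
>>>>>     }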