My guess would be that it's to avoid complications with multiple committers
attempting to swap at the same time.

On Wed, Jul 31, 2024 at 9:50 AM Jack Ye <yezhao...@gmail.com> wrote:

> I see, thank you Fokko, this is a very helpful context.
>
> Looking at the PR and the discussions in it, it seems like
> the version hint file is the key problem here. The file system table spec
> [1] is technically correct: it uses only a single rename operation to
> perform the atomic commit, and defines that v<version>.metadata.json is
> the latest file. However, the additional write of a version hint file seems
> problematic, as that write can fail independently and cause all sorts of
> edge-case behaviors, and it does not strictly follow the spec.
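>
> To make that concrete, the whole commit can be a single rename, roughly as
> in the minimal sketch below (assuming a Hadoop FileSystem whose rename is
> atomic and fails when the destination already exists, as on HDFS; the class
> and method names here are hypothetical):
>
>   import java.io.IOException;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   class RenameCommit {
>     // Commit new table metadata by atomically renaming a staged temp file
>     // to the next versioned name; a losing committer sees rename() return
>     // false because the destination already exists.
>     static void commit(FileSystem fs, Path metadataDir, Path tempMetadata,
>         int nextVersion) throws IOException {
>       Path dst = new Path(metadataDir, "v" + nextVersion + ".metadata.json");
>       if (!fs.rename(tempMetadata, dst)) {
>         throw new IOException("Commit failed: v" + nextVersion + " already exists");
>       }
>     }
>   }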
>
> I agree that if we want to properly follow the current file system table
> spec, then the right way is to stop the commit process after renaming to
> v<version>.metadata.json, and have the reader perform a listing to
> discover the latest metadata file. If we go with that, this becomes
> remarkably similar to the Delta Lake protocol, where the zero-padded log
> files [2] are discovered using this mechanism [3], I believe. They have
> implementations for different storage systems including HDFS, S3, Azure
> and GCS, with a pluggable extension point.
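>
> For illustration, reader-side discovery could look roughly like the sketch
> below (again assuming a Hadoop FileSystem; the names are hypothetical).
> Note that because Delta zero-pads its log file names, a sorted listing is
> already in version order, whereas v<version>.metadata.json requires parsing
> the number out of each name:
>
>   import java.io.IOException;
>   import java.util.regex.Matcher;
>   import java.util.regex.Pattern;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   class MetadataDiscovery {
>     private static final Pattern VERSIONED =
>         Pattern.compile("v(\\d+)\\.metadata\\.json");
>
>     // List the metadata directory and pick the highest committed version,
>     // instead of trusting a separately written hint file.
>     static Path latestMetadata(FileSystem fs, Path metadataDir)
>         throws IOException {
>       Path latest = null;
>       int maxVersion = -1;
>       for (FileStatus status : fs.listStatus(metadataDir)) {
>         Matcher m = VERSIONED.matcher(status.getPath().getName());
>         if (m.matches() && Integer.parseInt(m.group(1)) > maxVersion) {
>           maxVersion = Integer.parseInt(m.group(1));
>           latest = status.getPath();
>         }
>       }
>       return latest; // null means no committed metadata was found
>     }
>   }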
>
> One question I have now: what is the motivation in the file system table
> spec to rename the latest table metadata to v<version>.metadata.json,
> rather than just a fixed name like latest.metadata.json? Why is the version
> number in the file name important?
>
> -Jack
>
> [1] https://iceberg.apache.org/spec/#file-system-tables
> [2]
> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
> [3]
> https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
>
>
>
> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong <fo...@apache.org>
> wrote:
>
>> Jack,
>>
>> no atomic drop table support: this seems pretty fixable, as you can
>>> change the semantics of dropping a table to be deleting the latest table
>>> version hint file, instead of having to delete everything in the folder. I
>>> feel that actually also fits the semantics of purge/no-purge better.
>>
>>
>> I would invite you to check out lisoda's PR
>> <https://github.com/apache/iceberg/pulls/BsoBird> (#9546
>> <https://github.com/apache/iceberg/pull/9546> is an earlier version with
>> more discussion) that works towards removing the version hint file, to avoid
>> discrepancies between the latest committed metadata and the metadata that's
>> referenced in the hint file. These can go out of sync since the hint write
>> is not atomic with the commit. Removing the hint file introduces other
>> problems: you then have to determine the latest version of the metadata
>> using prefix-listing, which is neither efficient nor desirable on an object
>> store, as you already mentioned.
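>>
>> For context, the hint-based flow is roughly the two-step sequence sketched
>> below (a simplified, hypothetical sketch, not the actual
>> HadoopTableOperations code); a failure between the two steps leaves the
>> hint pointing at a stale version:
>>
>>   import java.io.IOException;
>>   import org.apache.hadoop.fs.FSDataOutputStream;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   class HintedCommit {
>>     static void commit(FileSystem fs, Path metadataDir, Path tempMetadata,
>>         int nextVersion) throws IOException {
>>       // Step 1: the atomic commit via rename.
>>       Path dst = new Path(metadataDir, "v" + nextVersion + ".metadata.json");
>>       if (!fs.rename(tempMetadata, dst)) {
>>         throw new IOException("v" + nextVersion + " already exists");
>>       }
>>       // Step 2: overwrite the hint. This write is NOT atomic with step 1,
>>       // so a crash here leaves readers pointing at the previous version.
>>       Path hint = new Path(metadataDir, "version-hint.text");
>>       try (FSDataOutputStream out = fs.create(hint, true /* overwrite */)) {
>>         out.writeBytes(Integer.toString(nextVersion));
>>       }
>>     }
>>   }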
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Jul 31, 2024 at 04:39, Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Atomicity is just one requirement; it also needs to be efficient,
>>> ideally a metadata-only operation.
>>>
>>> Looking at some documentation of GCS [1], the rename operation is still
>>> a COPY + DELETE behind the scenes unless it is a hierarchical namespace
>>> bucket. The Azure documentation [2] also suggests that the fast rename
>>> feature is only available with a hierarchical namespace, which is for
>>> Gen2 buckets. I found little documentation about the exact rename
>>> guarantees and semantics of ADLS, though. But it is undeniable that at
>>> least GCS and Azure should be able to work with HadoopCatalog pretty
>>> well with their latest offerings.
>>>
>>> Steve, if you could share more insight into this and related
>>> documentation, that would be really appreciated.
>>>
>>> -Jack
>>>
>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>> [2]
>>> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>
>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran
>>> <ste...@cloudera.com.invalid> wrote:
>>>
>>>>
>>>>
>>>> On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> There has been some recent discussion about improving
>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>> discouraged using file system only tables (or "hadoop" tables) for
>>>>> years now because of major problems:
>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>> systems, S3, and other common object stores are unsafe
>>>>>
>>>>
>>>> Azure storage and linux local filesystems all support atomic file and
>>>> dir rename and delete; google gcs does it for files and dirs only. Windows,
>>>> well, anybody who claims to understand the semantics of MoveFile is
>>>> probably wrong (
>>>> https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw
>>>> )
>>>>
>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>>> use the tables in unsafe situations
>>>>>
>>>>
>>>> which means "s3", unless something needs directory rename
>>>>
>>>>
>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop
>>>>> table, which are commonly used in data engineering
>>>>> * Alternative file names (for instance when using metadata file
>>>>> compression) also break guarantees
>>>>>
>>>>> While these tables are useful for testing in non-production scenarios,
>>>>> I think it's misleading to have them in the core module because there's an
>>>>> appearance that they are a reasonable choice. I propose we deprecate the
>>>>> HadoopTableOperations and HadoopCatalog implementations and move them to
>>>>> tests the next time we can make breaking API changes (2.0).
>>>>>
>>>>> I think we should also consider similar fixes to the table spec. It
>>>>> currently describes how HadoopTableOperations works, which does not work 
>>>>> in
>>>>> object stores or local file systems. HDFS is becoming much less common and
>>>>> I propose that we note that the strategy in the spec should ONLY be used
>>>>> with HDFS.
>>>>>
>>>>> What do other people think?
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>>
>>>>
