I think “is it worth it” is the right question. We *could* use putIfAbsent
in some back-ends, exclusive file creation in others, atomic renames in
some more. We *could* come up with an API that abstracts those things. In
the end, though, we would be left with a situation in which:

   - These operations need to be designed, implemented, and added to a spec
   for each backing file store
   - Catalog behavior for important functions like drop and rename is
   inconsistent and depends on the capabilities of the underlying store

I don’t think that work is worth it.

One of the critical things that we set out to do in Iceberg was to make
sharp edges go away by restoring the table abstraction. Things should just
work so that users don’t need to make choices about what to use in a table
based on underlying details like file format. I think this choice was one
of the critical factors in Iceberg’s success and I’ve talked about it for
years. Making the spec significantly more complicated only to deliver
inconsistent behavior across file storage systems is not a good idea
because it is incompatible with this core tenet.

I think that we want to support lisoda and others in the short term by not
abruptly dropping code they depend on, but I also think our efforts are
best spent finding a good path for those users to migrate to catalogs.

Ryan

On Wed, Jul 31, 2024 at 8:37 AM Jack Ye <yezhao...@gmail.com> wrote:

> I guess the problem with an OVERWRITE flag for rename is that, with this
> flag, file mutual exclusion seems to be more difficult to enforce, and the
> differences among file systems become really nuanced. If two writers both
> have the OVERWRITE flag on, then it seems like the file system should just
> let one clobber the other, because that fits the behavior of "overwriting".
> But I don't know enough about native HDFS behavior to comment on it.
>
> But overall, I think we have a pretty good idea of what the exact
> problem is now. The HadoopCatalog/HadoopTableOperations at this moment does
> have an atomicity problem, regardless of the file system it is interacting
> with.
>
> The best way I see to fix it at this moment is to treat version-hint like
> a Delta _last_checkpoint [1] file, which is a best-effort write: it only
> indicates a starting version from which to find the latest version. The
> version files should probably be zero-padded, so that the reader can read
> version-hint and then do a listing starting at that hinted version to find
> the latest metadata version. I think the questions are (1) do we want to
> fix it in this way, and (2) is it worth it for the community to maintain
> such an implementation.
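>
> For illustration, a rough sketch of what that reader-side resolution could
> look like on top of the Hadoop FileSystem API (the helper name and the
> zero-padded naming are assumptions here, not an existing Iceberg API):
>
> import java.io.BufferedReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.nio.charset.StandardCharsets;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> class MetadataResolver {
>   // The hint is only a best-effort starting point; the listing is what
>   // actually determines the latest committed metadata version.
>   static Path findLatestMetadata(FileSystem fs, Path metadataDir) throws IOException {
>     long hint = 0;
>     try (BufferedReader in = new BufferedReader(new InputStreamReader(
>         fs.open(new Path(metadataDir, "version-hint.text")), StandardCharsets.UTF_8))) {
>       String line = in.readLine();
>       hint = line != null ? Long.parseLong(line.trim()) : 0;
>     } catch (IOException | NumberFormatException e) {
>       // best-effort: if the hint is missing or unreadable, list from version 0
>     }
>
>     long latest = -1;
>     Path latestPath = null;
>     // On an object store this could be narrowed to a prefix listing that
>     // starts at the hinted version, since zero-padded names sort lexically.
>     for (FileStatus status : fs.listStatus(metadataDir)) {
>       String name = status.getPath().getName();
>       if (name.startsWith("v") && name.endsWith(".metadata.json")) {
>         long version = Long.parseLong(name.substring(1, name.indexOf('.')));
>         if (version >= hint && version > latest) {
>           latest = version;
>           latestPath = status.getPath();
>         }
>       }
>     }
>     return latestPath; // null means no committed metadata was found
>   }
> }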
>
> And going beyond rename, if we really want a storage-based Iceberg
> catalog solution beyond the current file system table spec, I think the
> right approach is to introduce semantics like
>
> - atomic_write(file_name, content, last_file_info), where last_file_info
> can be things like generation ID, version number, checksum, etc. depending
> on system implementation
> - atomic_create(file_name, content)
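>
> As a rough sketch of what that abstraction could look like (the interface
> and the LastFileInfo type are hypothetical; nothing like this exists in
> FileIO today):
>
> // Hypothetical abstraction over storage-specific conditional writes.
> public interface AtomicMetadataStore {
>
>   /**
>    * Atomically replaces fileName with content, but only if the file on
>    * storage still matches lastFileInfo (e.g. a GCS generation number, an
>    * Azure ETag, or a future S3 conditional-write token). Fails if another
>    * writer has already replaced the file.
>    */
>   void atomicWrite(String fileName, byte[] content, LastFileInfo lastFileInfo);
>
>   /** Atomically creates fileName with content, failing if it already exists. */
>   void atomicCreate(String fileName, byte[] content);
>
>   /** Opaque precondition token; concrete fields depend on the backing store. */
>   interface LastFileInfo {}
> }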
>
> These could be leveraged by GCS and Azure, and maybe S3 in the future, to
> atomically swap the latest metadata file in place under the same name,
> instead of going with HDFS's rename-to-a-new-version-file approach. And the
> questions on this route become: (1) do we think it is worth building a
> storage-based catalog with such an abstraction (e.g. FileIOCatalog?
> OpenDALCatalog?), or should people who want a storage-only solution just
> build dedicated solutions for GCS, Azure, etc. that are not strictly
> portable across platforms? (2) Again, is it worth it for the community to
> maintain such an implementation?
>
> Curious what people think.
>
> -Jack
>
> [1]
> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#last-checkpoint-file
>
> On Wed, Jul 31, 2024 at 8:05 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Oh I remember now, I think it was because the HDFS rename semantics fail
>> when the destination file already exists. However, I think in the latest
>> HDFS, with the FileContext API, an OVERWRITE flag can be passed to the
>> rename call to make it succeed [1]:
>>
>> > If OVERWRITE option is not passed as an argument, rename fails if the
>> dst already exists.
>> > If OVERWRITE option is passed as an argument, rename overwrites the dst
>> if it is a file or an empty directory. Rename fails if dst is a non-empty
>> directory.
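>>
>> For example, roughly (paths made up for illustration):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileContext;
>> import org.apache.hadoop.fs.Options;
>> import org.apache.hadoop.fs.Path;
>>
>> // Write the new metadata under a temporary name first, then swap it in.
>> FileContext fc = FileContext.getFileContext(new Configuration());
>> Path tmp = new Path("/warehouse/db/t/metadata/.tmp.metadata.json");
>> Path dst = new Path("/warehouse/db/t/metadata/v13.metadata.json");
>>
>> // Without Options.Rename.OVERWRITE this fails if dst already exists;
>> // with it, an existing destination file is replaced.
>> fc.rename(tmp, dst, Options.Rename.OVERWRITE);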
>>
>> -Jack
>>
>> [1]
>> https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileContext.html#rename-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Options.Rename...-
>>
>>
>> On Wed, Jul 31, 2024 at 8:04 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> My guess would be to avoid complications with multiple committers
>>> attempting to swap at the same time.
>>>
>>> On Wed, Jul 31, 2024 at 9:50 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> I see, thank you Fokko, this is very helpful context.
>>>>
>>>> Looking at the PR and the discussions in it, it seems like the version
>>>> hint file is the key problem here. The file system table spec [1] is
>>>> technically correct: it only uses a single rename operation to perform
>>>> the atomic commit, and defines the v<version>.metadata.json with the
>>>> highest version as the latest file. However, the additional write of a
>>>> version hint file seems problematic, since it can fail independently,
>>>> cause all sorts of edge-case behaviors, and does not strictly follow the
>>>> spec.
>>>>
>>>> I agree that if we want to properly follow the current file system
>>>> table spec, then the right way is to stop the commit process after
>>>> renaming to the v<version>.metadata.json, and the reader should perform a
>>>> listing to discover the latest metadata file. If we go with that, this
>>>> interestingly becomes highly similar to the Delta Lake protocol, where
>>>> the zero-padded log files [2] are discovered using this mechanism [3], I
>>>> believe. And they have implementations for different storage systems
>>>> including HDFS, S3, Azure, and GCS, with a pluggable extension point.
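>>>>
>>>> On the writer side that would be roughly the following sketch
>>>> (writeMetadataFile and the surrounding variables are assumed here, not an
>>>> existing API):
>>>>
>>>> // Write the new metadata under a temporary name, then use a single
>>>> // rename as the atomic commit step, as the spec describes.
>>>> Path tmp = new Path(metadataDir, java.util.UUID.randomUUID() + ".metadata.json.tmp");
>>>> writeMetadataFile(fs, tmp, newMetadata);  // hypothetical helper
>>>> Path next = new Path(metadataDir, "v" + (version + 1) + ".metadata.json");
>>>>
>>>> // On HDFS a rename fails if the destination already exists, so a
>>>> // concurrent committer of the same version loses and has to retry.
>>>> if (!fs.rename(tmp, next)) {
>>>>   fs.delete(tmp, false);
>>>>   throw new RuntimeException("Concurrent commit to version " + (version + 1));
>>>> }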
>>>>
>>>> One question I have now: what is the motivation in the file system
>>>> table spec to rename the latest table metadata to v<version>.metadata.json,
>>>> rather than just a fixed name like latest.metadata.json? Why is the version
>>>> number in the file name important?
>>>>
>>>> -Jack
>>>>
>>>> [1] https://iceberg.apache.org/spec/#file-system-tables
>>>> [2]
>>>> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
>>>> [3]
>>>> https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
>>>>
>>>>
>>>>
>>>> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Jack,
>>>>>
>>>>>> no atomic drop table support: this seems pretty fixable, as you can
>>>>>> change the semantics of dropping a table to be deleting the latest table
>>>>>> version hint file, instead of having to delete everything in the folder.
>>>>>> I feel that actually also fits the semantics of purge/no-purge better.
>>>>>
>>>>>
>>>>> I would invite you to check out lisoda's PR
>>>>> <https://github.com/apache/iceberg/pulls/BsoBird> (#9546
>>>>> <https://github.com/apache/iceberg/pull/9546> is an earlier version
>>>>> with more discussion) that works towards removing the version hint file to
>>>>> avoid discrepancies between the latest committed metadata and the metadata
>>>>> that's referenced in the hint file. These can go out of sync since the
>>>>> operation there is not atomic. Removing the hint file introduces other
>>>>> problems, since you then have to determine the latest version of the
>>>>> metadata using prefix-listing, which is neither efficient nor desirable
>>>>> on an object store, as you already mentioned.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>> On Wed, Jul 31, 2024 at 04:39 Jack Ye <yezhao...@gmail.com> wrote:
>>>>>
>>>>>> Atomicity is just one requirement; it also needs to be efficient,
>>>>>> ideally a metadata-only operation.
>>>>>>
>>>>>> Looking at some of the GCS documentation [1], the rename operation is
>>>>>> still a COPY + DELETE behind the scenes unless it is a hierarchical
>>>>>> namespace bucket. The Azure documentation [2] also suggests that the
>>>>>> fast rename feature is only available with a hierarchical namespace,
>>>>>> which is for Gen2 buckets. I found little documentation about the exact
>>>>>> rename guarantees and semantics of ADLS though. But it is undeniable
>>>>>> that at least GCS and Azure should be able to work with HadoopCatalog
>>>>>> pretty well with their latest offerings.
>>>>>>
>>>>>> Steve, if you could share more insight into this and related
>>>>>> documentation, that would be really appreciated.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>>>>> [2]
>>>>>> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran
>>>>>> <ste...@cloudera.com.invalid> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hey everyone,
>>>>>>>>
>>>>>>>> There has been some recent discussion about improving
>>>>>>>> HadoopTableOperations and the catalog based on those tables, but we've
>>>>>>>> discouraged using file system only tables (or "hadoop" tables) for
>>>>>>>> years now because of major problems:
>>>>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>>>>> systems, S3, and other common object stores are unsafe
>>>>>>>>
>>>>>>>
>>>>>>> Azure storage and Linux local filesystems all support atomic file
>>>>>>> and dir rename and delete; Google GCS does it for files and dirs only.
>>>>>>> Windows, well, anybody who claims to understand the semantics of
>>>>>>> MoveFile is probably wrong (
>>>>>>> https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw
>>>>>>> )
>>>>>>>
>>>>>>> * Despite not providing atomicity guarantees outside of HDFS, people
>>>>>>>> use the tables in unsafe situations
>>>>>>>>
>>>>>>>
>>>>>>> which means "s3", unless something needs directory rename
>>>>>>>
>>>>>>>
>>>>>>>> * HadoopCatalog cannot implement atomic operations for rename and
>>>>>>>> drop table, which are commonly used in data engineering
>>>>>>>> * Alternative file names (for instance when using metadata file
>>>>>>>> compression) also break guarantees
>>>>>>>>
>>>>>>>> While these tables are useful for testing in non-production
>>>>>>>> scenarios, I think it's misleading to have them in the core module
>>>>>>>> because there's an appearance that they are a reasonable choice. I
>>>>>>>> propose we deprecate the HadoopTableOperations and HadoopCatalog
>>>>>>>> implementations and move them to tests the next time we can make
>>>>>>>> breaking API changes (2.0).
>>>>>>>>
>>>>>>>> I think we should also consider similar fixes to the table spec. It
>>>>>>>> currently describes how HadoopTableOperations works, which does not
>>>>>>>> work in object stores or local file systems. HDFS is becoming much
>>>>>>>> less common and I propose that we note that the strategy in the spec
>>>>>>>> should ONLY be used with HDFS.
>>>>>>>>
>>>>>>>> What do other people think?
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>>
>>>>>>>

-- 
Ryan Blue
Databricks
