Oh I remember now, I think it was because of the HDFS rename semantics: rename fails when the destination already exists. However, I think in recent HDFS, with the FileContext API, an OVERWRITE flag can be passed to make the rename succeed [1]:

> If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
> If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
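For illustration, a minimal sketch of what that could look like (untested, assuming a recent Hadoop 3.x client; the paths are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileContext;
    import org.apache.hadoop.fs.Options;
    import org.apache.hadoop.fs.Path;

    public class RenameOverwrite {
      public static void main(String[] args) throws Exception {
        FileContext fc = FileContext.getFileContext(new Configuration());
        // Staged metadata file and its final name (example paths only).
        Path src = new Path("/warehouse/db/t/metadata/v3.metadata.json.tmp");
        Path dst = new Path("/warehouse/db/t/metadata/v3.metadata.json");
        // Without OVERWRITE this fails if dst already exists; with it,
        // an existing destination file is replaced as part of the rename.
        fc.rename(src, dst, Options.Rename.OVERWRITE);
      }
    }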
-Jack

[1] https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileContext.html#rename-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Options.Rename...-

On Wed, Jul 31, 2024 at 8:04 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> My guess would be to avoid complications with multiple committers
> attempting to swap at the same time.
>
> On Wed, Jul 31, 2024 at 9:50 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> I see, thank you Fokko, this is very helpful context.
>>
>> Looking at the PR and the discussions in it, it seems like the version
>> hint file is the key problem here. The file system table spec [1] is
>> technically correct: it uses only a single rename operation to perform
>> the atomic commit, and defines v<version>.metadata.json as the latest
>> file. However, the additional write of a version hint file seems
>> problematic, since it can fail independently and cause all sorts of
>> edge-case behaviors, and it does not strictly follow the spec.
>>
>> I agree that if we want to properly follow the current file system
>> table spec, then the right way is to stop the commit process after
>> renaming to v<version>.metadata.json, and the reader should perform a
>> listing to discover the latest metadata file. If we go with that, this
>> interestingly becomes highly similar to the Delta Lake protocol, where
>> the zero-padded log files [2] are, I believe, discovered using this
>> mechanism [3]. And they have implementations for different storage
>> systems including HDFS, S3, Azure, and GCS, with a pluggable extension
>> point.
>>
>> One question I have now: what is the motivation in the file system
>> table spec to rename the latest table metadata to
>> v<version>.metadata.json, rather than using a fixed name like
>> latest.metadata.json? Why is the version number in the file name
>> important?
>>
>> -Jack
>>
>> [1] https://iceberg.apache.org/spec/#file-system-tables
>> [2] https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
>> [3] https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
>>
>> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong <fo...@apache.org> wrote:
>>
>>> Jack,
>>>
>>>> no atomic drop table support: this seems pretty fixable, as you can
>>>> change the semantics of dropping a table to be deleting the latest
>>>> table version hint file, instead of having to delete everything in
>>>> the folder. I feel that actually also fits the semantics of
>>>> purge/no-purge better.
>>>
>>> I would invite you to check out lisoda's PR
>>> <https://github.com/apache/iceberg/pulls/BsoBird> (#9546
>>> <https://github.com/apache/iceberg/pull/9546> is an earlier version
>>> with more discussion) that works towards removing the version hint
>>> file, to avoid discrepancies between the latest committed metadata
>>> and the metadata that's referenced in the hint file. These can go out
>>> of sync since the operation there is not atomic. Removing the hint
>>> file introduces other problems: you have to determine the latest
>>> version of the metadata using prefix-listing, which is neither
>>> efficient nor desirable on an object store, as you already mentioned.
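>>>
>>> To make the cost concrete, the discovery would amount to something
>>> like the sketch below (written against the plain Hadoop FileSystem
>>> API, not what the PR actually does); every lookup is at least one
>>> LIST call against the store:
>>>
>>>     import java.util.regex.Matcher;
>>>     import java.util.regex.Pattern;
>>>     import org.apache.hadoop.fs.FileStatus;
>>>     import org.apache.hadoop.fs.FileSystem;
>>>     import org.apache.hadoop.fs.Path;
>>>
>>>     public class FindLatestMetadata {
>>>       // Scan the metadata directory for vN.metadata.json files and
>>>       // return the highest N, or -1 if none exist.
>>>       static int findLatestVersion(FileSystem fs, Path metadataDir)
>>>           throws java.io.IOException {
>>>         Pattern p = Pattern.compile("v(\\d+)\\.metadata\\.json");
>>>         int latest = -1;
>>>         for (FileStatus st : fs.listStatus(metadataDir)) {
>>>           Matcher m = p.matcher(st.getPath().getName());
>>>           if (m.matches()) {
>>>             latest = Math.max(latest, Integer.parseInt(m.group(1)));
>>>           }
>>>         }
>>>         return latest;
>>>       }
>>>     }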
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Wed, Jul 31, 2024 at 04:39, Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Atomicity is just one requirement; it also needs to be efficient,
>>>> ideally a metadata-only operation.
>>>>
>>>> Looking at some documentation for GCS [1], the rename operation is
>>>> still a COPY + DELETE behind the scenes unless it is a hierarchical
>>>> namespace bucket. The Azure documentation [2] also suggests that the
>>>> fast rename feature is only available with a hierarchical namespace,
>>>> which is a Gen2 feature. I found little documentation about the
>>>> exact rename guarantees and semantics of ADLS, though. But it seems
>>>> clear that at least GCS and Azure should be able to work with
>>>> HadoopCatalog pretty well with their latest offerings.
>>>>
>>>> Steve, if you could share more insight into this and any related
>>>> documentation, that would be really appreciated.
>>>>
>>>> -Jack
>>>>
>>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>>> [2] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>>
>>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran
>>>> <ste...@cloudera.com.invalid> wrote:
>>>>
>>>>> On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> There has been some recent discussion about improving
>>>>>> HadoopTableOperations and the catalog based on those tables, but
>>>>>> we've discouraged using file-system-only tables (or "hadoop"
>>>>>> tables) for years now because of major problems:
>>>>>> * It is only safe to use hadoop tables with HDFS; most local file
>>>>>> systems, S3, and other common object stores are unsafe
>>>>>
>>>>> Azure storage and linux local filesystems all support atomic file
>>>>> and dir rename and delete; google gcs does it for files and dirs
>>>>> only. Windows, well, anybody who claims to understand the semantics
>>>>> of MoveFile is probably wrong
>>>>> (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw)
>>>>>
>>>>>> * Despite not providing atomicity guarantees outside of HDFS,
>>>>>> people use the tables in unsafe situations
>>>>>
>>>>> which means "s3", unless something needs directory rename
>>>>>
>>>>>> * HadoopCatalog cannot implement atomic operations for rename and
>>>>>> drop table, which are commonly used in data engineering
>>>>>> * Alternative file names (for instance when using metadata file
>>>>>> compression) also break guarantees
>>>>>>
>>>>>> While these tables are useful for testing in non-production
>>>>>> scenarios, I think it's misleading to have them in the core module
>>>>>> because there's an appearance that they are a reasonable choice. I
>>>>>> propose we deprecate the HadoopTableOperations and HadoopCatalog
>>>>>> implementations and move them to tests the next time we can make
>>>>>> breaking API changes (2.0).
>>>>>>
>>>>>> I think we should also consider similar fixes to the table spec.
>>>>>> It currently describes how HadoopTableOperations works, which does
>>>>>> not work in object stores or local file systems. HDFS is becoming
>>>>>> much less common and I propose that we note that the strategy in
>>>>>> the spec should ONLY be used with HDFS.
>>>>>>
>>>>>> What do other people think?
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
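>>>>>
>>>>> To make the first bullet concrete: the commit being discussed is
>>>>> essentially one rename of a staged metadata file to its final
>>>>> versioned name, roughly the sketch below (not the actual iceberg
>>>>> code). It is only correct where that rename is atomic and refuses
>>>>> to clobber an existing destination, i.e. HDFS:
>>>>>
>>>>>     import java.io.IOException;
>>>>>     import org.apache.hadoop.fs.FileSystem;
>>>>>     import org.apache.hadoop.fs.Path;
>>>>>
>>>>>     public class RenameCommit {
>>>>>       // On HDFS, renaming onto an existing path fails, so the
>>>>>       // loser of a race gets "false" back. On stores that emulate
>>>>>       // rename with COPY + DELETE, two committers can both
>>>>>       // "succeed" and silently overwrite each other.
>>>>>       static boolean commit(FileSystem fs, Path tmp, Path dst)
>>>>>           throws IOException {
>>>>>         return fs.rename(tmp, dst);
>>>>>       }
>>>>>     }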