My guess would be to avoid complications with multiple committers attempting to swap at the same time.
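To make that concrete, here is a minimal sketch of a commit that targets a version-numbered file name through the Hadoop FileSystem API (the class and method names are hypothetical, not the actual HadoopTableOperations code):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VersionedCommitSketch {
  // Try to claim the next version by renaming staged metadata to the
  // version-numbered target. On HDFS the default rename fails (returns false)
  // when the destination already exists, so only one of several concurrent
  // committers can win v<nextVersion>; the losers learn their base is stale.
  static int commit(FileSystem fs, Path metadataDir, Path stagedMetadata, int readVersion)
      throws IOException {
    int nextVersion = readVersion + 1;
    Path target = new Path(metadataDir, "v" + nextVersion + ".metadata.json");
    if (!fs.rename(stagedMetadata, target)) {
      throw new IOException(
          "v" + nextVersion + " was committed concurrently; re-read the table and retry");
    }
    return nextVersion;
  }
}

With a fixed name like latest.metadata.json the destination would always exist after the first commit, so the rename would have to overwrite it, and a writer working from a stale base version could silently clobber a concurrent commit.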
On Wed, Jul 31, 2024 at 9:50 AM Jack Ye <yezhao...@gmail.com> wrote:

> I see, thank you Fokko, this is very helpful context.
>
> Looking at the PR and the discussions in it, it seems like the version hint file is the key problem here. The file system table spec [1] is technically correct and only uses a single rename operation to perform the atomic commit, defining v<version>.metadata.json as the latest file. However, the additional write of a version hint file seems problematic, since that write can fail independently, cause all sorts of edge-case behavior, and does not strictly follow the spec.
>
> I agree that if we want to properly follow the current file system table spec, then the right way is to stop the commit process after renaming to v<version>.metadata.json, and the reader should perform a listing to discover the latest metadata file. If we go with that, this interestingly becomes very similar to the Delta Lake protocol, where the zero-padded log files [2] are, I believe, discovered using this mechanism [3]. They have implementations for different storage systems, including HDFS, S3, Azure, and GCS, with a pluggable extension point.
>
> One question I have now: what is the motivation in the file system table spec to rename the latest table metadata to v<version>.metadata.json, rather than just a fixed name like latest.metadata.json? Why is the version number in the file name important?
>
> -Jack
>
> [1] https://iceberg.apache.org/spec/#file-system-tables
> [2] https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
> [3] https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
>
> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Jack,
>>
>>> no atomic drop table support: this seems pretty fixable, as you can change the semantics of dropping a table to be deleting the latest table version hint file, instead of having to delete everything in the folder. I feel that actually also fits the semantics of purge/no-purge better.
>>
>> I would invite you to check out lisoda's PR <https://github.com/apache/iceberg/pulls/BsoBird> (#9546 <https://github.com/apache/iceberg/pull/9546> is an earlier version with more discussion), which works towards removing the version hint file to avoid discrepancies between the latest committed metadata and the metadata referenced in the hint file. These can go out of sync because the hint write is not atomic. Removing the hint file introduces another problem: the latest metadata version then has to be determined by prefix listing, which, as you already mentioned, is neither efficient nor desirable on an object store.
>>
>> Kind regards,
>> Fokko
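A minimal sketch of the prefix-listing discovery discussed above, assuming Hadoop's FileSystem API and the v<version>.metadata.json naming from the spec (the helper class is hypothetical, not an existing Iceberg API):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataListingSketch {
  private static final Pattern VERSIONED_METADATA = Pattern.compile("v(\\d+)\\.metadata\\.json");

  // Scan the table's metadata directory and return the highest committed
  // v<version>.metadata.json, instead of trusting a separately written hint file.
  static Path latestMetadata(FileSystem fs, Path metadataDir) throws IOException {
    Path latest = null;
    int maxVersion = -1;
    for (FileStatus file : fs.listStatus(metadataDir)) {
      Matcher m = VERSIONED_METADATA.matcher(file.getPath().getName());
      if (m.matches()) {
        int version = Integer.parseInt(m.group(1));
        if (version > maxVersion) {
          maxVersion = version;
          latest = file.getPath();
        }
      }
    }
    return latest; // null means no committed metadata yet
  }
}

This is roughly the role LogStore.listFrom plays for Delta's zero-padded log entries, and it is also where the efficiency concern comes in: on an object store, every table load now pays for a LIST over the metadata prefix.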
>> Op wo 31 jul 2024 om 04:39 schreef Jack Ye <yezhao...@gmail.com>:
>>
>>> Atomicity is just one requirement; it also needs to be efficient, desirably a metadata-only operation.
>>>
>>> Looking at the GCS documentation [1], the rename operation is still a COPY + DELETE behind the scenes unless it is a hierarchical namespace bucket. The Azure documentation [2] also suggests that the fast rename feature is only available with a hierarchical namespace, which is for Gen2 buckets. I found little documentation about the exact rename guarantees and semantics of ADLS, though. But it is undeniable that at least GCS and Azure should be able to work with HadoopCatalog pretty well with their latest offerings.
>>>
>>> Steve, if you could share more insight into this and related documentation, that would be really appreciated.
>>>
>>> -Jack
>>>
>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>> [2] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>
>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran <ste...@cloudera.com.invalid> wrote:
>>>
>>>> On Thu, 18 Jul 2024 at 00:02, Ryan Blue <b...@apache.org> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> There has been some recent discussion about improving HadoopTableOperations and the catalog based on those tables, but we've discouraged using file-system-only tables (or "hadoop" tables) for years now because of major problems:
>>>>>
>>>>> * It is only safe to use hadoop tables with HDFS; most local file systems, S3, and other common object stores are unsafe
>>>>
>>>> Azure storage and Linux local filesystems all support atomic file and dir rename and delete; Google GCS does it for files and dirs only. Windows, well, anybody who claims to understand the semantics of MoveFile is probably wrong (https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw).
>>>>
>>>>> * Despite not providing atomicity guarantees outside of HDFS, people use the tables in unsafe situations
>>>>
>>>> Which means "s3", unless something needs directory rename.
>>>>
>>>>> * HadoopCatalog cannot implement atomic operations for rename and drop table, which are commonly used in data engineering
>>>>> * Alternative file names (for instance when using metadata file compression) also break guarantees
>>>>>
>>>>> While these tables are useful for testing in non-production scenarios, I think it's misleading to have them in the core module because there's an appearance that they are a reasonable choice. I propose we deprecate the HadoopTableOperations and HadoopCatalog implementations and move them to tests the next time we can make breaking API changes (2.0).
>>>>>
>>>>> I think we should also consider similar fixes to the table spec. It currently describes how HadoopTableOperations works, which does not work in object stores or local file systems. HDFS is becoming much less common, and I propose that we note that the strategy in the spec should ONLY be used with HDFS.
>>>>>
>>>>> What do other people think?
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Blue
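The safety concern above comes down to whether the final rename can be made to fail when the destination already exists. A rough sketch of that step (hypothetical names, not the actual HadoopTableOperations code), assuming Hadoop's FileSystem API:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameCommitSafety {
  // On HDFS the rename itself refuses to overwrite an existing file, so the
  // window between the exists() check and the rename is harmless. On a store
  // where rename is a client-side copy followed by a delete (for example S3A),
  // two committers can both pass the check and both appear to succeed,
  // silently losing one of the commits.
  static boolean tryFinalize(FileSystem fs, Path stagedMetadata, Path finalMetadata)
      throws IOException {
    if (fs.exists(finalMetadata)) {
      return false; // someone else already claimed this version
    }
    return fs.rename(stagedMetadata, finalMetadata);
  }
}

The same call produces very different guarantees depending on the store underneath, which is the gap the deprecation proposal is trying to stop people from falling into.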