Hi Dan,
Thanks for the quick reply.

> For #2, the answer follows mostly because if the answer to #1 holds, then
> yes the pairwise intersection of entries in the manifest files of a given
> snapshot is empty.


Just to be pedantic: even with unique file names, it seems one could
construct a snapshot as:
Manifest 1: Add File A
Manifest 2: Delete File A

From your answer it sounds like this is unexpected and readers generally
don't try to reconcile Deletes and Adds?
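
To make the case concrete, here is a rough Python sketch of the check I have
in mind. It is not the Iceberg library API: `find_conflicts`, the file name
`A.parquet`, and the dict-of-lists layout are made up for illustration. The
only thing taken from the spec is the entry status codes (0 existing,
1 added, 2 deleted).

```python
from collections import defaultdict

# Manifest entry status codes as listed in the Iceberg spec.
EXISTING, ADDED, DELETED = 0, 1, 2


def find_conflicts(manifests):
    """Return {file_path: [(manifest_name, status), ...]} for every path
    that has more than one entry across the snapshot's manifests.

    `manifests` is a hypothetical in-memory stand-in for a snapshot:
    a dict mapping a manifest name to a list of (file_path, status) pairs.
    """
    seen = defaultdict(list)
    for manifest_name, entries in manifests.items():
        for file_path, status in entries:
            seen[file_path].append((manifest_name, status))
    # A path with a single entry is fine; anything else breaks disjointness.
    return {path: refs for path, refs in seen.items() if len(refs) > 1}


# Hypothetical snapshot matching the example above:
# Manifest 1 adds file A, Manifest 2 deletes file A.
snapshot = {
    "manifest-1": [("A.parquet", ADDED)],
    "manifest-2": [("A.parquet", DELETED)],
}
conflicts = find_conflicts(snapshot)
```

If the pairwise intersection really is expected to be empty, `conflicts`
should come back empty for any valid snapshot. For the snapshot above it
flags `A.parquet`, which is the situation I am asking about.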

Thanks,
Micah

On Fri, Mar 4, 2022 at 2:10 PM Daniel Weeks <dwe...@apache.org> wrote:

> Hey Micah,
>
> For #1, I don't believe the spec clearly calls out that all data/delete files
> must be unique, but the requirements for cleanup would be violated in
> certain cases if you had the same file referenced in multiple manifests.
> In practice, the best way to ensure data correctness and metadata
> consistency is to ensure that all referenced files have unique locations
> and that those locations do not get overwritten.
>
> For #2, the answer follows mostly because if the answer to #1 holds, then
> yes the pairwise intersection of entries in the manifest files of a given
> snapshot is empty.
>
> The java library does perform some checks to prevent a file from being
> added to the same manifest multiple times, but I don't think that
> extends to all possible ways of adding files.  So it may be possible, but
> not a good idea.
>
> Sam might know if there's a way to add a nav for the format page (it is a
> little difficult to navigate at the moment).
>
> -Dan
>
> On Thu, Mar 3, 2022 at 4:49 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi Iceberg Dev,
>> I tried searching for it in the specification but couldn't find anything
>> explicit:
>>
>> 1.  Is it assumed that all data files and delete files will always have
>> globally unique names in a table?
>> 2.  Is it expected that the pairwise intersection of all manifest files
>> in a snapshot is empty (i.e., any given data file has exactly zero or
>> one entries across all manifest files in a snapshot)?
>>
>> I think the uniqueness of both can maybe be inferred by this sentence
>> (but I'm not 100% sure):
>>
>>> When a file is replaced or deleted from the dataset, its manifest entry
>>> fields store the snapshot ID in which the file was deleted and status 2
>>> (deleted). The file may be deleted from the file system when the snapshot
>>> in which it was deleted is garbage collected, assuming that older snapshots
>>> have also been garbage collected [1].
>>
>>
>> Thanks,
>> Micah
>>
>>
>> P.S. Is there a way to add a table of contents to the specification?  I
>> might be missing it but I don't see one rendered at:
>> https://iceberg.apache.org/spec/
>>
>
