I think in the situation you're demonstrating, the manifests are separated
across two separate snapshots.

Here's an example:

create table t1 (s string);
insert into t1 values ('foo');   -- snapshot 0, manifest-list with 1
                                 -- manifest pointing to file A (ADDED)
insert into t1 values ('bar');   -- snapshot 1, manifest-list with 2
                                 -- manifests pointing to file A (ADDED), file B (ADDED)
delete from t1 where s = 'foo';  -- snapshot 2, manifest-list with 2
                                 -- manifests pointing to file A (DELETED), file B (ADDED)

The paths are not unique across snapshots 1 and 2 (files A and B are
referenced by both), but within each snapshot they are.
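
To make that concrete, here is a minimal Python sketch of the check. It
assumes a hypothetical in-memory model (the `snapshots` dict and the
`duplicate_paths` helper are made up for illustration and are not part of
any Iceberg API): each snapshot maps to the (path, status) entries collected
from all of its manifests.

```python
from collections import Counter

# Hypothetical model of the example above: snapshot id -> (file path, status)
# entries gathered from every manifest in that snapshot's manifest list.
snapshots = {
    0: [("A", "ADDED")],
    1: [("A", "ADDED"), ("B", "ADDED")],
    2: [("A", "DELETED"), ("B", "ADDED")],
}


def duplicate_paths(entries):
    """Return the sorted paths that appear in more than one entry."""
    counts = Counter(path for path, _status in entries)
    return sorted(path for path, n in counts.items() if n > 1)


# Within each snapshot, no path is listed twice.
within = {sid: duplicate_paths(entries) for sid, entries in snapshots.items()}

# Pooling the entries of snapshots 1 and 2 surfaces the repeated paths.
across = duplicate_paths(snapshots[1] + snapshots[2])
```

Under these assumptions `within` has no duplicates for any snapshot, while
`across` reports both files, which matches the statement above.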

Now, in the same case, if both rows were in the same data file, the delete
would rewrite that file like this (assuming no row-level deletes):

create table t1 (s string);
insert into t1 values ('foo'), ('bar');  -- snapshot 0, manifest-list with
                                         -- 1 manifest pointing to file A (ADDED)
delete from t1 where s = 'foo';          -- snapshot 1, manifest-list with 1
                                         -- manifest pointing to file A (DELETED) + file B (ADDED)
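
As a rough illustration of how a reader would see this, the Python sketch
below computes the live files of each snapshot by skipping entries marked
DELETED. The `live_files` helper and the `rewrite_example` dict are
hypothetical names for this sketch, not the Java library's API.

```python
# Hypothetical model of the rewrite example: snapshot id -> (path, status)
# entries from the single manifest in each snapshot's manifest list.
rewrite_example = {
    0: [("A", "ADDED")],
    1: [("A", "DELETED"), ("B", "ADDED")],
}


def live_files(entries):
    """Return the paths a scan of the snapshot would read.

    Entries with status DELETED only record the removal, so they are skipped.
    """
    return {path for path, status in entries if status != "DELETED"}


live = {sid: live_files(entries) for sid, entries in rewrite_example.items()}
```

In this model snapshot 0 reads only file A and snapshot 1 reads only file B,
since file B holds the rewritten data without the deleted row.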

I hope I'm understanding your example correctly, but let me know if I'm off
track here.

Thanks,
Dan



On Fri, Mar 4, 2022 at 2:23 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Dan,
> Thanks for the quick reply.
>
>
>> For #2, the answer follows mostly because if the answer to #1 holds, then
>> yes the pairwise intersection of entries in the manifest files of a given
>> snapshot is empty.
>
>
> Just to be pedantic: even with unique file names, it seems one could
> construct a snapshot as:
> Manifest 1: Add File A
> Manifest 2: Delete File A
>
> From your answer it sounds like this is unexpected and readers generally
> don't try to reconcile Deletes and Adds?
>
> Thanks,
> Micah
>
> On Fri, Mar 4, 2022 at 2:10 PM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Micah,
>>
>> For #1, I don't believe the spec clearly calls out that all data/delete files
>> must be unique, but the requirements for cleanup would be violated in
>> certain cases if you had the same file referenced in multiple manifests.
>> In practice, the best way to ensure data correctness and metadata
>> consistency is to ensure that all referenced files have unique locations
>> and that those locations do not get overwritten.
>>
>> For #2, the answer follows mostly because if the answer to #1 holds, then
>> yes the pairwise intersection of entries in the manifest files of a given
>> snapshot is empty.
>>
>> The java library does perform some checks to prevent a file from being
>> added to the same manifest multiple times, but I don't think that
>> extends to all possible ways of adding files.  So it may be possible, but
>> not a good idea.
>>
>> Sam might know if there's a way to add a nav for the format page (it is a
>> little difficult to navigate at the moment).
>>
>> -Dan
>>
>> On Thu, Mar 3, 2022 at 4:49 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Hi Iceberg Dev,
>>> I tried searching for it in the specification but couldn't find anything
>>> explicit:
>>>
>>> 1.  Is it assumed that all data files and delete files will always have
>>> globally unique names in a table?
>>> 2.  Is it expected that the pairwise intersection of all manifest files
>>> in a snapshot is empty (i.e. For any given data file it has exactly zero or
>>> 1 entries across all manifest files in a snapshot)?
>>>
>>> I think the uniqueness of both can maybe be inferred by this sentence
>>> (but I'm not 100% sure):
>>>
>>>> When a file is replaced or deleted from the dataset, its manifest
>>>> entry fields store the snapshot ID in which the file was deleted and status
>>>> 2 (deleted). The file may be deleted from the file system when the snapshot
>>>> in which it was deleted is garbage collected, assuming that older snapshots
>>>> have also been garbage collected [1].
>>>
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>> P.S. Is there a way to add a table of contents to the specification?  I
>>> might be missing it but I don't see one rendered at:
>>> https://iceberg.apache.org/spec/
>>>
>>
