Hi everyone,

I recently started a discussion on Slack and was advised to bring it to the dev 
mailing list.
As Puffin statistics files are starting to catch on, we are bound to come 
across situations where a writer wants to create a new statistics file while 
the current snapshot's statistics file already contains data that the writer 
does not understand. I ran into this problem in practice when I ran 
`ANALYZE TABLE` in iceberg-spark, which created a new metadata file and 
replaced my proprietary index data with its own.
You could argue that a single type of writer is expected for a table, but on 
the other hand, the spirit of Iceberg is portability. We can't know who is 
accessing the table and possibly corrupting its statistics data.

Before I get into the proposed solutions, I think it's important to distinguish 
two scenarios in which statistics files are written: data-changing and 
non-data-changing.
For data-changing scenarios, I think it's reasonable to assume that old 
statistics files are no longer valid and are therefore OK to replace. In the 
rest of this email, I will focus on the non-data-changing scenarios, where 
statistics are generated and attached to the current snapshot via a new 
metadata file, as these are the problematic ones.

After a short discussion on Slack, we see roughly three possible solutions. I 
think all of them require a change to the Iceberg spec, though of varying 
magnitude:

      1. Enforce carry-over of unknown blob data into new Puffin files.
           Pros:
             - Backwards-compatible reads, not only in terms of the Iceberg
               spec, but also in terms of statistics file semantics.
             - Simple to implement because blob-level metadata is already
               available.
             - One reader could potentially understand statistics blobs
               calculated by different writers.
           Cons:
             - Write amplification.
             - Conflict resolution might require rewriting the whole file.

      2. Allow multiple statistics files to be bound to a snapshot.
           Pros:
             - Avoids write amplification.
             - Each writer cares only about its own statistics file.
             - Finding relevant statistics files is easy thanks to file-level
               metadata.
             - One reader could understand statistics files written by
               different writers.
           Cons:
             - Backwards-incompatible reads.

      3. Create a new snapshot when computing statistics.
           Pros:
             - Avoids write amplification.
             - Each writer cares only about its own statistics files.
           Cons:
             - Requires readers to iterate over past snapshots in order to
               find the last valid entry written by a compatible writer.
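
To make options 1 and 3 a bit more concrete, here is a rough Python sketch. It 
is not Iceberg API: `Blob`, `Snapshot`, `carry_over`, `latest_blob` and the 
`changes_data` flag are names made up for illustration, and `acme-index-v1` is 
a hypothetical proprietary blob type. Blobs are matched on type only, which is 
a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for one blob in a Puffin statistics file."""
    type: str       # blob type identifier
    payload: bytes  # opaque content; the writer may not understand it


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical, heavily simplified snapshot record."""
    snapshot_id: int
    parent_id: Optional[int]
    changes_data: bool            # assumption: False for statistics-only snapshots
    stats: Tuple[Blob, ...] = ()  # blobs attached to this snapshot


def carry_over(existing: Iterable[Blob], fresh: Iterable[Blob]) -> List[Blob]:
    """Option 1: build the blob list for the new statistics file.

    Blob types the writer regenerates are replaced. Everything else,
    including types the writer does not understand, is copied unchanged.
    """
    fresh = list(fresh)
    regenerated = {blob.type for blob in fresh}
    kept = [blob for blob in existing if blob.type not in regenerated]
    return kept + fresh


def latest_blob(
    snapshots: Dict[int, Snapshot], current_id: int, blob_type: str
) -> Optional[Blob]:
    """Option 3: walk back through parents to find a usable blob.

    The walk stops after the first data-changing snapshot, because
    statistics attached to anything older describe different data.
    """
    sid: Optional[int] = current_id
    while sid is not None:
        snap = snapshots[sid]
        for blob in snap.stats:
            if blob.type == blob_type:
                return blob
        if snap.changes_data:
            return None
        sid = snap.parent_id
    return None
```

The write amplification in option 1 is visible in `carry_over`: every unknown 
blob is copied on every rewrite. The reader-side cost of option 3 is the loop 
in `latest_blob`.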

I've definitely left some pros and cons out, but you can roughly map these 
cases to ways we handle existing file types (metadata, manifest lists, 
manifests). I'm sure people who have spent time designing the spec can more 
easily list out the possible pitfalls. In my humble opinion, #3 might be the 
most straightforward, but #2 is what I initially expected from the spec. We are 
doing #1 internally because it's the only thing we can do in the current 
situation.

Let me know what you think.
Cheers,
Dzeri
