Hi everyone,

I've recently started a discussion on Slack and was advised to post in the dev mailing list.

As puffin/statistics files are starting to catch on, we are bound to come across situations where one writer wants to create a new statistics file while some data which it might not understand is already present in the current snapshot's statistics file. I've come across this problem in real life, when I ran `ANALYZE TABLE` in iceberg-spark, which created a new metadata file and replaced my proprietary index data with its own. You could argue that a single type of writer is expected for a table, but on the other hand, the spirit of Iceberg is portability. We can't know who's accessing the table and possibly corrupting its (statistics-)data.
Before I get into the proposed solutions, I think it's important to distinguish two scenarios in which statistics files are being written: data-changing and non-data-changing. For data-changing scenarios, I think it's reasonable to assume that old statistics files are no longer valid, and are therefore OK to replace. In the rest of this email, I will focus on scenarios where statistics are being generated and attached to the current snapshot via a new metadata file, as these are the problematic ones.

After a short discussion in Slack, we roughly see three possible solutions. I think all of them require a change to the iceberg spec, but with varying gravity:

1. Enforce carry-over of unknown blob data into new puffin files.
   Pros:
   - Backwards-compatible reads, not only in terms of the iceberg spec, but also in terms of statistics files semantics.
   - Simple to implement because blob-level metadata is already available.
   - One reader could potentially understand statistics blobs calculated by different writers.
   Cons:
   - Write amplification.
   - Conflict resolution might require re-writing the whole file again.

2. Allow for multiple statistics files to be bound to a snapshot.
   Pros:
   - Avoids write amplification.
   - Each writer cares only about its own statistics file.
   - Finding relevant statistics files is easy thanks to file-level metadata.
   - One reader could understand statistics files written by different writers.
   Cons:
   - Backwards-incompatible reads.

3. Create new snapshot when computing statistics.
   Pros:
   - Avoids write amplification.
   - Each writer cares only about its own statistics files.
   Cons:
   - Requires readers to iterate over past snapshots in order to find last valid entry written by a compatible writer.

I've definitely left some pros and cons out, but you can roughly map these cases to ways we handle existing file types (metadata, manifest lists, manifests). I'm sure people who have spent time designing the spec can more easily list out the possible pitfalls.
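To make option 1 a bit more concrete, here is a minimal Python sketch of what a carry-over rule could look like. It is only an illustration, not Iceberg's or Puffin's API: the `Blob` class and the `carry_over` function are hypothetical, the blob type `acme-index-v1` is made up, and the rule itself (a new blob replaces an old one only when both the blob type and the field IDs match, everything else is copied) is my assumption about how "unknown" blobs would be detected.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for one Puffin blob plus its blob-level metadata."""
    type: str                 # blob type, e.g. "apache-datasketches-theta-v1"
    fields: Tuple[int, ...]   # field IDs the blob was computed for
    payload: bytes = b""      # opaque blob content; never inspected here


def carry_over(existing: Iterable[Blob], fresh: Iterable[Blob]) -> List[Blob]:
    """Build the blob list for a new statistics file.

    Assumed rule: a freshly computed blob replaces an existing blob with the
    same (type, fields) key; every other existing blob, including blobs of
    types this writer does not understand, is copied over unchanged.
    """
    fresh = list(fresh)
    replaced = {(b.type, b.fields) for b in fresh}
    kept = [b for b in existing if (b.type, b.fields) not in replaced]
    return kept + fresh
```

With this rule, a writer that only recomputes a theta sketch for field 1 would still write out a proprietary `acme-index-v1` blob it found in the current file. The write amplification listed under the cons is visible here too: every kept payload has to be copied into the new file.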
In my humble opinion, #3 might be the most straightforward, but #2 is what I initially expected from the spec. We are doing #1 internally because it's the only thing we can do in the current situation.

Let me know what you think.

Cheers,
Dzeri
