Hey Iceberg Community, Anurag started a separate, focused discussion <https://lists.apache.org/thread/jbh1gbrso5h6l4by9rh9poy2cjjtb8j0> on the column update file representation, similarly, let me start another one for the metadata representation. Hopefully, we can make some iterations on this before the next sync.
We covered this topic in the sync yesterday and agreed on some of the fields, but we left the "tracking" information part open. The *required* fields we agreed on so far: ColumnFile field_ids list<int> location string file_size_in_bytes long *Tracking information* Additionally to the above, we discussed the need of tracking information. These are the potential ones: *1) Sequence number* - Usage for _last_updated_sequence_number I did think about how to produce _last_updated_sequence_number and I think technically we don't need to store the sequence number on the update file level for that. I wrote up the steps here <https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.xvm52pv4m7lq>, but in a nutshell: we could either fille that from the _last_updated_sequence_number written into the latest column file, or if null we can use the base file's file_sequence_number. - Usage for equality deletes As we agreed previously, we don't want to support update files together with equality deletes, so we won't need to store column file level sequence numbers for this either. - Usage for CDC, observability, etc. I'm wondering if there is any use case where we want to see the order of the column updates to see the sequence they were created. If this matters for CDC or reproducibility or anything else, then let's have a column file level sequence number too, if not, we can omit this. *2) Status* I think, similarly to TrackedFile, we need the following statuses here: EXISTING, ADDED, DELETED, REPLACED With these, when the base file's status is REPLACED, taking a look at the column_files we can know exactly what has changed wrt the column updates. Some examples to demonstrate: Step 1: Start with an existing base file: base file: {location: "file.parquet", seq_num: 1, file_seq_num: 1, status: EXISTING, column_files:[]} Step 2: Adding a column update for field IDs [1, 2]: base file: {location: "file.parquet", seq_num: 1, file_seq_num: *2*, status: *REPLACED*, column_files: [ *{field_ids: [1, 2], location: "update1.parquet", status: ADDED}* ]} Step 3: Adding an overlapping column update with field IDs [2, 3] ("de-duplicate" field IDs): base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status: REPLACED, column_files: [ {field_ids: *[1],* location: "update1.parquet", status: *REPLACED}, **{field_ids: [2, 3], location: "update2.parquet", status: ADDED}* ]} Step 4: Add another column update for field ID [1] to completely eliminate one previous update file from metadata base file: {location: "file.parquet", seq_num: 1, file_seq_num: *4*, status: REPLACED, column_files: [ {field_ids: [1]*,* location: "update1.parquet", status: *DELETED},* {field_ids: [2, 3], location: "update2.parquet", status: *EXISTING*}, *{field_ids: [1], location: "update3.parquet", status: ADDED}* ]} *Thoughts on REPLACED* In step 3, we marked the existing column file as REPLACED while reducing the field_ids list to de-duplicate them with the incoming update file's field_ids. With this, REPLACED indicates that field_ids content was reduced, however, we won't know exactly what field IDs were removed. - Alternative approach 1: We could use DELETED status leaving the field ID list intact, and then create a new ColumnFile with the reduced list. Step 3 would look like this: base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status: REPLACED, column_files: [ {field_ids: [1, 2]*,* location: "update1.parquet", status: *DELETED}, **{field_ids: [1], location: "update1.parquet", status: ADDED}, **{field_ids: [2, 3], location: "update2.parquet", status: ADDED}* ]} - Alternative approach 2: We can use REPLACED as originally, and also have a field in the tracking data to *keep track of the removed field IDs* (similarly to Tracking.DELETED_POSITIONS). Step 3 would look like this: base file: {location: "file.parquet", seq_num: 1, file_seq_num: *3*, status: REPLACED, column_files: [ {field_ids: *[1],* location: "update1.parquet", status: *REPLACED, removed_field_ids: [2]}, **{field_ids: [2, 3], location: "update2.parquet", status: ADDED}* ]} - Preference: I think the REPLACED approach is cleaner, I'd prefer that. In case we want to track what IDs were removed, we could follow "alternative approach 2". - Additional, note: Re-writing the column file as REPLACED shouldn't alter the sequence number of the column file (if we decide to have one). *3) Snapshot ID* 'Tracking' has this, I think it could make sense for column files too. *4) First row ID* Row IDs should come from the base file's metadata IMO, we shouldn't store this for the update files. *Summary of all the potential tracking fields:* ColumnFileTracking required status int optional snapshot_id long optional sequence_number long optional removed_field_ids list<int> *Field IDs* The first free field ID within TrackedFile is 157. The last used one is DeletionVector.CARDINALITY <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/DeletionVector.java#L42> with field ID 156. I'm working with Amogh to coordinate assigning the required field IDs here. Let me know if I miss anything here! Any feedback is appreciated! Best Regards, Gabor
