Hey Iceberg Community,

Here are the minutes and recording from our Iceberg Sync that took place today, *April 13th, 9am-10am PT*.
As always, anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone seeking an invite. The notes and agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>, which is also attached to the meeting invitation; it's a good place to add items as you see fit so we can discuss them in the next community sync.

Meeting Recording ⭕ <https://drive.google.com/file/d/1SvJEckioxSITAOGqhNXB2vr_dUkjRrNx/view>

Top of the Meeting Highlights

- View Spec: The view spec has been committed and updated, so we're ready to start implementing it. (Great addition to the format; thanks Anjali and the team at Netflix!)
- Encryption: The basics for encryption have been committed. (Thanks Gideon! Also thanks to Jack and Russell for the reviews!)
- Python: A lot of progress on the client: literals, schemas, bucketing, and single-value serde have all been added. (Thanks Steve, Sam, and Jun!)
- V2 Tables in Trino: Trino can now read V2 tables! There is a lot of great activity in the Trino community around Iceberg support.
- Default Values: LinkedIn is leading the conversation around adding support for default values, where you can set a default value for a column in a table. Please weigh in on the issue and spec proposal; since this is a breaking change, it will go into the v3 spec.

Releases

- 0.13.2 patch release
  - There's a single blocker here: the Flink UPSERT bug. This is currently being worked on and needs a bit more digging.
- 0.14.0 status update
  - Runtime jar license updates for the new Apache HTTP client library
  - Drop support for Flink 1.12 (PR <https://github.com/apache/iceberg/pull/4551>)
  - Snapshot expiration with branches and tags.
  - Amogh is working on this and has a few PRs in flight, but is generally very close.

Snapshot Expiration w/ Branching + Tagging

- Currently, there are two forms of expiration:
  - Reachability Comparison: compares file reachability trees between two snapshots
  - Original Expiration: assumes a linear history of the table, so as you remove old snapshots, any data file deleted in those snapshots can be removed
- With branching, there is no visibility into what can actually be removed during snapshot expiration without knowledge of the entire metadata tree (i.e., reachability comparison).
- Consensus: for the initial PR, raise an error when expiring snapshots on a table with multiple branches. We can then follow up with the algorithm to perform snapshot expiration correctly.

Docs Site Navigation Flow Proposal (proposal diagram <https://drive.google.com/file/d/1ar7MaHkHkOwjTbFxGYTGB_ER8tWxnn99/view?usp=sharing> and worksheet <https://docs.google.com/document/d/1Y_PRv6p5oJaxg_68AUia_JHw8P4-AZIu3hP5IH2Cpsw/edit>)

- This is a proposal for reorganizing the docs site.
- The diagram represents a navigation flow that should capture various personas.
- Please share any feedback on the diagram or any comments from your experience using the Iceberg docs.

Default Values

- The target is SQL behavior for default values. This means changing a default value does not change any actual data.
- Adding a new column with a default value makes the value appear for all existing records (without actually rewriting previous files).
- New writes after the addition of a column with a default value should write the actual value to the file if a value is not provided in the INSERT.
- Two kinds of defaults are emerging in the spec:
  - Initial Default: the default used for existing data files written before the column existed
  - Write Default: a secondary default that's used when writes occur without a value provided
- Initial defaults will be set at table creation, and at the same time the value will be set for the write default.
- The initial default cannot be changed, but write defaults can be.
- There are a lot of discussions currently happening around the behavior here, so please chime in on the spec PR #4301 <https://github.com/apache/iceberg/pull/4301>.

Stats and Index File (format <https://github.com/apache/iceberg-docs/pull/69> and implementation <https://github.com/apache/iceberg/pull/4537>)

- Stats
  - For this case, we want to keep track of column-level sketches for NDVs (numbers of distinct values).
  - The current proposal is to use the Apache DataSketches Theta Sketch <https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html>.
    - This lets us estimate the number of distinct values in a column given a fairly limited-size buffer.
    - Apache DataSketches also has a C++ core library in addition to Java.
  - The current spec is a footer-based file format for keeping track of sketches (KBs to a couple of MBs in size).
  - This is a backward- and forward-compatible change since it's informational.
  - Please check out the proposal and the design doc by the Trino community and leave any comments/feedback.

Thanks everyone!
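P.S. For anyone curious how a Theta-style sketch can estimate distinct counts from a small fixed-size buffer, here's a toy k-minimum-values (KMV) sketch in Python. This is only the intuition behind Theta sketches, not the Apache DataSketches API and not the proposed Iceberg file format; the function name and parameters are illustrative.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy k-minimum-values (KMV) estimator: hash each value to a
    pseudo-uniform number in [0, 1), keep only the k smallest hashes,
    and infer the distinct count from how close the k-th smallest
    hash sits to zero. Illustrative only, not the DataSketches API."""
    max_hash = float(2 ** 64)
    smallest = set()
    for v in values:
        # Identical inputs hash identically, so duplicates collapse.
        h = int.from_bytes(
            hashlib.sha256(str(v).encode()).digest()[:8], "big"
        ) / max_hash
        smallest.add(h)
        if len(smallest) > k:
            # Keep only the k smallest hashes: the fixed-size buffer.
            smallest.remove(max(smallest))
    if len(smallest) < k:
        # Fewer than k distinct hashes were ever seen: count is exact.
        return len(smallest)
    # The k-th smallest of n uniform draws lands near k / n,
    # so n is approximately (k - 1) / (k-th smallest hash).
    return int(round((k - 1) / max(smallest)))
```

With k retained hashes the relative error is roughly 1/sqrt(k), which is why a buffer of a few KB per column is enough to track NDVs, and why the sketch file in the proposal stays in the KB-to-MB range.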