Hey Iceberg Community,

Here are the minutes and recording from our Iceberg Sync that took place today, *May 25th, 9am-10am PT*.

Always remember, anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. The notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>, which is also attached to the meeting invitation. It's a good place to add items as you see fit so we can discuss them in the next community sync.

Meeting Recording ⭕ <https://drive.google.com/file/d/1FISvrM3eEWQuIfQnZAKbRN0Xyk2hH36_/view?usp=sharing>

Top of the Meeting Highlights
- Added an incremental append scan interface (Thanks, Steven!); a usage sketch is appended after these notes
- Backports for 0.13.2 are done (Thanks, Eduard!)
- API validation using revapi was added (Thanks, Kyle!)
- Added all_files and all_delete_files metadata tables (Thanks, Szehon!); a query sketch is appended after these notes

Releases
- 0.13.2
  - All of the backports are merged
  - Milestone <https://github.com/apache/iceberg/milestone/18?closed=1> with merged PRs
- 1.0.0 (no 0.14 release)
  - LICENSE updates are done
  - API checking is done
  - Incremental snapshot expiration is still pending
  - Metadata table schema guarantees
  - An alternative option is to release a 0.14 with a quick follow-up 1.0.0 release that removes any deprecations

Agenda
- Minimum supported Python version changed from 3.7 to 3.8
  - Proposal to change from tox to pre-commit: PR #4811 <https://github.com/apache/iceberg/pull/4811>
  - Change scan: PR #4870 <https://github.com/apache/iceberg/pull/4870>
- Incremental Scans
  - Most CDC operations require excluding rows that are unchanged, or joining rows by ID to create a pre-image/post-image; a join sketch is appended after these notes
  - Some difficulties around using DataSourceV2 (getting pure deletes/inserts requires shuffling)
  - One option is defining a view that uses incremental scans to do a pre-image/post-image analysis (a view catalog has not been added to Spark yet, but there's an existing SPIP, SPARK-31357 <https://issues.apache.org/jira/browse/SPARK-31357>, and PR #35636 <https://github.com/apache/spark/pull/35636>)
- Puffin: new name for the index and stats file format
  - Secondary index metadata stored as blobs of binary data
  - Theta sketches
- Snapshot branching and tagging syntax
  - Option 1: `<database>.<table>.<branch_name>`
    - Potential conflicts with metadata table names (possibly a rare occurrence)
  - Option 2: Qualifying prefixes such as `branch$<branch_name>` or `tag:<tag_name>`
    - No standardized way in SQL to specify a tag or branch, which could delay implementation upstream
  - Option 3: An option/context setting
    - Setting options is currently not possible in spark-sql
  - Before implementing any of this logic, let's work through a proposal

Thanks everyone!

--
Sam Redai <s...@tabular.io>
Developer Advocate | Tabular <https://tabular.io/>
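
For reference on the incremental append scan highlight, here is a minimal Java sketch against the Iceberg API. It assumes the interface shape on master at the time of writing (Table.newIncrementalAppendScan() with fromSnapshotExclusive/toSnapshot bounds); check the current javadoc for the exact method names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.IncrementalAppendScan;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Minimal sketch: plan only the data files appended between two snapshots.
public class IncrementalAppendScanExample {

  public static void printAppendedFiles(Table table, long fromSnapshotId, long toSnapshotId) {
    IncrementalAppendScan scan = table.newIncrementalAppendScan()
        .fromSnapshotExclusive(fromSnapshotId)  // start boundary, not included
        .toSnapshot(toSnapshotId);              // end boundary, included

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path() + " (" + task.file().recordCount() + " records)");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```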
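
For the new all_files and all_delete_files metadata tables, a quick Spark (Java) sketch. The catalog and table name (demo.db.events) are hypothetical; the metadata tables are addressed by suffixing the table identifier.

```java
import org.apache.spark.sql.SparkSession;

// Sketch: querying the new metadata tables from Spark SQL.
// The table identifier demo.db.events is hypothetical.
public class AllFilesMetadataTablesExample {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-metadata-tables")
        .getOrCreate();

    // all_files: data files referenced by any snapshot of the table
    spark.sql("SELECT file_path, record_count FROM demo.db.events.all_files")
        .show(20, false);

    // all_delete_files: delete files referenced by any snapshot of the table
    spark.sql("SELECT file_path, record_count FROM demo.db.events.all_delete_files")
        .show(20, false);

    spark.stop();
  }
}
```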
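
And for the incremental-scan/CDC discussion, a hypothetical sketch of the pre-image/post-image join described in that agenda item. The changes_deleted and changes_inserted views, the id key, and the value column are assumptions for illustration, not an existing Iceberg API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of a pre-image/post-image analysis.
// Assumes two temp views already exist: `changes_deleted` (rows removed between
// two snapshots) and `changes_inserted` (rows added), both keyed by an `id`
// column and carrying a `value` column. None of these names are Iceberg APIs.
public class ChangelogJoinExample {

  public static Dataset<Row> preAndPostImages(SparkSession spark) {
    return spark.sql(
        "SELECT coalesce(d.id, i.id) AS id, "
            + "       d.value AS pre_image, "
            + "       i.value AS post_image, "
            + "       CASE WHEN d.id IS NULL THEN 'INSERT' "
            + "            WHEN i.id IS NULL THEN 'DELETE' "
            + "            ELSE 'UPDATE' END AS change_type "
            + "FROM changes_deleted d "
            + "FULL OUTER JOIN changes_inserted i ON d.id = i.id "
            // drop rows that were rewritten (e.g. by compaction) but not actually changed
            + "WHERE NOT (d.value <=> i.value)");
  }
}
```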
Always remember, anyone can join the discussion so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> google group with anyone who is seeking an invite. The notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation and it's a good place to add items as you see fit so we can discuss them in the next community sync. Meeting Recording ⭕ <https://drive.google.com/file/d/1FISvrM3eEWQuIfQnZAKbRN0Xyk2hH36_/view?usp=sharing> Top of the Meeting Highlights - Added an incremental append scan interface (Thanks, Steven!) - Backports for 0.13.2 are done (Thanks, Eduard!) - API validation using revapi was added (Thanks, Kyle!) - Added all_files and all_delete_files metadata tables (Thanks, Szehon!) Releases - 0.13.2 - All of the backports are merged - Milestone <https://github.com/apache/iceberg/milestone/18?closed=1> with merged PRs - 1.0.0 (no 0.14 release) - LICENSE updates done - API checking is done - Incremental snapshot expiration still pending - Metadata table schema guarantees - An alternative option is to release an 0.14 with a quick follow-up 1.0.0 release that removed any deprecations Agenda - Minimum supported python version changed from 3.7 to 3.8 - Proposal to change from tox to pre-commit: PR #4811 <https://github.com/apache/iceberg/pull/4811> - Change scan: PR #4870 <https://github.com/apache/iceberg/pull/4870> - Incremental Scans - Most CDC operations require excluding rows that are unchanged or join rows by ID to create a pre-image/post-image - Some difficulties around using DataSourceV2 (getting pure deleted/inserted requires shuffling) - One option is defining a view that uses incremental scans to do a pre-image/post-image analysis (View catalog has not been added to Spark yet but there’s an existing SPIP-31357 <https://issues.apache.org/jira/browse/SPARK-31357> and PR #35636 <https://github.com/apache/spark/pull/35636>) - Puffin - new name for Index and Stats file-format - Secondary index metadata as blobs of binary data - Theta sketches - Snapshot branching and tagging syntax - Option 1: `<database>.<table>.<branch_name>` - Potential conflicts with metadata table names (possibly a rare occurence) - Option 2: Qualifying prefixes such as `branch$<branch_name>` or `tag:<tag_name>`. - No standardized way in SQL to specify a tag or branch, which could delay implementation upstream - Option 3: An option/context setting - Setting options currently not possible in spark-sql - Before implementing any of this logic, let’s work through a proposal Thanks everyone! -- Sam Redai <s...@tabular.io> Developer Advocate | Tabular <https://tabular.io/>