Hi Everyone,

Here are the minutes and video recording from our Iceberg Sync that took place on November 17th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.

The recording has been shared with the Iceberg Sync Google group. If you have any issues accessing it, please let me know!

Meeting Recording ⭕ <https://drive.google.com/file/d/1WEXy3VPgsLRIrjsMrHXVydmm4bbEdQBg/view?usp=sharing>

Top of the Meeting Highlights
- 0.12.1 Released!
  - Thanks to everyone who reviewed the release and thanks to Kyle for managing it!
- Spark 3.2 Progress
  - Added support for things like dynamic filtering to work with v2 sources, as well as a new interface for driving sort order through table properties. The changes here will be key for MERGE INTO support with deltas.
  - Special thanks to Anton, who’s been contributing a lot here!
- Bug fixes
  - Avro read path
  - Vectorized reader in Spark
- Delete File Compaction
  - The normal rewrite files compaction can be configured to detect too many delete files for a particular data file and compact them (Thanks Jack!)

Upcoming 0.13.0 Release
- Iceberg 0.13.0 Release Note Draft <https://docs.google.com/document/d/18yc8_Q6Hpc_r7JSoQO4oswQSHgHxJFDnr6Zif9_tceA/edit#heading=h.9jffz1lgqlib>
- We’re aiming to release often, so including pending changes in a future release is preferred over delaying a release to squeeze them in.
- Spark regressions: For the Spark 3.2 branch, some major changes were expected for dynamic filtering and all of the row-based commands, so MERGE, DELETE FROM, and UPDATE are missing in the 3.2 branch. We’re currently thinking through how to resolve this before the release, such as potentially porting them for now.
- A new 0.13.0 milestone will be created soon
- A release candidate can be expected soon, hopefully with the resequencing and Alibaba file IO changes merged in

Java and Python Catalog Consistency
- On a per-catalog implementation basis, it makes sense to keep the implementations aligned between the Java and Python clients
- For now, let’s lean on thorough documentation for each catalog type and expected behaviors, and then generally look for this consistency during PR reviews
- The REST catalog is probably the most suitable for providing a detailed catalog specification
- Trying to achieve this consistency shouldn’t hold up any of the Python development

REST based Catalog
- This provides a very flexible mechanism for creating various types of catalogs
- Beyond conforming to the REST API specification, this creates room for a lot of variability in how the transactions are implemented server-side

RemoveOrphanFilesAction
- Pull Request #1471 <https://github.com/apache/iceberg/pull/1471>
- Problem Description: Currently, when deleting orphan files, we diff the valid data files against a listing of the directories. Differences between the write configuration and the configuration used when deleting orphan files can cause some orphan files to go undetected.
- This has been discussed before and the conclusion was that we should not introduce configurations for ignoring certain components of URIs. This causes other issues, such as ignoring the authority for S3, which ignores the bucket in the URI. More complications are introduced when you consider that many tables can share a bucket/prefix.
- Follow-up: Let’s try to get a comprehensive list of different scenarios and implications

Trino Support for Merge on Read/Write
- There are some serialization concerns here that need to be addressed, and the current open PRs may get redesigned soon.
- A lot of JSON serialization is being developed as part of the REST catalog implementation, so that may solve some of the issues here.
- Ideally, serialization can be kept somewhat separate from the rest of the code base.
- Schema evolution implications need to be considered here as well.

Thanks everyone!