Hi Everyone, Here are the minutes and video recording from our Iceberg Sync that took place on December 8th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.
Meeting Recording ⭕ <https://drive.google.com/file/d/1cLg8bc1JTslalYpvd5AF3OYN7U_ixO_x/view>

Top of the Meeting Highlights
- Flink 1.14 Support: This was recently merged. It follows the same mechanism we use for the multiple Spark versions: the Flink 1.13 Iceberg runtime was copied over and updated for 1.14 support.
- Rewrite Data Files Stored Procedure: The rewrite data files procedure can now be called from Spark SQL, including with filters. (Thanks Ajantha!) A call sketch follows these notes.
- OSSFileIO: An OSSFileIO implementation has been added, which is very helpful for the many users in China on Aliyun OSS storage. A configuration sketch follows these notes.
- Resequencing Data Files during Rewrites: This allows equality deletes to keep flowing into a table while rewrites run without conflicts. (Thanks Jack!)

Upcoming 0.13.0 Release
- Iceberg 0.13.0 Release Note Draft <https://docs.google.com/document/d/18yc8_Q6Hpc_r7JSoQO4oswQSHgHxJFDnr6Zif9_tceA/edit#heading=h.9jffz1lgqlib>
- The current biggest blocker is the Spark 3.2 feature regressions.
- The plan is to release with just copy-on-write and follow up by addressing any 3.2 regressions afterward. This avoids holding up the release, and an out-of-cycle release can always be done soon after.

Flink State Issue
- A field was added to the middle of DataFile so that it could be exposed in the files metadata table. The resulting mismatch between the read and write schemas causes a problem in Flink, which assumes the two schemas are identical. This prevents recovering from older checkpoints, so it should be fixed as soon as possible. (Added to the 0.13 milestone.)

OSS Runtime Bundle
- Runtime bundles require a large amount of engineering bandwidth for very specific packaging details, particularly around licensing. It makes sense to do this for the Spark and Flink runtimes, but if vendors need other specific bundles, the recommended direction is for them to build and maintain their own runtime bundles.

HadoopCatalog
- For filesystem-only tables to be safe, some kind of lock implementation will be needed. This can be facilitated by a lock manager API; a sketch follows these notes.
- A disclaimer should be included to discourage using this catalog implementation for production workloads.

Iceberg Event Notification Support
- Start supporting event notifications for Iceberg.
- This can facilitate hooking up to SNS, SQS, CloudWatch, etc.
- Should event listeners be configured at the catalog level or the table level?
- The catalog level seems like the better path forward (something similar to how FileIO is plugged in); a hypothetical sketch follows these notes. There may exist use cases out there for table-level configuration, but let's wait for those to appear.

REST catalog spec: Multipart namespaces
- The Iceberg catalog allows multipart namespaces, originally to support Hadoop tables, where a directory can serve as a namespace part.
- Not all engines support nested catalog names.
- How should we represent this in REST paths (slashes vs. dots to separate namespace parts)?
- Dots between namespace parts seem to be the path forward.
- Dots are easy to encode in the name.
- In theory there could be collisions: "a.b" could be "b nested within a" or a single name that contains a dot ("a.b"). However, Iceberg already guards against this, so it shouldn't be an issue. An encoding sketch follows these notes.
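Sketch: Rewrite Data Files from Spark SQL
A minimal sketch of calling the procedure from Spark SQL, assuming the Iceberg Spark runtime is on the classpath and a catalog named "demo" with a table "db.sample" is configured on the session (both names are placeholders):

    import org.apache.spark.sql.SparkSession;

    public class RewriteDataFilesExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("rewrite-data-files-example")
            .getOrCreate();

        // Compact the whole table using the stored procedure.
        spark.sql("CALL demo.system.rewrite_data_files(table => 'db.sample')").show();

        // Rewrite only the files matching a filter, via the `where` argument.
        spark.sql("CALL demo.system.rewrite_data_files("
            + "table => 'db.sample', where => 'id > 100')").show();
      }
    }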
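Sketch: Configuring OSSFileIO
A minimal sketch of routing a Spark catalog's file IO through the new implementation, assuming the iceberg-aliyun module and the OSS SDK are on the classpath. The catalog name, warehouse bucket, and the omitted OSS client credential/endpoint settings are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    public class OssFileIOExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            // "oss_catalog" is a placeholder name; a Hive metastore backs the catalog here.
            .set("spark.sql.catalog.oss_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .set("spark.sql.catalog.oss_catalog.type", "hive")
            .set("spark.sql.catalog.oss_catalog.warehouse", "oss://my-bucket/warehouse")
            // Route data file reads/writes through the Aliyun OSS FileIO implementation.
            .set("spark.sql.catalog.oss_catalog.io-impl", "org.apache.iceberg.aliyun.oss.OSSFileIO");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        spark.sql("SHOW NAMESPACES IN oss_catalog").show();
      }
    }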
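Sketch: A Lock Manager API for HadoopCatalog
To make the lock manager idea concrete, here is a rough sketch of what such an API might look like, with a single-process in-memory implementation. The interface shape is an illustration, not the final API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative lock manager API for filesystem-only tables.
    interface LockManager extends AutoCloseable {
      // Try to acquire the lock on an entity (e.g. a table path) for an owner; true on success.
      boolean acquire(String entityId, String ownerId);

      // Release the lock if held by this owner; true on success.
      boolean release(String entityId, String ownerId);

      void initialize(Map<String, String> properties);
    }

    // Single-JVM example; a real implementation would use an external service
    // (e.g. DynamoDB) so locks are visible across processes.
    public class InMemoryLockManager implements LockManager {
      private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

      @Override
      public boolean acquire(String entityId, String ownerId) {
        // putIfAbsent returns null when no one held the lock; re-acquiring by the
        // same owner also succeeds.
        String holder = locks.putIfAbsent(entityId, ownerId);
        return holder == null || holder.equals(ownerId);
      }

      @Override
      public boolean release(String entityId, String ownerId) {
        return locks.remove(entityId, ownerId);
      }

      @Override
      public void initialize(Map<String, String> properties) {}

      @Override
      public void close() {
        locks.clear();
      }
    }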
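Sketch: Catalog-Level Event Listeners
To picture the catalog-level option, here is a hypothetical sketch of a listener configured once on the catalog (analogous to how FileIO is plugged in) and notified for every table. All names here (CommitEvent, EventListener, ListeningCatalog) are made up for illustration and are not part of Iceberg:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical event carrying the minimum context a notification needs.
    final class CommitEvent {
      final String tableName;
      final long snapshotId;

      CommitEvent(String tableName, long snapshotId) {
        this.tableName = tableName;
        this.snapshotId = snapshotId;
      }
    }

    // An SNS/SQS/CloudWatch-backed implementation would publish the event here.
    interface EventListener {
      void notify(CommitEvent event);
    }

    // Catalog-level wiring: listeners are registered once on the catalog,
    // e.g. loaded from a catalog property, rather than per table.
    class ListeningCatalog {
      private final List<EventListener> listeners = new CopyOnWriteArrayList<>();

      void addListener(EventListener listener) {
        listeners.add(listener);
      }

      void onCommit(String tableName, long snapshotId) {
        CommitEvent event = new CommitEvent(tableName, snapshotId);
        for (EventListener listener : listeners) {
          listener.notify(event);
        }
      }
    }

    public class EventListenerSketch {
      public static void main(String[] args) {
        ListeningCatalog catalog = new ListeningCatalog();
        // A console listener standing in for an external publisher.
        catalog.addListener(event ->
            System.out.println("commit on " + event.tableName + " at snapshot " + event.snapshotId));
        catalog.onCommit("db.sample", 42L);
      }
    }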
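Sketch: Dot-Separated Namespaces in REST Paths
To illustrate the dot-separated encoding under discussion, here is a small sketch that joins namespace parts with dots and percent-encodes the result as one REST path segment. The "/v1/namespaces/..." layout is a placeholder, not the final spec:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class NamespacePathExample {
      // Join multipart namespace levels with dots and percent-encode the result
      // so it is safe to embed as a single REST path segment.
      static String namespacePath(List<String> levels) {
        String joined = String.join(".", levels);
        return "/v1/namespaces/" + URLEncoder.encode(joined, StandardCharsets.UTF_8);
      }

      public static void main(String[] args) {
        // "b nested within a" -> /v1/namespaces/a.b
        System.out.println(namespacePath(List.of("a", "b")));
        // A single level that itself contains a dot encodes to the same path,
        // which is the theoretical collision noted above; Iceberg already
        // guards against such names, so in practice it shouldn't occur.
        System.out.println(namespacePath(List.of("a.b")));
      }
    }

Thank you!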