Good morning everyone! Here are the minutes from our Iceberg Sync that took place on October 20th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.
We covered a lot of topics... here we go!

Top of the Meeting Highlights
- Sort-based compaction: This is finished, reviewed, and merged. When you compact data files, you can now also have Spark re-sort them, either by the table's sort order or by a sort order given when you create the compaction job. (A usage sketch is included below, after the Delete Compaction notes.)
- Spark build refactor: Thank you to Jack for getting us started on the Spark build refactor, and thanks to Anton for reviewing and helping get these changes in. We've gone with a variant of option 3 from our last discussions, where we include all of the Spark modules in our build but make it easy to turn them off. This way the CI can run Spark, Hive, and Flink tests separately and only when necessary.
- Delete files implementation for ORC: Thanks to Peter for adding builders to store deletes in ORC (previously we could only store deletes in Parquet or Avro). This means we now support all three formats for this feature. (A writer sketch is included below as well.)
- Flink update: We've updated Flink to 1.13, so we're back on a supported version. 1.14 is out this week, so we can aim to move to that at some point.

Iceberg 0.12.1 Upcoming Patch Release (milestone <https://github.com/apache/iceberg/milestone/15?closed=1>)
- Fix for the Parquet map projection bug
- Fix for a Flink CDC bug
- A few other fixes that we also want to get out to the community, so we're going to start a release candidate as soon as possible
- Kyle will start a thread in the general Slack channel; please feel free to mention any additional fixes you want to see in this patch release

Snapshot Releases
- Eduard will tackle adding snapshot releases
- Our deploy.gradle file is already set up to deploy to the snapshot repository
- This may require certain credentials, so we may need to reach out to the ASF infrastructure team

Iceberg 0.13.0 Upcoming Release
- There's agreement to switch to a time-based release schedule, so the next release is roughly mid-November
- Jack will cut a branch close to that time, and any features that aren't in yet will be pushed to the next release
- We agreed not to hold up releases to squeeze features in, preferring instead to aim for releasing sooner the next time

Adding 3.2 to the Spark Build Refactoring
- Russell and Anton will coordinate on dropping in a Spark 3.2 module
- We currently have 3.1 in the `spark3` module. We'll move that out into its own module and mirror what we do with the 3.2 module. (This will enable cleaning up some mixed 3.0/3.1 code.)

Merge on Read
- Anton has a number of PRs ready to queue up to contribute their internal implementation (Russell will work with him)
- This feature will allow for much lower write amplification
- The expectation is that in Spark 3.3 we can rely on Spark's built-in merge on read

Snapshot Tagging (design doc <https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit>) (PR #3104 <https://github.com/apache/iceberg/pull/3104>)
- We met on Monday and reached some conclusions on the design, so anyone who is interested please take a look
- Next steps are to add the feature to the stack; Jack already has a WIP implementation in the table metadata class

Delete Compaction (design doc <https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg>)
- Discussion happening on 10/21, 5-6pm PT, for anyone interested (meeting link <https://meet.google.com/nxx-nnvj-omx>)
- Some more discussion is needed to home in on a final design choice; there are a few options that each have their own pros and cons
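As promised above, here's a minimal sketch of invoking the new sort-based compaction through the Spark actions API. The table identifier is hypothetical, a SparkSession named `spark` is assumed to be in scope, and the exact builder methods may still shift as the API settles:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Load the Iceberg table behind a (hypothetical) catalog identifier.
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    // Compact data files and re-sort them with the table's own sort order;
    // pass a SortOrder to sort(...) to use a custom order for this job instead.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .sort()
        .execute();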
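And a rough sketch of writing a position-delete file in ORC now that the builders are in, assuming they're wired into the generic appender factory. `table` and `outputFile` (an EncryptedOutputFile) are assumed to be in scope, and the data file path is hypothetical:

    import org.apache.iceberg.DeleteFile;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.data.GenericAppenderFactory;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.deletes.PositionDeleteWriter;

    GenericAppenderFactory factory =
        new GenericAppenderFactory(table.schema(), table.spec());

    // FileFormat.ORC is the new part; Parquet and Avro already worked.
    // A null partition means the delete file targets an unpartitioned spec.
    PositionDeleteWriter<Record> writer =
        factory.newPosDeleteWriter(outputFile, FileFormat.ORC, null);

    // Mark row 42 of a (hypothetical) data file as deleted.
    writer.delete("s3://bucket/db/events/data-00001.orc", 42L);
    writer.close();

    // The resulting DeleteFile can then be committed, e.g. via a RowDelta operation.
    DeleteFile deleteFile = writer.toDeleteFile();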
The New Source Interface for Flink (FLIP-27 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>)
- Eventually everything will move to this new source interface (Kafka is already using it, and it will be the default in Flink 1.14); see the P.S. at the bottom of these notes for a sketch of its shape
- A few Iceberg PRs are out and pending review and merge (they may not make the deadline for the next release, but that's OK)

Encryption MVP
- We just had a sync on this and are currently waiting on a few updates to the design
- Flesh out how the new pushdown of encryption into ORC and Parquet will work
- We need some people to review the stream-based encryption, particularly around splittability
- A few offline discussions are currently happening, and for the interface we expect a few additional PRs separate from the main encryption MVP PR

Python Library Development
- The high-level design discussions concluded recently
- We'll delay the top-level API discussions until some of the core is implemented
- We have a collection of issues created and a handful of engineers working on them

Iceberg Docsite Refactoring
- A large refactoring is coming for the Iceberg docsite
- Versioned docs (in the future we need to decide how to represent the Python versions)
- Organized more by the persona of the visitor (data engineer, systems engineer, etc.)
- Searchable
- Expect a PR from Sam, ready for review by the end of this week or early next week

Row-Level Support in the Vectorized Reader (issue #3141 <https://github.com/apache/iceberg/issues/3141>)
- Yufei is working on this, and it's part of the effort for merge on read
- PR #3287 <https://github.com/apache/iceberg/pull/3287> covers only position deletes in Parquet
- We should have something ready to add by next week

View Spec (PR #3188 <https://github.com/apache/iceberg/pull/3188>)
- There was a discussion on whether we should store just the SQL text exactly as it was passed to the engine, or also include the parsed and analyzed plan (which includes column resolution). In theory the resolved SQL text should be very useful, but its usefulness may be limited to certain edge cases.
- The broader discussion here is: should we allow multiple dialects (Trino, Spark, etc.)?
  - This adds complexity
  - Time travel needs to be considered: what does time traveling a view mean? If the underlying table is an Iceberg table we may be able to support it, but even that would require "as of" time travel across multiple tables.
  - Time traveling schemas would also need to be added
- There was agreement that we should not try to solve everything at once but break this into smaller problems
- Let's keep an eye on upcoming engine features to see if this gets solved implicitly, and let's also refrain from over-engineering

That's it! Thanks everyone for the high level of participation, and enjoy the rest of your week!
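P.S. For anyone who hasn't worked with FLIP-27 yet, here's the general shape of the new interface using Kafka (already ported) as the model. The broker address, topic, and group id are placeholders; the pending Iceberg source PRs are expected to plug into env.fromSource(...) the same way:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Under FLIP-27, a source is a single Source object rather than a SourceFunction.
    KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")   // placeholder broker
        .setTopics("events")                  // placeholder topic
        .setGroupId("demo-group")             // placeholder consumer group
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

    // fromSource(...) is the unified entry point for the new interface;
    // an Iceberg source would be passed here in place of the Kafka one.
    DataStream<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");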