Added as an item to the sync doc, thank you! -Sam
On Thu, Oct 21, 2021 at 11:35 PM OpenInx <open...@gmail.com> wrote:

Thanks for the detailed report!

One more thing: we have now made a lot of progress integrating Alibaba Cloud (https://www.aliyun.com/). Please see https://github.com/apache/iceberg/projects/21 (thanks @xingbowu - https://github.com/xingbowu).

On Thu, Oct 21, 2021 at 11:30 PM Sam Redai <s...@tabular.io> wrote:

Good Morning Everyone,

Here are the minutes from our Iceberg Sync that took place on October 20th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.

We covered a lot of topics... here we go!

Top of the Meeting Highlights

- Sort-based compaction: This is finished, reviewed, and merged. When you compact data files, you can now also have Spark re-sort them, either by the table's sort order or by a sort order given when you create the compaction job.

- Spark build refactor: Thank you to Jack for getting us started on the Spark build refactor, and thanks to Anton for reviewing and helping get these changes in. We've gone with a variant of option 3 from our last discussions, where we include all of the Spark modules in our build but make it easy to turn them off. This way the CI can run Spark, Hive, and Flink tests separately and only when necessary.

- Delete files implementation for ORC: Thanks to Peter for adding builders to store deletes in ORC (previously we could only store deletes in Parquet or Avro). This means we now support all three formats for this feature.
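As a toy sketch of what sort-based compaction does conceptually (plain Python, not Iceberg's actual API): many small data files are merged into fewer right-sized files, and the rows are re-sorted during the rewrite.

```python
# Toy model of sort-based compaction (illustration only, not Iceberg's API).
# Several small "data files" (lists of rows) are merged into larger files,
# and the rows are re-sorted by the table's sort order during the rewrite.

def compact_and_sort(data_files, sort_key, target_rows_per_file):
    """Merge small files, sort all rows, and split into right-sized files."""
    all_rows = [row for f in data_files for row in f]
    all_rows.sort(key=sort_key)  # the re-sort step added by sort-based compaction
    return [
        all_rows[i:i + target_rows_per_file]
        for i in range(0, len(all_rows), target_rows_per_file)
    ]

small_files = [[(3, "c"), (1, "a")], [(2, "b")], [(5, "e"), (4, "d")]]
compacted = compact_and_sort(small_files, sort_key=lambda r: r[0],
                             target_rows_per_file=3)
# compacted == [[(1, "a"), (2, "b"), (3, "c")], [(4, "d"), (5, "e")]]
```

In practice this is exposed through Iceberg's Spark compaction actions; see the Iceberg Spark documentation for the exact invocation.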
- Flink update: We've updated Flink to 1.13, so we're back on a supported version. 1.14 is out this week, so we can aim to move to that at some point.

Iceberg 0.12.1 Upcoming Patch Release (milestone <https://github.com/apache/iceberg/milestone/15?closed=1>)

- Fix for the Parquet map projection bug
- Fix for the Flink CDC bug
- There are a few other fixes that we also want to get out to the community, so we're going to start a release candidate as soon as possible
- Kyle will start a thread in the general Slack channel, so please feel free to mention any additional fixes that you want to see in this patch release

Snapshot Releases

- Eduard will tackle adding snapshot releases
- Our deploy.gradle file is set up to deploy to the snapshot repository
- This may require certain credentials, so it may be necessary to reach out to the ASF infrastructure team

Iceberg 0.13.0 Upcoming Release

- There's agreement to switch to a time-based release schedule, so the next release is roughly mid-November
- Jack will cut a branch close to that time, and any features that aren't in yet will be pushed to the next release
- We agree not to hold up releases to squeeze features in, preferring instead to aim for releasing sooner the next time

Adding 3.2 to the Spark Build Refactoring

- Russell and Anton will coordinate on dropping in a Spark 3.2 module
- We currently have 3.1 in the `spark3` module. We'll move that out to its own module and mirror what we do with the 3.2 module. (This will enable cleaning up some mixed 3.0/3.1 code.)

Merge on Read

- Anton has a bunch of PRs ready to queue up to contribute their internal implementation.
  (Russell will work with him)
- This feature will allow for much lower write amplification
- The expectation is that in Spark 3.3 we can rely on Spark's internal merge on read

Snapshot Tagging (design doc <https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit>) (PR #3104 <https://github.com/apache/iceberg/pull/3104>)

- We had a meeting on Monday about this and reached some conclusions on the design, so anyone who is interested please take a look
- Next steps are to add the feature to the stack; Jack already has a WIP implementation in the table metadata class

Delete Compaction (design doc <https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg>)

- Discussion happening on 10/21, 5-6pm PT, for anyone interested (meeting link <https://meet.google.com/nxx-nnvj-omx>)
- Some more discussion is needed to home in on a final design choice. There are a few options that each have their own pros and cons.
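For context on why merge-on-read lowers write amplification: instead of rewriting whole data files on every delete, the writer emits small position delete files, and readers filter out the deleted rows at scan time. A minimal sketch in plain Python (not Iceberg's actual reader; the file and tuple layout here is simplified for illustration):

```python
# Toy merge-on-read: position deletes are (file_path, row_position) pairs
# written as separate small files; the reader skips those rows while scanning,
# so no data file needs to be rewritten at delete time.

def read_with_deletes(data_files, position_deletes):
    """Yield live rows, skipping positions marked deleted for each file."""
    deleted = set(position_deletes)  # {(file_path, pos), ...}
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in deleted:
                yield row

data_files = {
    "f1.parquet": ["a", "b", "c"],
    "f2.parquet": ["d", "e"],
}
# Delete row 1 of f1 and row 0 of f2 without touching the data files.
deletes = [("f1.parquet", 1), ("f2.parquet", 0)]
live = list(read_with_deletes(data_files, deletes))
# live == ["a", "c", "e"]
```

Delete compaction, discussed above, then periodically merges these delete files back into the data files so that scans don't accumulate unbounded filtering work.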
The New Source Interface for Flink (FLIP-27 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>)

- Eventually everything will move to this new source interface (Kafka is already using it, and it will be the default in Flink 1.14)
- A few PRs for Iceberg are out there, pending review and merge (they may not make the deadline for the next release, but that's ok)

Encryption MVP

- We just had a sync on this and are currently waiting on a few updates to the design:
  - Flesh out how the new pushdown encryption into ORC and Parquet will work
  - Need some people to review the stream-based encryption, particularly around splittability
- A few offline discussions are currently happening, and for the interface we are expecting a few additional PRs separate from the main encryption MVP PR

Python Library Development

- The high-level design discussions have recently concluded
- We'll delay the top-level API discussions until some of the core is implemented
- We have a collection of issues created and a handful of engineers working on them

Iceberg Docsite Refactoring

- A large refactoring is coming for the Iceberg docsite:
- Versioned docs (in the future we need to decide how to represent the Python versions)
- Organized more by the persona of the visitor (Data Engineer, Systems Engineer, etc.)
- Searchable
- Expect a PR from Sam, ready for review by the end of this week or early next week

Row-Level Support in the Vectorized Reader (PR #3141 <https://github.com/apache/iceberg/issues/3141>)

- Yufei is working on this; it's part of the effort for merge on read
- PR #3287 <https://github.com/apache/iceberg/pull/3287> is only for the position delete in Parquet
- We should have something ready to add by next week

View Spec (PR #3188 <https://github.com/apache/iceberg/pull/3188>)

- There was a discussion on whether we should store just the SQL text exactly as it was passed to the engine, or whether we should also include the parsed and analyzed plan (which includes column resolution). In theory the resolved SQL text should be very useful, but its usefulness may be limited to certain edge cases.
- The broader discussion here is: should we allow multiple dialects (Trino, Spark, etc.)?
  - Adds complexity
  - Time travel needs to be considered. What does time traveling a view mean? If the underlying table is an Iceberg table we may be able to, but even that would require "as of" time travel to allow time travel across multiple tables.
  - Time traveling schemas needs to be added
- Agreement that we should not try to solve everything at once, but break this into smaller problems
- Let's keep an eye on upcoming engine features to see if this will be implicitly solved, and let's also refrain from over-engineering this

That's it! Thanks everyone for the high level of participation, and enjoy the rest of your week!