Good morning everyone! Here are the minutes from our Iceberg Sync that took place on October 20th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.
We covered a lot of topics... here we go!

Top of the Meeting Highlights
- Sort-based compaction: This is finished, reviewed, and merged. When you compact data files, you can now also have Spark re-sort them, either by the table's sort order or by a sort order given when you create the compaction job. (A usage sketch is included below, after the Delete Compaction notes.)
- Spark build refactor: Thank you to Jack for getting us started on the Spark build refactor, and thanks to Anton for reviewing and helping get these changes in. We've gone with a variant of option 3 from our last discussions, where we include all of the Spark modules in our build but make it easy to turn them off. This way the CI can run Spark, Hive, and Flink tests separately and only when necessary.
- Delete files implementation for ORC: Thanks to Peter for adding builders to store deletes in ORC (previously we could only store deletes in Parquet or Avro). This means we now support all three formats for this feature. (A writer sketch is included below as well.)
- Flink update: We've updated Flink to 1.13, so we're back on a supported version. 1.14 is out this week, so we can aim to move to that at some point.

Iceberg 0.12.1 Upcoming Patch Release (milestone <https://github.com/apache/iceberg/milestone/15?closed=1>)
- Fix for the Parquet map projection bug
- Fix for a Flink CDC bug
- A few other fixes that we also want to get out to the community, so we're going to start a release candidate as soon as possible
- Kyle will start a thread in the general Slack channel; please feel free to mention any additional fixes you want to see in this patch release

Snapshot Releases
- Eduard will tackle adding snapshot releases
- Our deploy.gradle file is already set up to deploy to the snapshot repository
- This may require certain credentials, so we may need to reach out to the ASF infrastructure team

Iceberg 0.13.0 Upcoming Release
- There's agreement to switch to a time-based release schedule, so the next release is roughly mid-November
- Jack will cut a branch close to that time, and any features that aren't in yet will be pushed to the next release
- We agreed not to hold up releases to squeeze features in, preferring instead to aim for releasing sooner the next time

Adding 3.2 to the Spark Build Refactoring
- Russell and Anton will coordinate on dropping in a Spark 3.2 module
- We currently have 3.1 in the `spark3` module. We'll move that out into its own module and mirror what we do with the 3.2 module. (This will enable cleaning up some mixed 3.0/3.1 code.)

Merge on Read
- Anton has a number of PRs ready to queue up to contribute their internal implementation (Russell will work with him)
- This feature will allow for much lower write amplification
- The expectation is that in Spark 3.3 we can rely on Spark's built-in merge on read

Snapshot Tagging (design doc <https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit>) (PR #3104 <https://github.com/apache/iceberg/pull/3104>)
- We met on Monday and reached some conclusions on the design, so anyone who is interested please take a look
- Next steps are to add the feature to the stack; Jack already has a WIP implementation in the table metadata class

Delete Compaction (design doc <https://docs.google.com/document/d/1-EyKSfwd_W9iI5jrzAvomVw3w1mb_kayVNT7f2I-SUg>)
- Discussion happening on 10/21, 5-6pm PT, for anyone interested (meeting link <https://meet.google.com/nxx-nnvj-omx>)
- Some more discussion is needed to home in on a final design choice; there are a few options that each have their own pros and cons
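As promised above, here's a minimal sketch of invoking the new sort-based compaction through the Spark actions API. The table identifier is hypothetical, a SparkSession named `spark` is assumed to be in scope, and the exact builder methods may still shift as the API settles:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Load the Iceberg table behind a (hypothetical) catalog identifier.
    Table table = Spark3Util.loadIcebergTable(spark, "db.events");

    // Compact data files and re-sort them with the table's own sort order;
    // pass a SortOrder to sort(...) to use a custom order for this job instead.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .sort()
        .execute();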
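And a rough sketch of writing a position-delete file in ORC now that the builders are in, assuming they're wired into the generic appender factory. `table` and `outputFile` (an EncryptedOutputFile) are assumed to be in scope, and the data file path is hypothetical:

    import org.apache.iceberg.DeleteFile;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.data.GenericAppenderFactory;
    import org.apache.iceberg.data.Record;
    import org.apache.iceberg.deletes.PositionDeleteWriter;

    GenericAppenderFactory factory =
        new GenericAppenderFactory(table.schema(), table.spec());

    // FileFormat.ORC is the new part; Parquet and Avro already worked.
    // A null partition means the delete file targets an unpartitioned spec.
    PositionDeleteWriter<Record> writer =
        factory.newPosDeleteWriter(outputFile, FileFormat.ORC, null);

    // Mark row 42 of a (hypothetical) data file as deleted.
    writer.delete("s3://bucket/db/events/data-00001.orc", 42L);
    writer.close();

    // The resulting DeleteFile can then be committed, e.g. via a RowDelta operation.
    DeleteFile deleteFile = writer.toDeleteFile();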
The New Source Interface for Flink (FLIP-27 <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>)
- Eventually everything will move to this new source interface (Kafka is already using it, and it will be the default in Flink 1.14); see the P.S. at the bottom of these notes for a sketch of its shape
- A few Iceberg PRs are out and pending review and merge (they may not make the deadline for the next release, but that's OK)

Encryption MVP
- We just had a sync on this and are currently waiting on a few updates to the design
- Flesh out how the new pushdown of encryption into ORC and Parquet will work
- We need some people to review the stream-based encryption, particularly around splittability
- A few offline discussions are currently happening, and for the interface we expect a few additional PRs separate from the main encryption MVP PR

Python Library Development
- The high-level design discussions concluded recently
- We'll delay the top-level API discussions until some of the core is implemented
- We have a collection of issues created and a handful of engineers working on them

Iceberg Docsite Refactoring
- A large refactoring is coming for the Iceberg docsite
- Versioned docs (in the future we need to decide how to represent the Python versions)
- Organized more by the persona of the visitor (data engineer, systems engineer, etc.)
- Searchable
- Expect a PR from Sam, ready for review by the end of this week or early next week

Row-Level Support in the Vectorized Reader (issue #3141 <https://github.com/apache/iceberg/issues/3141>)
- Yufei is working on this, and it's part of the effort for merge on read
- PR #3287 <https://github.com/apache/iceberg/pull/3287> covers only position deletes in Parquet
- We should have something ready to add by next week

View Spec (PR #3188 <https://github.com/apache/iceberg/pull/3188>)
- There was a discussion on whether we should store just the SQL text exactly as it was passed to the engine, or also include the parsed and analyzed plan (which includes column resolution). In theory the resolved SQL text should be very useful, but its usefulness may be limited to certain edge cases.
- The broader discussion here is: should we allow multiple dialects (Trino, Spark, etc.)?
  - This adds complexity
  - Time travel needs to be considered: what does time traveling a view mean? If the underlying table is an Iceberg table we may be able to support it, but even that would require "as of" time travel across multiple tables.
  - Time traveling schemas would also need to be added
- There was agreement that we should not try to solve everything at once but break this into smaller problems
- Let's keep an eye on upcoming engine features to see if this gets solved implicitly, and let's also refrain from over-engineering

That's it! Thanks everyone for the high level of participation, and enjoy the rest of your week!
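P.S. For anyone who hasn't worked with FLIP-27 yet, here's the general shape of the new interface using Kafka (already ported) as the model. The broker address, topic, and group id are placeholders; the pending Iceberg source PRs are expected to plug into env.fromSource(...) the same way:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Under FLIP-27, a source is a single Source object rather than a SourceFunction.
    KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")   // placeholder broker
        .setTopics("events")                  // placeholder topic
        .setGroupId("demo-group")             // placeholder consumer group
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

    // fromSource(...) is the unified entry point for the new interface;
    // an Iceberg source would be passed here in place of the Kafka one.
    DataStream<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");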