Hey Iceberg Community,

Here are the minutes and recording from our Iceberg Sync that took place today, *April 13th, 9am-10am PT*.
As always, anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone seeking an invite. The notes and agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>, which is also attached to the meeting invitation; it's a good place to add items as you see fit so we can discuss them in the next community sync.

Meeting Recording ⭕ <https://drive.google.com/file/d/1SvJEckioxSITAOGqhNXB2vr_dUkjRrNx/view>

Top of the Meeting Highlights

- View Spec: The view spec has been committed and updated, so we're ready to start implementing it. (Great addition to the format; thanks Anjali and the team at Netflix!)
- Encryption: The basics for encryption have been committed. (Thanks Gideon! Also thanks to Jack and Russell for the reviews!)
- Python: A lot of progress on the client: literals, schemas, bucketing, and single-value serde have all been added. (Thanks Steve, Sam, and Jun!)
- V2 Tables in Trino: Trino can now read V2 tables! There is a lot of great activity in the Trino community around Iceberg support.
- Default Values: LinkedIn is leading the conversation around adding support for default values, where you can set a default value for a column in a table. Please weigh in on the issue and spec proposal; since this is a breaking change, it will go into the v3 spec.

Releases

- 0.13.2 patch release
  - There's a single blocker here: the Flink UPSERT bug. This is currently being worked on and needs a bit more digging.
- 0.14.0 status update
  - Runtime jar license updates for the new Apache HTTP client library
  - Drop support for Flink 1.12 (PR <https://github.com/apache/iceberg/pull/4551>)
  - Snapshot expiration with branches and tags.
  - Amogh is working on this and has a few PRs in flight, but is generally very close.

Snapshot Expiration w/ Branching + Tagging

- Currently, there are two forms of expiration:
  - Reachability Comparison: compares file reachability trees between two snapshots
  - Original Expiration: assumes a linear history of the table, so as you remove old snapshots, any data file deleted in those snapshots can be removed
- With branching, there is no visibility into what can actually be removed during snapshot expiration without knowledge of the entire metadata tree (i.e., reachability comparison).
- Consensus: for the initial PR, raise an error when expiring snapshots on a table with multiple branches. We can then follow up with the algorithm to perform snapshot expiration correctly.

Docs Site Navigation Flow Proposal (proposal diagram <https://drive.google.com/file/d/1ar7MaHkHkOwjTbFxGYTGB_ER8tWxnn99/view?usp=sharing> and worksheet <https://docs.google.com/document/d/1Y_PRv6p5oJaxg_68AUia_JHw8P4-AZIu3hP5IH2Cpsw/edit>)

- This is a proposal for reorganizing the docs site.
- The diagram represents a navigation flow that should capture various personas.
- Please share any feedback on the diagram or any comments from your experience using the Iceberg docs.

Default Values

- The target is SQL behavior for default values. This means changing a default value does not change any actual data.
- Adding a new column with a default value makes the value appear for all existing records (without actually rewriting previous files).
- New writes after the addition of a column with a default value should write the actual value to the file if a value is not provided in the INSERT.
- Two kinds of defaults are emerging in the spec:
  - Initial Default: the default used for existing data files written before the column existed
  - Write Default: a secondary default that's used when writes occur without a value provided
- Initial defaults will be set at table creation, and at the same time the value will be set for the write default.
- The initial default cannot be changed, but write defaults can be.
- There are a lot of discussions currently happening around the behavior here, so please chime in on the spec PR #4301 <https://github.com/apache/iceberg/pull/4301>.

Stats and Index File (format <https://github.com/apache/iceberg-docs/pull/69> and implementation <https://github.com/apache/iceberg/pull/4537>)

- Stats
  - For this case, we want to keep track of column-level sketches for NDVs (numbers of distinct values).
  - The current proposal is to use the Apache DataSketches Theta Sketch <https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html>.
    - This lets us estimate the number of distinct values in a column given a fairly limited-size buffer.
    - Apache DataSketches also has a C++ core library in addition to Java.
  - The current spec is a footer-based file format for keeping track of sketches (KBs to a couple of MBs in size).
  - This is a backward- and forward-compatible change since it's informational.
  - Please check out the proposal and the design doc by the Trino community and leave any comments/feedback.

Thanks everyone!
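P.S. For anyone curious how a Theta-style sketch can estimate distinct counts from a small fixed-size buffer, here's a toy k-minimum-values (KMV) sketch in Python. This is only the intuition behind Theta sketches, not the Apache DataSketches API and not the proposed Iceberg file format; the function name and parameters are illustrative.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Toy k-minimum-values (KMV) estimator: hash each value to a
    pseudo-uniform number in [0, 1), keep only the k smallest hashes,
    and infer the distinct count from how close the k-th smallest
    hash sits to zero. Illustrative only, not the DataSketches API."""
    max_hash = float(2 ** 64)
    smallest = set()
    for v in values:
        # Identical inputs hash identically, so duplicates collapse.
        h = int.from_bytes(
            hashlib.sha256(str(v).encode()).digest()[:8], "big"
        ) / max_hash
        smallest.add(h)
        if len(smallest) > k:
            # Keep only the k smallest hashes: the fixed-size buffer.
            smallest.remove(max(smallest))
    if len(smallest) < k:
        # Fewer than k distinct hashes were ever seen: count is exact.
        return len(smallest)
    # The k-th smallest of n uniform draws lands near k / n,
    # so n is approximately (k - 1) / (k-th smallest hash).
    return int(round((k - 1) / max(smallest)))
```

With k retained hashes the relative error is roughly 1/sqrt(k), which is why a buffer of a few KB per column is enough to track NDVs, and why the sketch file in the proposal stays in the KB-to-MB range.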