Hey Iceberg Community,

Here are the minutes and recording from our Iceberg Sync that took place today, *May 25th, 9am-10am PT*.

Always remember, anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. The notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>, which is also attached to the meeting invitation. It's a good place to add items as you see fit so we can discuss them in the next community sync.

Meeting Recording ⭕ <https://drive.google.com/file/d/1FISvrM3eEWQuIfQnZAKbRN0Xyk2hH36_/view?usp=sharing>

Top of the Meeting Highlights
- Added an incremental append scan interface (Thanks, Steven!); a usage sketch is appended after these notes
- Backports for 0.13.2 are done (Thanks, Eduard!)
- API validation using revapi was added (Thanks, Kyle!)
- Added all_files and all_delete_files metadata tables (Thanks, Szehon!); a query sketch is appended after these notes

Releases
- 0.13.2
  - All of the backports are merged
  - Milestone <https://github.com/apache/iceberg/milestone/18?closed=1> with merged PRs
- 1.0.0 (no 0.14 release)
  - LICENSE updates are done
  - API checking is done
  - Incremental snapshot expiration is still pending
  - Metadata table schema guarantees
  - An alternative option is to release a 0.14 with a quick follow-up 1.0.0 release that removes any deprecations

Agenda
- Minimum supported Python version changed from 3.7 to 3.8
  - Proposal to change from tox to pre-commit: PR #4811 <https://github.com/apache/iceberg/pull/4811>
  - Change scan: PR #4870 <https://github.com/apache/iceberg/pull/4870>
- Incremental Scans
  - Most CDC operations require excluding rows that are unchanged, or joining rows by ID to create a pre-image/post-image; a join sketch is appended after these notes
  - Some difficulties around using DataSourceV2 (getting pure deletes/inserts requires shuffling)
  - One option is defining a view that uses incremental scans to do a pre-image/post-image analysis (a view catalog has not been added to Spark yet, but there's an existing SPIP, SPARK-31357 <https://issues.apache.org/jira/browse/SPARK-31357>, and PR #35636 <https://github.com/apache/spark/pull/35636>)
- Puffin: new name for the index and stats file format
  - Secondary index metadata stored as blobs of binary data
  - Theta sketches
- Snapshot branching and tagging syntax
  - Option 1: `<database>.<table>.<branch_name>`
    - Potential conflicts with metadata table names (possibly a rare occurrence)
  - Option 2: Qualifying prefixes such as `branch$<branch_name>` or `tag:<tag_name>`
    - No standardized way in SQL to specify a tag or branch, which could delay implementation upstream
  - Option 3: An option/context setting
    - Setting options is currently not possible in spark-sql
  - Before implementing any of this logic, let's work through a proposal

Thanks everyone!

--
Sam Redai <s...@tabular.io>
Developer Advocate | Tabular <https://tabular.io/>
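
For reference on the incremental append scan highlight, here is a minimal Java sketch against the Iceberg API. It assumes the interface shape on master at the time of writing (Table.newIncrementalAppendScan() with fromSnapshotExclusive/toSnapshot bounds); check the current javadoc for the exact method names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.IncrementalAppendScan;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Minimal sketch: plan only the data files appended between two snapshots.
public class IncrementalAppendScanExample {

  public static void printAppendedFiles(Table table, long fromSnapshotId, long toSnapshotId) {
    IncrementalAppendScan scan = table.newIncrementalAppendScan()
        .fromSnapshotExclusive(fromSnapshotId)  // start boundary, not included
        .toSnapshot(toSnapshotId);              // end boundary, included

    try (CloseableIterable<FileScanTask> tasks = scan.planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path() + " (" + task.file().recordCount() + " records)");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```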
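
For the new all_files and all_delete_files metadata tables, a quick Spark (Java) sketch. The catalog and table name (demo.db.events) are hypothetical; the metadata tables are addressed by suffixing the table identifier.

```java
import org.apache.spark.sql.SparkSession;

// Sketch: querying the new metadata tables from Spark SQL.
// The table identifier demo.db.events is hypothetical.
public class AllFilesMetadataTablesExample {

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-metadata-tables")
        .getOrCreate();

    // all_files: data files referenced by any snapshot of the table
    spark.sql("SELECT file_path, record_count FROM demo.db.events.all_files")
        .show(20, false);

    // all_delete_files: delete files referenced by any snapshot of the table
    spark.sql("SELECT file_path, record_count FROM demo.db.events.all_delete_files")
        .show(20, false);

    spark.stop();
  }
}
```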
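
And for the incremental-scan/CDC discussion, a hypothetical sketch of the pre-image/post-image join described in that agenda item. The changes_deleted and changes_inserted views, the id key, and the value column are assumptions for illustration, not an existing Iceberg API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of a pre-image/post-image analysis.
// Assumes two temp views already exist: `changes_deleted` (rows removed between
// two snapshots) and `changes_inserted` (rows added), both keyed by an `id`
// column and carrying a `value` column. None of these names are Iceberg APIs.
public class ChangelogJoinExample {

  public static Dataset<Row> preAndPostImages(SparkSession spark) {
    return spark.sql(
        "SELECT coalesce(d.id, i.id) AS id, "
            + "       d.value AS pre_image, "
            + "       i.value AS post_image, "
            + "       CASE WHEN d.id IS NULL THEN 'INSERT' "
            + "            WHEN i.id IS NULL THEN 'DELETE' "
            + "            ELSE 'UPDATE' END AS change_type "
            + "FROM changes_deleted d "
            + "FULL OUTER JOIN changes_inserted i ON d.id = i.id "
            // drop rows that were rewritten (e.g. by compaction) but not actually changed
            + "WHERE NOT (d.value <=> i.value)");
  }
}
```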
Always remember, anyone can join the discussion so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> google group with anyone who is seeking an invite. The notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation and it's a good place to add items as you see fit so we can discuss them in the next community sync. Meeting Recording ⭕ <https://drive.google.com/file/d/1FISvrM3eEWQuIfQnZAKbRN0Xyk2hH36_/view?usp=sharing> Top of the Meeting Highlights - Added an incremental append scan interface (Thanks, Steven!) - Backports for 0.13.2 are done (Thanks, Eduard!) - API validation using revapi was added (Thanks, Kyle!) - Added all_files and all_delete_files metadata tables (Thanks, Szehon!) Releases - 0.13.2 - All of the backports are merged - Milestone <https://github.com/apache/iceberg/milestone/18?closed=1> with merged PRs - 1.0.0 (no 0.14 release) - LICENSE updates done - API checking is done - Incremental snapshot expiration still pending - Metadata table schema guarantees - An alternative option is to release an 0.14 with a quick follow-up 1.0.0 release that removed any deprecations Agenda - Minimum supported python version changed from 3.7 to 3.8 - Proposal to change from tox to pre-commit: PR #4811 <https://github.com/apache/iceberg/pull/4811> - Change scan: PR #4870 <https://github.com/apache/iceberg/pull/4870> - Incremental Scans - Most CDC operations require excluding rows that are unchanged or join rows by ID to create a pre-image/post-image - Some difficulties around using DataSourceV2 (getting pure deleted/inserted requires shuffling) - One option is defining a view that uses incremental scans to do a pre-image/post-image analysis (View catalog has not been added to Spark yet but there’s an existing SPIP-31357 <https://issues.apache.org/jira/browse/SPARK-31357> and PR #35636 <https://github.com/apache/spark/pull/35636>) - Puffin - new name for Index and Stats file-format - Secondary index metadata as blobs of binary data - Theta sketches - Snapshot branching and tagging syntax - Option 1: `<database>.<table>.<branch_name>` - Potential conflicts with metadata table names (possibly a rare occurence) - Option 2: Qualifying prefixes such as `branch$<branch_name>` or `tag:<tag_name>`. - No standardized way in SQL to specify a tag or branch, which could delay implementation upstream - Option 3: An option/context setting - Setting options currently not possible in spark-sql - Before implementing any of this logic, let’s work through a proposal Thanks everyone! -- Sam Redai <s...@tabular.io> Developer Advocate | Tabular <https://tabular.io/>