Meeting Minutes from 11/17 Iceberg Sync

Sam Redai Thu, 18 Nov 2021 16:23:38 -0800

Hi Everyone,

Here are the minutes and video recording from our Iceberg Sync that took
place on November 17th, 9am-10am PT. Please remember that anyone can join
the discussion, so feel free to share the Iceberg-Sync
<https://groups.google.com/g/iceberg-sync> Google group with anyone who is
seeking an invite. As usual, the notes and the agenda are posted in the live
doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
that's also attached to the meeting invitation.


The recording has been shared with the Iceberg-Sync Google group. If you
have any issues accessing it, please let me know!

Meeting Recording ⭕
<https://drive.google.com/file/d/1WEXy3VPgsLRIrjsMrHXVydmm4bbEdQBg/view?usp=sharing>

Top of the Meeting Highlights

   - 0.12.1 Released! Thanks to everyone who reviewed the release and
     thanks to Kyle for managing it!
   - Spark 3.2 Progress - added support for things like dynamic filtering to
     work with v2 sources, as well as a new interface for driving sort order
     through table properties. The changes here will be key for MERGE INTO
     support with deltas.
      - Special thanks to Anton, who’s been contributing a lot here!
   - Bug fixes
      - Avro read path
      - Vectorized reader in Spark
   - Delete File Compaction - The normal rewrite files compaction can be
     configured to detect too many delete files for a particular data file and
     compact them (Thanks Jack!)
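
As a rough illustration of the delete file compaction idea above, here is a
minimal Python sketch. It is not Iceberg's implementation: the function name
files_needing_compaction, the threshold value, and the file names are all
hypothetical, and the real rewrite action works on table metadata rather than
a plain dictionary.

```python
def files_needing_compaction(delete_files_by_data_file, threshold):
    """Return data files whose number of attached delete files meets the threshold.

    Hypothetical helper for illustration only; the mapping stands in for
    whatever metadata a real compaction planner would consult.
    """
    return sorted(
        data_file
        for data_file, delete_files in delete_files_by_data_file.items()
        if len(delete_files) >= threshold
    )


# Hypothetical input: data file -> delete files that apply to it.
deletes = {
    "data/f1.parquet": ["del/d1.parquet", "del/d2.parquet", "del/d3.parquet"],
    "data/f2.parquet": ["del/d4.parquet"],
    "data/f3.parquet": [],
}

# With an assumed threshold of 2, only f1 is selected for rewriting.
selected = files_needing_compaction(deletes, threshold=2)
```

Rewriting only the data files that have accumulated many delete files keeps
the cost of compaction proportional to how much read-time merging it removes.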


Upcoming 0.13.0 Release

   - Iceberg 0.13.0 Release Note Draft
     <https://docs.google.com/document/d/18yc8_Q6Hpc_r7JSoQO4oswQSHgHxJFDnr6Zif9_tceA/edit#heading=h.9jffz1lgqlib>
   - We’re aiming to release often, so including pending changes in a
     future release is preferred over delaying a release to squeeze them in.
   - Spark regressions: For the Spark 3.2 branch, some major changes were
     expected for dynamic filtering and all of the row-based commands, so MERGE,
     DELETE FROM and UPDATE are missing in the 3.2 branch. We’re currently
     thinking through how to resolve this before the release, such as
     potentially porting them for now.
   - A new 0.13.0 milestone will be created soon
   - A release candidate can be expected soon, hopefully with the
     resequencing and Alibaba file IO changes merged in


Java and Python Catalog Consistency

   - On a per catalog implementation basis, it makes sense to keep the
     implementations aligned between the Java and Python clients
   - For now, let’s lean on thorough documentation for each catalog type and
     expected behaviors, and then generally look for this consistency during PR
     reviews
   - The REST catalog is probably the most suitable for providing a detailed
     catalog specification
   - Trying to achieve this consistency shouldn’t hold up any of the Python
     development


REST based Catalog

   - This provides a very flexible mechanism for creating various types of
     catalogs
   - Beyond conforming to the REST API specification, this creates room for a
     lot of variability on how the transactions are implemented server-side


RemoveOrphanFilesAction

   - Pull Request #1471 <https://github.com/apache/iceberg/pull/1471>
   - Problem Description: Currently, when deleting orphan files, we diff the
     valid data files against a listing of the directories. Differences between
     the write configuration and the configuration used when deleting orphan
     files can cause some orphan files to go undetected.
   - This has been discussed before and the conclusion was that we should not
     introduce configurations for ignoring certain components of URIs. This
     causes other issues, such as ignoring the authority for S3, which ignores
     the bucket in the URI. More complications are introduced when you consider
     that many tables can share a bucket/prefix.
   - Follow-up: Let’s try and get a comprehensive list of different scenarios
     and implications
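
To make the concern above concrete, here is a minimal Python sketch, not the
RemoveOrphanFilesAction code itself. It assumes hypothetical bucket names,
file paths, and helpers (find_orphans, path_only), and shows one way a
string-based diff becomes unreliable when paths are recorded and listed with
different URI components, plus why ignoring the scheme and authority is risky.

```python
from urllib.parse import urlparse


def find_orphans(valid_files, listed_files):
    """Return listed files that are not referenced, using exact string comparison."""
    valid = set(valid_files)
    return sorted(f for f in listed_files if f not in valid)


def path_only(uri):
    """Drop the scheme and authority, keeping only the path component."""
    return urlparse(uri).path


# Hypothetical: metadata recorded the s3a scheme, the directory listing uses s3.
valid = ["s3a://bucket-a/tbl/data/f1.parquet"]
listed = [
    "s3://bucket-a/tbl/data/f1.parquet",
    "s3://bucket-a/tbl/data/f2.parquet",
]

# Exact comparison: f1 is referenced, yet it no longer matches its own entry.
naive = find_orphans(valid, listed)

# Comparing paths only resolves the scheme mismatch for this case...
valid_paths = {path_only(f) for f in valid}
by_path = [f for f in listed if path_only(f) not in valid_paths]

# ...but the authority (the bucket, for S3) is ignored too, so files in
# different buckets become indistinguishable.
collides = path_only("s3://bucket-a/tbl/data/f1.parquet") == path_only(
    "s3://bucket-b/tbl/data/f1.parquet"
)
```

In this sketch the exact comparison flags both listed files, the path-only
comparison flags only f2, and the last check shows two different buckets
collapsing to the same key. That trade-off is why a list of concrete scenarios
is useful before choosing any normalization.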


Trino Support for Merge on Read/Write

   - There are some serialization concerns here that need to be addressed and
     the current open PRs may get redesigned soon.
   - A lot of JSON serialization is being developed as part of the REST
     catalog implementation so that may solve some of the issues here.
   - Ideally, serialization can be kept somewhat separate from the rest of
     the code base.
   - Schema evolution implications need to be considered here as well.


Thanks everyone!
