Hi Everyone, Here are the minutes and video recording from our Iceberg Sync that took place on December 8th, 9am-10am PT. Please remember that anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. As usual, the notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation.
Meeting Recording ⭕ <https://drive.google.com/file/d/1cLg8bc1JTslalYpvd5AF3OYN7U_ixO_x/view>

Top of the Meeting Highlights
- Flink 1.14 Support: This was recently merged. It follows the same mechanism we use for the multiple Spark versions: the Flink 1.13 Iceberg runtime was copied over and updated for 1.14 support.
- Rewrite Data Files Stored Procedure: The rewrite data files procedure can now be called from Spark SQL, including with filters. (Thanks Ajantha!) A call sketch follows these notes.
- OSSFileIO: An OSSFileIO implementation has been added, which is very helpful for the many users in China on Aliyun OSS storage. A configuration sketch follows these notes.
- Resequencing Data Files during Rewrites: This allows equality deletes to keep flowing into a table while rewrites run without conflicts. (Thanks Jack!)

Upcoming 0.13.0 Release
- Iceberg 0.13.0 Release Note Draft <https://docs.google.com/document/d/18yc8_Q6Hpc_r7JSoQO4oswQSHgHxJFDnr6Zif9_tceA/edit#heading=h.9jffz1lgqlib>
- The current biggest blocker is the Spark 3.2 feature regressions.
- The plan is to release with just copy-on-write and follow up by addressing any 3.2 regressions afterward. This avoids holding up the release, and an out-of-cycle release can always be done soon after.

Flink State Issue
- A field was added to the middle of DataFile so that it could be exposed in the files metadata table. The resulting mismatch between the read and write schemas causes a problem in Flink, which assumes the two schemas are identical. This prevents recovering from older checkpoints, so it should be fixed as soon as possible. (Added to the 0.13 milestone.)

OSS Runtime Bundle
- Runtime bundles require a large amount of engineering bandwidth for very specific packaging details, particularly around licensing. It makes sense to do this for the Spark and Flink runtimes, but if vendors need other specific bundles, the recommended direction is for them to build and maintain their own runtime bundles.

HadoopCatalog
- For filesystem-only tables to be safe, some kind of lock implementation will be needed. This can be facilitated by a lock manager API; a sketch follows these notes.
- A disclaimer should be included to discourage using this catalog implementation for production workloads.

Iceberg Event Notification Support
- Start supporting event notifications for Iceberg.
- This can facilitate hooking up to SNS, SQS, CloudWatch, etc.
- Should event listeners be configured at the catalog level or the table level?
- The catalog level seems like the better path forward (something similar to how FileIO is plugged in); a hypothetical sketch follows these notes. There may exist use cases out there for table-level configuration, but let's wait for those to appear.

REST catalog spec: Multipart namespaces
- The Iceberg catalog allows multipart namespaces, originally to support Hadoop tables, where a directory can serve as a namespace part.
- Not all engines support nested catalog names.
- How should we represent this in REST paths (slashes vs. dots to separate namespace parts)?
- Dots between namespace parts seem to be the path forward.
- Dots are easy to encode in the name.
- In theory there could be collisions: "a.b" could be "b nested within a" or a single name that contains a dot ("a.b"). However, Iceberg already guards against this, so it shouldn't be an issue. An encoding sketch follows these notes.
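Sketch: Rewrite Data Files from Spark SQL
A minimal sketch of calling the procedure from Spark SQL, assuming the Iceberg Spark runtime is on the classpath and a catalog named "demo" with a table "db.sample" is configured on the session (both names are placeholders):

    import org.apache.spark.sql.SparkSession;

    public class RewriteDataFilesExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("rewrite-data-files-example")
            .getOrCreate();

        // Compact the whole table using the stored procedure.
        spark.sql("CALL demo.system.rewrite_data_files(table => 'db.sample')").show();

        // Rewrite only the files matching a filter, via the `where` argument.
        spark.sql("CALL demo.system.rewrite_data_files("
            + "table => 'db.sample', where => 'id > 100')").show();
      }
    }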
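Sketch: Configuring OSSFileIO
A minimal sketch of routing a Spark catalog's file IO through the new implementation, assuming the iceberg-aliyun module and the OSS SDK are on the classpath. The catalog name, warehouse bucket, and the omitted OSS client credential/endpoint settings are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    public class OssFileIOExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            // "oss_catalog" is a placeholder name; a Hive metastore backs the catalog here.
            .set("spark.sql.catalog.oss_catalog", "org.apache.iceberg.spark.SparkCatalog")
            .set("spark.sql.catalog.oss_catalog.type", "hive")
            .set("spark.sql.catalog.oss_catalog.warehouse", "oss://my-bucket/warehouse")
            // Route data file reads/writes through the Aliyun OSS FileIO implementation.
            .set("spark.sql.catalog.oss_catalog.io-impl", "org.apache.iceberg.aliyun.oss.OSSFileIO");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        spark.sql("SHOW NAMESPACES IN oss_catalog").show();
      }
    }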
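Sketch: A Lock Manager API for HadoopCatalog
To make the lock manager idea concrete, here is a rough sketch of what such an API might look like, with a single-process in-memory implementation. The interface shape is an illustration, not the final API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative lock manager API for filesystem-only tables.
    interface LockManager extends AutoCloseable {
      // Try to acquire the lock on an entity (e.g. a table path) for an owner; true on success.
      boolean acquire(String entityId, String ownerId);

      // Release the lock if held by this owner; true on success.
      boolean release(String entityId, String ownerId);

      void initialize(Map<String, String> properties);
    }

    // Single-JVM example; a real implementation would use an external service
    // (e.g. DynamoDB) so locks are visible across processes.
    public class InMemoryLockManager implements LockManager {
      private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

      @Override
      public boolean acquire(String entityId, String ownerId) {
        // putIfAbsent returns null when no one held the lock; re-acquiring by the
        // same owner also succeeds.
        String holder = locks.putIfAbsent(entityId, ownerId);
        return holder == null || holder.equals(ownerId);
      }

      @Override
      public boolean release(String entityId, String ownerId) {
        return locks.remove(entityId, ownerId);
      }

      @Override
      public void initialize(Map<String, String> properties) {}

      @Override
      public void close() {
        locks.clear();
      }
    }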
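Sketch: Catalog-Level Event Listeners
To picture the catalog-level option, here is a hypothetical sketch of a listener configured once on the catalog (analogous to how FileIO is plugged in) and notified for every table. All names here (CommitEvent, EventListener, ListeningCatalog) are made up for illustration and are not part of Iceberg:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical event carrying the minimum context a notification needs.
    final class CommitEvent {
      final String tableName;
      final long snapshotId;

      CommitEvent(String tableName, long snapshotId) {
        this.tableName = tableName;
        this.snapshotId = snapshotId;
      }
    }

    // An SNS/SQS/CloudWatch-backed implementation would publish the event here.
    interface EventListener {
      void notify(CommitEvent event);
    }

    // Catalog-level wiring: listeners are registered once on the catalog,
    // e.g. loaded from a catalog property, rather than per table.
    class ListeningCatalog {
      private final List<EventListener> listeners = new CopyOnWriteArrayList<>();

      void addListener(EventListener listener) {
        listeners.add(listener);
      }

      void onCommit(String tableName, long snapshotId) {
        CommitEvent event = new CommitEvent(tableName, snapshotId);
        for (EventListener listener : listeners) {
          listener.notify(event);
        }
      }
    }

    public class EventListenerSketch {
      public static void main(String[] args) {
        ListeningCatalog catalog = new ListeningCatalog();
        // A console listener standing in for an external publisher.
        catalog.addListener(event ->
            System.out.println("commit on " + event.tableName + " at snapshot " + event.snapshotId));
        catalog.onCommit("db.sample", 42L);
      }
    }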
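Sketch: Dot-Separated Namespaces in REST Paths
To illustrate the dot-separated encoding under discussion, here is a small sketch that joins namespace parts with dots and percent-encodes the result as one REST path segment. The "/v1/namespaces/..." layout is a placeholder, not the final spec:

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class NamespacePathExample {
      // Join multipart namespace levels with dots and percent-encode the result
      // so it is safe to embed as a single REST path segment.
      static String namespacePath(List<String> levels) {
        String joined = String.join(".", levels);
        return "/v1/namespaces/" + URLEncoder.encode(joined, StandardCharsets.UTF_8);
      }

      public static void main(String[] args) {
        // "b nested within a" -> /v1/namespaces/a.b
        System.out.println(namespacePath(List.of("a", "b")));
        // A single level that itself contains a dot encodes to the same path,
        // which is the theoretical collision noted above; Iceberg already
        // guards against such names, so in practice it shouldn't occur.
        System.out.println(namespacePath(List.of("a.b")));
      }
    }

Thank you!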