Meeting Minutes from 2023-10-11 Iceberg Sync

Brian Olsen Thu, 26 Oct 2023 14:25:57 -0700

Hey Iceberg Nation,
Everyone is welcome to attend syncs. Subscribe to this calendar
<https://calendar.google.com/calendar/embed?src=3905d492f1b450ba0712f2ae6afa76eb757f13d85220cc03aa4527885adc5629%40group.calendar.google.com&ctz=Asia%2FShanghai>
to receive a notification. Note: This meeting note is backdated as I forgot
to post it here earlier.


2023-10-11 (Meeting Recording <https://youtu.be/euWtAKo_bV4> ⭕)

   - Highlights
      - 1.4.0 was released! (Thanks, Anton!)
         - v2 and zstd defaults
         - Advisory partition size in Spark
         - Skip local sort for unordered writes in Spark
         - Distributed planning in Spark
         - AzureFileIO
         - Multi-table commits through REST
         - Removed Spark 3.1
      - Python moved to the iceberg-python
        <https://github.com/apache/iceberg-python> repo (removed from main)
      - Flink alter table column support was added
        <https://github.com/apache/iceberg/pull/7628> (1.17 only), such as
        adding a new column or changing a column's position (Thanks,
        Yanghao Lin)
      - Metastore catalog support for views was added (Thanks, Eduard!)
      - Close to write support in Python, supports v1 and v2 metadata
        (Thanks, Fokko!)
      - Rust added read support for manifest lists (Thanks, ZENOTME)
      - Spark: clean up FileIO resources on executors (Thanks, Anton!)
   - Discussion
      - PR commit methods – standardize on squash?
      - Iceberg docs refactor <https://github.com/apache/iceberg/pull/8659> (try
        me <https://github.com/bitsondatadev/iceberg/tree/new-docs/docs-new>)
      - Spec v3 changes:
         - New types
            - BLOB
            - BSON/JSON
            - Timestamp{tz}_{ns,ms}
              <https://docs.google.com/document/d/1bE1DcEGNzZAMiVJSZ0X1wElKLNkT9kRkk0hDlfkXzvU/edit>
              (not millis)
            - FLOAT16?
         - Default values
         - Type promotion
            - * to string (choose a format)
               - What are the use cases for changing the type?
               - int/long to string
               - float/timestamp - why?
               - Bool to string should be allowed
            - Long to timestamp (must be millis)
         - Multi-column transforms
            - Bucket v2
            - Geo?
         - Location/path requirements (recommendations)
         - Owned locations (discussion
           <https://lists.apache.org/thread/3fx8povnsq0f4g1xzj38snplr6d3ch1r>)
         - Delete vectors (discussion
           <https://lists.apache.org/thread/gr3g5rrr60fhvy0mrdj4s4w9x8c3v58g>)
         - Allowing relative paths
      - Partition stats spec and discussion in PR 7105
        <https://github.com/apache/iceberg/pull/7105>.
      - Kafka Connect (discussion
        <https://lists.apache.org/thread/d9h22z2ydcpvjxp53yl6w96xoy3dp33h>)

AI-generated chapter summaries:

0:00 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=0s> Chapter 0
Introduction

5:14 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=314s> Chapter 1
Highlights. Ryan thanks Anton for releasing v1.4 with many bug fixes and
changes, including defaulting to the v2 format and Zstandard (zstd) for data
compression. Azure file IO is now available, with native support for
multi-table commits in Spark. The PyIceberg project moved to a new
repository and new Python support was added.

12:01 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=721s> Chapter 2 PR
commit methods and repository setup. Anton highlights recent improvements
in Spark, including file cleanup and manifest file read support, and plans
to discuss spec v3 changes with the community. The group discusses PR
commit methods, suggesting standardizing across repositories on squash and
merge by default, rather than merge commits. There was concern about
enforcing linear history on the Java side, citing potential issues with
rebase and time zones. One suggestion was to bring the issue of
inconsistent commit messages to the community for resolution. A consensus
is built around squashing commits to make them more meaningful and easier
to understand.

19:30 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=1170s> Chapter 3
Improving Iceberg docs with a mono repo. Brian is refactoring the Iceberg
documentation to move it back into the main Iceberg repo, simplifying
maintenance and improving collaboration. He proposes creating a single
documentation site containing the static site and all versions of the
docs, solving problems with multiple sources and making releases easier.
The plan is to merge an initial PR and build consensus, then replace the
current ASF documentation branch and repoint it back to the main repo.
We're creating a nightly branch for documentation changes and maintaining
it as an up-to-date snapshot. The README file on the branch will have all
the necessary information for building and understanding the project.

26:59 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=1619s> Chapter 4 V3
spec changes for data storage. The team discusses v3 spec changes,
including partition stats, which may not be included in v3 due to a lack of
need for backward compatibility. If partition stats are required for v3,
that would need to be decided and implemented separately from the main v3
discussion. Everyone should be aware that multi-column transforms are a
v3-only change and are likely to break in v2. There are also some potential
forward-breaking changes for Hadoop v3, including location path
requirements and the delete vector proposal.

34:39 <https://www.youtube.com/watch?v=euWtAKo_bV4&t=2079s> Chapter 5
Metadata requirements for Iceberg V3
