Hey Iceberg Nation,

Everyone is welcome to attend the syncs. Subscribe to this calendar <https://calendar.google.com/calendar/embed?src=3905d492f1b450ba0712f2ae6afa76eb757f13d85220cc03aa4527885adc5629%40group.calendar.google.com&ctz=Asia%2FShanghai> to receive notifications.

Note: this meeting note is backdated, as I forgot to post it here earlier.

2023-09-20 (Meeting Recording <https://www.youtube.com/watch?v=MIreG41AabI> ⭕)
- Highlights
  - PyIceberg 0.5.0 has been released 🎉🎉🎉 Thanks, everyone, for contributing!
  - FileIO has been implemented for iceberg-rust, and the catalog is almost there
  - Spark 3.5 support was added (Thanks, Anton!)
  - Added support for distributed planning in Spark (Thanks, Anton!)
  - Spark will push down system.iceberg functions to scans (Thanks, ConeyLiu!)
  - Added AES-GCM encryption and decryption streams (Thanks, Gidon!)
  - Added strict metadata cleanup (Thanks, Amogh!)
  - Vectorized reads for MoR DELETE, UPDATE, and MERGE plans
- Releases
  - Iceberg 1.4.0 – milestone with all pending PRs <https://github.com/apache/iceberg/milestone/35>
    - Spark updates – advisory partition size (PR pending)
    - Spark versions: 3.1 to 3.5?
    - Strict metadata cleanup – yes
    - Use Zstd by default (#8593 <https://github.com/apache/iceberg/pull/8593>)
    - Flink credential refresh issue (#8555 <https://github.com/apache/iceberg/pull/8555>)
- Discussion
  - Parquet metrics problem reported from Trino
  - Defaulting to ResolvingFileIO <https://github.com/apache/iceberg/pull/8272>
  - Discrepancies around null counts <https://github.com/apache/iceberg/issues/8598> for lists, maps, and structs
  - Proposal: introduce a deletion vector file to reduce write amplification <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.f42to8zz3i0>
  - Nanosecond timestamp & timestamptz – sufficient consensus; what are the next steps?
  - Adding an explicit validation API to DeleteFiles <https://github.com/apache/iceberg/pull/8525/files>, which validates that the files exist when committing the delete
  - Partition Stats spec
  - Encryption update

AI-generated chapter summaries:

0:00 <https://www.youtube.com/watch?v=MIreG41AabI&t=0s> Chapter 1
The team discussed the progress and updates in various implementations, including support for HDFS, modifications to schemas, and the addition of Spark 3.5 support. They also mentioned advancements in function pushdown, metadata encryption, and vectorized reads for merge commands.
11:36 <https://www.youtube.com/watch?v=MIreG41AabI&t=696s> Chapter 2
Anton and Brajesh discussed the need to change the behavior of Spark versions and the default file sizes in Iceberg. They also considered reducing the number of supported Spark versions and highlighted the work on strict metadata cleanup by Amogh.

17:23 <https://www.youtube.com/watch?v=MIreG41AabI&t=1043s> Chapter 3
The team discussed the implementation of strict cleanup in Hive to prevent file corruption and agreed to turn it on by default. They also discussed using Zstandard (Zstd) by default for new tables and made changes to the table metadata object and the REST catalog to accommodate this.

23:08 <https://www.youtube.com/watch?v=MIreG41AabI&t=1388s> Chapter 4
The team discussed an issue with incorrect Iceberg metadata coming from Parquet files, where min and max values were not being truncated properly. They concluded that there was limited action they could take on the Iceberg side and that the underlying issue stemmed from Parquet stats not adhering to the Iceberg spec.

29:11 <https://www.youtube.com/watch?v=MIreG41AabI&t=1751s> Chapter 5
The team discussed various issues related to metadata and FileIO implementation. They considered fixing bugs in Iceberg, exploring defaulting to ResolvingFileIO, and addressing discrepancies in null counts for lists, maps, and structs.

35:00 <https://www.youtube.com/watch?v=MIreG41AabI&t=2100s> Chapter 6
The team discussed the need to differentiate between null counts in nested fields and the top-level field, and decided to drop null counts for arrays but keep them for structs. They also considered introducing a deletion vector to reduce write amplification, but found the proposal unclear and in need of further clarification.
46:48 <https://www.youtube.com/watch?v=MIreG41AabI&t=2808s> Chapter 7
The team discussed the use of deletion vectors to improve performance, but there were many unknowns and decisions to be made about how to maintain and represent them in metadata. They also discussed the implementation of nanosecond timestamps and the potential challenges they may pose for engines like Spark.
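As background for the Parquet metrics issue in Chapter 4: the Iceberg spec requires that a truncated upper bound still be a valid upper bound, which means incrementing the last character of the kept prefix; a writer that truncates without incrementing produces an upper bound smaller than the real maximum, which can wrongly prune files. A minimal illustrative sketch (the function name and logic here are mine, not Iceberg's actual implementation):

```python
def truncate_upper_bound(value: str, width: int):
    """Toy sketch: truncate an upper-bound string to `width` characters while
    keeping it a valid upper bound, by bumping the last kept character."""
    if len(value) <= width:
        return value
    prefix = value[:width]
    # Walk backwards to find a character that can be incremented.
    for i in range(width - 1, -1, -1):
        if ord(prefix[i]) < 0x10FFFF:
            return prefix[:i] + chr(ord(prefix[i]) + 1)
    return None  # no valid truncated upper bound exists

# "icebergs" truncated to width 3: "ice" alone would understate the max,
# so the last character is incremented.
print(truncate_upper_bound("icebergs", 3))  # icf
```

Any string beginning with the kept prefix sorts below the bumped result, so the bound stays correct; plain prefix truncation does not have this property.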
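For intuition on the deletion-vector proposal discussed in Chapters 6 and 7: instead of rewriting data files or accumulating many positional delete files, each data file would carry a bitmap of deleted row positions that can be merged across commits. A toy Python sketch; all names are hypothetical, and a plain set stands in for the compressed bitmap (e.g. a roaring bitmap) a real implementation would likely use:

```python
class DeletionVector:
    """Toy deletion vector for one data file: a set of deleted row positions."""

    def __init__(self):
        self.deleted = set()

    def delete(self, pos: int) -> None:
        self.deleted.add(pos)

    def merge(self, other: "DeletionVector") -> "DeletionVector":
        # Merging vectors from successive commits avoids rewriting the data
        # file itself -- the write amplification the proposal targets.
        merged = DeletionVector()
        merged.deleted = self.deleted | other.deleted
        return merged

    def live_rows(self, rows):
        # Applied at read time: skip positions marked deleted.
        return [r for i, r in enumerate(rows) if i not in self.deleted]


rows = ["a", "b", "c", "d"]
dv1 = DeletionVector(); dv1.delete(1)   # commit 1 deletes row 1
dv2 = DeletionVector(); dv2.delete(3)   # commit 2 deletes row 3
print(dv1.merge(dv2).live_rows(rows))   # ['a', 'c']
```

The open questions noted in the discussion, such as how to maintain and represent these vectors in table metadata, are exactly what this toy version glosses over.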