Hey Iceberg Nation,

Everyone is welcome to attend the syncs. Subscribe to this calendar <https://calendar.google.com/calendar/embed?src=3905d492f1b450ba0712f2ae6afa76eb757f13d85220cc03aa4527885adc5629%40group.calendar.google.com&ctz=Asia%2FShanghai> to receive notifications.

Note: this meeting note is backdated, as I forgot to post it here earlier.

2023-09-20 (Meeting Recording <https://www.youtube.com/watch?v=MIreG41AabI> ⭕)
- Highlights
  - PyIceberg 0.5.0 has been released 🎉🎉🎉 Thanks, everyone, for contributing!
  - FileIO has been implemented for iceberg-rust, and the catalog is almost there
  - Spark 3.5 support was added (Thanks, Anton!)
  - Added support for distributed planning in Spark (Thanks, Anton!)
  - Spark will push down system.iceberg functions to scans (Thanks, ConeyLiu!)
  - Added AES-GCM encryption and decryption streams (Thanks, Gidon!)
  - Added strict metadata cleanup (Thanks, Amogh!)
  - Vectorized reads for MoR DELETE, UPDATE, and MERGE plans
- Releases
  - Iceberg 1.4.0 – milestone with all pending PRs <https://github.com/apache/iceberg/milestone/35>
    - Spark updates – advisory partition size (PR pending)
    - Spark versions: 3.1 to 3.5?
    - Strict metadata cleanup – yes
    - Use Zstd by default (#8593 <https://github.com/apache/iceberg/pull/8593>)
    - Flink credential refresh issue (#8555 <https://github.com/apache/iceberg/pull/8555>)
- Discussion
  - Parquet metrics problem reported from Trino
  - Defaulting to ResolvingFileIO <https://github.com/apache/iceberg/pull/8272>
  - Discrepancies around null counts <https://github.com/apache/iceberg/issues/8598> for lists, maps, and structs
  - Proposal: introduce a deletion vector file to reduce write amplification <https://docs.google.com/document/d/1FtPI0TUzMrPAFfWX_CA9NL6m6O1uNSxlpDsR-7xpPL0/edit#heading=h.f42to8zz3i0>
  - Nanosecond timestamp & timestamptz – sufficient consensus; what are the next steps?
  - Adding an explicit validation API to DeleteFiles <https://github.com/apache/iceberg/pull/8525/files>, which validates that the files exist when committing the delete
  - Partition Stats spec
  - Encryption update

AI-generated chapter summaries:

0:00 <https://www.youtube.com/watch?v=MIreG41AabI&t=0s> Chapter 1
The team discussed the progress and updates in various implementations, including support for HDFS, modifications to schemas, and the addition of Spark 3.5 support. They also mentioned advancements in function pushdown, metadata encryption, and vectorized reads for merge commands.
11:36 <https://www.youtube.com/watch?v=MIreG41AabI&t=696s> Chapter 2
Anton and Brajesh discussed the need to change the behavior of Spark versions and the default file sizes in Iceberg. They also considered reducing the number of supported Spark versions and highlighted the work on strict metadata cleanup by Amogh.

17:23 <https://www.youtube.com/watch?v=MIreG41AabI&t=1043s> Chapter 3
The team discussed the implementation of strict cleanup in Hive to prevent file corruption and agreed to turn it on by default. They also discussed using Zstandard (Zstd) by default for new tables and made changes to the table metadata object and the REST catalog to accommodate this.

23:08 <https://www.youtube.com/watch?v=MIreG41AabI&t=1388s> Chapter 4
The team discussed an issue with incorrect Iceberg metadata coming from Parquet files, where min and max values were not being truncated properly. They concluded that there was limited action they could take on the Iceberg side and that the underlying issue stemmed from Parquet stats not adhering to the Iceberg spec.

29:11 <https://www.youtube.com/watch?v=MIreG41AabI&t=1751s> Chapter 5
The team discussed various issues related to metadata and FileIO implementation. They considered fixing bugs in Iceberg, exploring defaulting to ResolvingFileIO, and addressing discrepancies in null counts for lists, maps, and structs.

35:00 <https://www.youtube.com/watch?v=MIreG41AabI&t=2100s> Chapter 6
The team discussed the need to differentiate between null counts in nested fields and the top-level field, and decided to drop null counts for arrays but keep them for structs. They also considered introducing a deletion vector to reduce write amplification, but found the proposal unclear and in need of further clarification.
46:48 <https://www.youtube.com/watch?v=MIreG41AabI&t=2808s> Chapter 7
The team discussed the use of deletion vectors to improve performance, but there were many unknowns and decisions to be made about how to maintain and represent them in metadata. They also discussed the implementation of nanosecond timestamps and the potential challenges they may pose for engines like Spark.
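As background for the Parquet metrics issue in Chapter 4: the Iceberg spec requires that a truncated upper bound still be a valid upper bound, which means incrementing the last character of the kept prefix; a writer that truncates without incrementing produces an upper bound smaller than the real maximum, which can wrongly prune files. A minimal illustrative sketch (the function name and logic here are mine, not Iceberg's actual implementation):

```python
def truncate_upper_bound(value: str, width: int):
    """Toy sketch: truncate an upper-bound string to `width` characters while
    keeping it a valid upper bound, by bumping the last kept character."""
    if len(value) <= width:
        return value
    prefix = value[:width]
    # Walk backwards to find a character that can be incremented.
    for i in range(width - 1, -1, -1):
        if ord(prefix[i]) < 0x10FFFF:
            return prefix[:i] + chr(ord(prefix[i]) + 1)
    return None  # no valid truncated upper bound exists

# "icebergs" truncated to width 3: "ice" alone would understate the max,
# so the last character is incremented.
print(truncate_upper_bound("icebergs", 3))  # icf
```

Any string beginning with the kept prefix sorts below the bumped result, so the bound stays correct; plain prefix truncation does not have this property.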
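For intuition on the deletion-vector proposal discussed in Chapters 6 and 7: instead of rewriting data files or accumulating many positional delete files, each data file would carry a bitmap of deleted row positions that can be merged across commits. A toy Python sketch; all names are hypothetical, and a plain set stands in for the compressed bitmap (e.g. a roaring bitmap) a real implementation would likely use:

```python
class DeletionVector:
    """Toy deletion vector for one data file: a set of deleted row positions."""

    def __init__(self):
        self.deleted = set()

    def delete(self, pos: int) -> None:
        self.deleted.add(pos)

    def merge(self, other: "DeletionVector") -> "DeletionVector":
        # Merging vectors from successive commits avoids rewriting the data
        # file itself -- the write amplification the proposal targets.
        merged = DeletionVector()
        merged.deleted = self.deleted | other.deleted
        return merged

    def live_rows(self, rows):
        # Applied at read time: skip positions marked deleted.
        return [r for i, r in enumerate(rows) if i not in self.deleted]


rows = ["a", "b", "c", "d"]
dv1 = DeletionVector(); dv1.delete(1)   # commit 1 deletes row 1
dv2 = DeletionVector(); dv2.delete(3)   # commit 2 deletes row 3
print(dv1.merge(dv2).live_rows(rows))   # ['a', 'c']
```

The open questions noted in the discussion, such as how to maintain and represent these vectors in table metadata, are exactly what this toy version glosses over.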