Hi Iceberg Community,

Here are the minutes and recording from our Iceberg Sync. They will now be posted to the new Apache Iceberg YouTube channel: <https://www.youtube.com/playlist?list=PLkifVhhWtccwcQrNnjEPxbUPX9Q2eCAPO>
Always remember, anyone can join the discussion, so feel free to share the Iceberg-Sync Google group <https://groups.google.com/g/iceberg-sync> with anyone seeking an invite. The notes and the agenda <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> are posted in the Iceberg Sync YouTube description. The agenda doc is also attached to the meeting invitation, and it's an excellent place to add items as you see fit so we can discuss them at the following community sync.

Meeting Recording: <https://www.youtube.com/watch?v=2rOm5TOafxU>
Meeting Transcript: can be found here in the video <https://youtu.be/1lm4Wlpy2wU?t=28>

Attendees: Alex Merced, Ashish Paliwal, Bijan Houle, Brian Olsen, Bryan Keller, Daniel Weeks, Dennis Huo, Dmitri Bourlatchkov, Fokko Driesprong, Jack Ye, Jacqueline Yeung, Jiao Yizheng, Jonas Jiang, Namratha Mysore Keshavaprakash, Rajasekhar Konda, Ryan Blue, Shawn Gordon, Steen Gundersborg, Steve Z, Vicky Bukta, Wing Yew Poon, mohan vamsi

Highlights:
- Apache Iceberg 1.3.0 has been released :tada: :partying_face:
- Added encryption key and AAD to Parquet write builders (Thanks, Gidon!)
- Spark 3.4 supports timestamp_ntz (Thanks, Fokko!)
- Rebuilt Spark MERGE file handling (Thanks, Anton!)

Releases:
- PyIceberg 0.4.0
- Apache Iceberg 1.3.0

Discussion:
- Adaptive split planning in core and Spark (https://github.com/apache/iceberg/pul..., https://github.com/apache/iceberg/pul...)
- Multi-table transactions Catalog API (https://github.com/apache/iceberg/pul...)
- Incremental scan API (https://github.com/apache/iceberg/pul...)
- Views status update
- Partition stats (https://github.com/apache/iceberg/pul...)
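As a rough illustration of the adaptive split planning idea discussed in the sync (adapt split sizes to the amount of data so small scans still parallelize and large scans don't produce a flood of tiny splits), here is a hedged Python sketch. The function name, defaults, and clamping logic are hypothetical, not Iceberg's actual implementation:

```python
def adaptive_split_size(total_scan_bytes: int,
                        target_parallelism: int,
                        min_split: int = 16 * 1024 * 1024,
                        max_split: int = 512 * 1024 * 1024) -> int:
    """Pick a split size so roughly target_parallelism splits cover the
    scan, clamped to [min_split, max_split]. Illustration only."""
    if total_scan_bytes <= 0:
        return min_split
    ideal = total_scan_bytes // target_parallelism or 1
    return max(min_split, min(max_split, ideal))

# Small scan (100 MiB): the ideal split would be tiny, so clamp up to the
# minimum split size rather than creating many near-empty tasks.
print(adaptive_split_size(100 * 1024 * 1024, 64))  # 16777216 (16 MiB)

# Large scan (1 TiB): larger splits keep the split count manageable
# instead of exploding with the default fixed split size.
print(adaptive_split_size(1024 ** 4, 64))          # 536870912 (512 MiB)
```

The point of the sketch is the trade-off raised in the discussion: split size controls parallelism, and picking it as a function of scan size balances worker utilization against scheduling overhead.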
- Metadata deletion and gc.enabled

AI-generated chapter summaries:

0:00 Chapter 1
Brian, Daniel, Dmitri, and Jack discussed updates and improvements across the project, including the release of Apache Iceberg 1.3.0, progress on encryption, the addition of timestamp_ntz support in Spark, and an overhaul of file handling in Spark's MERGE plan. They also mentioned the need to discuss these updates further with the community.

5:33 Chapter 2
Jack suggests sharing their data preparation process with the Spark community.

5:55 Chapter 3
The group discussed different approaches to split planning for parallelism in Spark, including adapting split sizes based on the amount of data and creating larger splits for larger scans. They also considered the right amount of parallelism, balancing the number of workers against the number of partitions to achieve cost savings.

16:41 Chapter 4
The group discussed adaptive split planning and multi-table transactions with extensions to the catalog API. They considered different approaches and methods for implementation, with Fokko expressing satisfaction with the new multi-table API.

27:00 Chapter 5
Jack, Fokko, and others discussed multi-table swaps, incremental changelog scans with deletes, partition stats, and publishing stats. They explored the possibility of putting additional stats like NDVs at the manifest level, but faced challenges in tracking sketches and inflating metadata size. They considered options like partition-level stats and table-level NDVs, and suggested calculating NDVs for individual partitions to weigh in the estimate.

37:35 Chapter 6
Jack and Daniel discussed the importance of tracking statistics like NDV and in-memory column size at the partition level. They also talked about the gc.enabled property and how it applies to tables that don't own their own files.
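For readers unfamiliar with the property mentioned in Chapter 6: gc.enabled is an Iceberg table property, and setting it to false tells maintenance operations not to delete the table's files, which matters for tables that reference files they do not own. A minimal Spark SQL fragment (the table name is hypothetical):

```sql
-- Prevent maintenance operations from deleting this table's files,
-- e.g. when the table references files it does not own.
ALTER TABLE db.events SET TBLPROPERTIES ('gc.enabled' = 'false');
```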
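One reason the NDV discussion centered on sketches rather than plain per-partition counts: NDVs are not additive across partitions, so you cannot sum them into a table-level NDV, but mergeable distinct-count sketches (theta/HLL-style) union correctly. A minimal Python illustration, using exact sets as a stand-in for real sketches (partition names and values are made up):

```python
# Two partitions sharing some values; sets stand in for the mergeable
# distinct-count sketches (theta/HLL) discussed in the sync.
partitions = {
    "2023-06-01": ["a", "b", "c"],
    "2023-06-02": ["b", "c", "d"],
}

# Per-partition NDVs are useful on their own, but summing them
# double-counts values that appear in more than one partition.
per_partition_ndv = {p: len(set(vals)) for p, vals in partitions.items()}
print(sum(per_partition_ndv.values()))  # 6 -- overcounts shared values

# Merging the "sketches" (set union here) gives the true table-level NDV.
merged = set()
for vals in partitions.values():
    merged |= set(vals)
print(len(merged))                      # 4 -- correct distinct count
```

Real sketches trade a small, bounded error for constant size, which is exactly the metadata-size concern raised in Chapter 5.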
47:17 Chapter 7
The group discussed metadata file deletion and whether it should be controlled by gc.enabled. They considered various options, including leaving all garbage files and having a flag to stop any cleanup, but ultimately decided that changing the library's behavior for something that violates the spec would be problematic.

Thanks! See you all at the next sync!