Hi Iceberg Community,

Here are the minutes and recording from our Iceberg Sync. They will now be posted to the new Apache Iceberg YouTube channel: <https://www.youtube.com/playlist?list=PLkifVhhWtccwcQrNnjEPxbUPX9Q2eCAPO>
Always remember, anyone can join the discussion, so feel free to share the Iceberg-Sync Google group <https://groups.google.com/g/iceberg-sync> with anyone seeking an invite. The notes and the agenda <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> are posted in the Iceberg Sync YouTube description. The agenda doc is also attached to the meeting invitation, and it's an excellent place to add items as you see fit so we can discuss them at the following community sync.

Meeting Recording: <https://www.youtube.com/watch?v=2rOm5TOafxU>
Meeting Transcript: can be found here in the video <https://youtu.be/1lm4Wlpy2wU?t=28>

Attendees: Alex Merced, Ashish Paliwal, Bijan Houle, Brian Olsen, Bryan Keller, Daniel Weeks, Dennis Huo, Dmitri Bourlatchkov, Fokko Driesprong, Jack Ye, Jacqueline Yeung, Jiao Yizheng, Jonas Jiang, Namratha Mysore Keshavaprakash, Rajasekhar Konda, Ryan Blue, Shawn Gordon, Steen Gundersborg, Steve Z, Vicky Bukta, Wing Yew Poon, mohan vamsi

Highlights:
- Apache Iceberg 1.3.0 has been released :tada: :partying_face:
- Added encryption key and AAD to Parquet write builders (Thanks, Gidon!)
- Spark 3.4 supports timestamp_ntz (Thanks, Fokko!)
- Rebuilt Spark MERGE file handling (Thanks, Anton!)

Releases:
- PyIceberg 0.4.0
- Apache Iceberg 1.3.0

Discussion:
- Adaptive split planning in core and Spark (https://github.com/apache/iceberg/pul..., https://github.com/apache/iceberg/pul...)
- Multi-table transactions Catalog API (https://github.com/apache/iceberg/pul...)
- Incremental scan API (https://github.com/apache/iceberg/pul...)
- Views status update
- Partition stats (https://github.com/apache/iceberg/pul...)
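As a rough illustration of the adaptive split planning idea discussed in the sync (adapt split sizes to the amount of data so small scans still parallelize and large scans don't produce a flood of tiny splits), here is a hedged Python sketch. The function name, defaults, and clamping logic are hypothetical, not Iceberg's actual implementation:

```python
def adaptive_split_size(total_scan_bytes: int,
                        target_parallelism: int,
                        min_split: int = 16 * 1024 * 1024,
                        max_split: int = 512 * 1024 * 1024) -> int:
    """Pick a split size so roughly target_parallelism splits cover the
    scan, clamped to [min_split, max_split]. Illustration only."""
    if total_scan_bytes <= 0:
        return min_split
    ideal = total_scan_bytes // target_parallelism or 1
    return max(min_split, min(max_split, ideal))

# Small scan (100 MiB): the ideal split would be tiny, so clamp up to the
# minimum split size rather than creating many near-empty tasks.
print(adaptive_split_size(100 * 1024 * 1024, 64))  # 16777216 (16 MiB)

# Large scan (1 TiB): larger splits keep the split count manageable
# instead of exploding with the default fixed split size.
print(adaptive_split_size(1024 ** 4, 64))          # 536870912 (512 MiB)
```

The point of the sketch is the trade-off raised in the discussion: split size controls parallelism, and picking it as a function of scan size balances worker utilization against scheduling overhead.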
- Metadata deletion and gc.enabled

AI-generated chapter summaries:

0:00 Chapter 1
Brian, Daniel, Dmitri, and Jack discussed updates and improvements across the project, including the release of Apache Iceberg 1.3.0, progress on encryption, the addition of timestamp_ntz support in Spark, and an overhaul of file handling in Spark's MERGE plan. They also mentioned the need to discuss these updates further with the community.

5:33 Chapter 2
Jack suggests sharing their data preparation process with the Spark community.

5:55 Chapter 3
The group discussed different approaches to split planning for parallelism in Spark, including adapting split sizes based on the amount of data and creating larger splits for larger scans. They also considered the right amount of parallelism, balancing the number of workers against the number of partitions to achieve cost savings.

16:41 Chapter 4
The group discussed adaptive split planning and multi-table transactions with extensions to the catalog API. They considered different approaches and methods for implementation, with Fokko expressing satisfaction with the new multi-table API.

27:00 Chapter 5
Jack, Fokko, and others discussed multi-table swaps, incremental changelog scans with deletes, partition stats, and publishing stats. They explored the possibility of putting additional stats like NDVs at the manifest level, but faced challenges in tracking sketches and inflating metadata size. They considered options like partition-level stats and table-level NDVs, and suggested calculating NDVs for individual partitions to weigh in the estimate.

37:35 Chapter 6
Jack and Daniel discussed the importance of tracking statistics like NDV and in-memory column size at the partition level. They also talked about the gc.enabled property and how it applies to tables that don't own their own files.
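For readers unfamiliar with the property mentioned in Chapter 6: gc.enabled is an Iceberg table property, and setting it to false tells maintenance operations not to delete the table's files, which matters for tables that reference files they do not own. A minimal Spark SQL fragment (the table name is hypothetical):

```sql
-- Prevent maintenance operations from deleting this table's files,
-- e.g. when the table references files it does not own.
ALTER TABLE db.events SET TBLPROPERTIES ('gc.enabled' = 'false');
```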
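One reason the NDV discussion centered on sketches rather than plain per-partition counts: NDVs are not additive across partitions, so you cannot sum them into a table-level NDV, but mergeable distinct-count sketches (theta/HLL-style) union correctly. A minimal Python illustration, using exact sets as a stand-in for real sketches (partition names and values are made up):

```python
# Two partitions sharing some values; sets stand in for the mergeable
# distinct-count sketches (theta/HLL) discussed in the sync.
partitions = {
    "2023-06-01": ["a", "b", "c"],
    "2023-06-02": ["b", "c", "d"],
}

# Per-partition NDVs are useful on their own, but summing them
# double-counts values that appear in more than one partition.
per_partition_ndv = {p: len(set(vals)) for p, vals in partitions.items()}
print(sum(per_partition_ndv.values()))  # 6 -- overcounts shared values

# Merging the "sketches" (set union here) gives the true table-level NDV.
merged = set()
for vals in partitions.values():
    merged |= set(vals)
print(len(merged))                      # 4 -- correct distinct count
```

Real sketches trade a small, bounded error for constant size, which is exactly the metadata-size concern raised in Chapter 5.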
47:17 Chapter 7
The group discussed metadata file deletion and whether it should be controlled by gc.enabled. They considered various options, including leaving all garbage files and having a flag to stop any cleanup, but ultimately decided that changing the library's behavior for something that violates the spec would be problematic.

Thanks! See you all at the next sync!