Hello Iceberg Community,

Below you can find the minutes and video recording from our Iceberg Sync that took place on *January 19th, 9am-10am PT*.
Always remember, anyone can join the discussion, so feel free to share the Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with anyone who is seeking an invite. The notes and the agenda are posted in the live doc <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web> that's also attached to the meeting invitation. It's a good place to add items as you see fit so we can discuss them in the next community sync.

Minutes:

Meeting Recording ⭕ <https://drive.google.com/file/d/1adbE5ichvfM3_-v2mcH16k1NMiCNjW1O/view>

Top of the Meeting Highlights
- Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
- Added dynamic file filter metrics in Spark 3.0 and 3.1 (Thanks, Chen!)
  - Iceberg-spark metrics are now visible in the Spark UI
    - Candidate files that were scanned
    - Matching files that went on into subsequent stages such as merge or update plans
  - This is useful to check out as an example of how to easily add metrics to some of our custom plans and have them show up in the Spark UI
- Added initial OpenAPI spec for REST catalogs (Thanks, Kyle!)
  - The spec for the REST catalog has been shared and lots of community members have been looking closely at it.
  - This will have a big impact, particularly on SaaS implementations, so the more feedback from the community on this the better.

0.13.0 Release Candidate
- Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
- Sort order in relation to copy-on-write merge might have a few remaining items, but it feels ready to merge now and improve in upcoming releases if necessary.
- Parquet support for the older 2-level list style: If we could get that into the 0.13.0 release, that would be great, although it’s not a blocker and we can follow up with a quick subsequent release. PR #3774 <https://github.com/apache/iceberg/pull/3774>
- Checksum validation on S3 requests is another relevant open pull request.
  This is also not a blocker for 0.13.0 but is close enough that we may want to include it. PR #3813 <https://github.com/apache/iceberg/pull/3813>
- Nessie-related changes for the 0.13.0 release are merged and ready.

Python
- A lot of interest in the community and among our user base
- Some ongoing discussions on how to get review cycles shortened
- Types PR was merged yesterday and there are a couple of current PRs that are very near merge-ready
- Open invitation to anyone interested in participating in the Python refactoring efforts (either contributing or reviewing), so check out the Iceberg Python Sync <https://groups.google.com/g/iceberg-python-sync> if you’re interested!

Java 1.0 API
- Managing delete files is a big request for the 1.0 API. We’ve recently added a delete file threshold to the rewrite files action, which drives how many delete files will remain. You can also set this to 0 to rewrite all delete files. This suffices for removing delete compactions from the remaining high-priority items list.
- Other potential remaining items to consider
  - Expiring delete files that are no longer used by the current snapshots for a boost to storage efficiency. This functionality could simply be added to the expire snapshot action.
- No other high-priority maintenance operations seem to be remaining for the 1.0 release
- More discussion is needed around what should be public/private and how we’ll evolve the API over time.
- Currently, the target is for the next Java API release to be 1.0

Other High Priority Items:

Alternative File Formats
- Ashish has been taking a look at this and it seems very doable. Current formats (ORC, Parquet, etc.) share very similar interfaces that can inform the abstraction.

Encryption
- This is a high-priority item that’s in demand and can hopefully get in soon. There are a few PRs open and community review is welcome.

CDC
- Not necessarily high priority, but there’s a strong desire to get this in this calendar year.
  Refreshing materialized views in Spark is highly dependent on this functionality. Yufei is working on a design doc.

Tagging Snapshots and Searching Snapshots by Tag
- This is useful for allowing tags to be exposed to users for easily retrieving a previous table state. In particular, snapshots could be tagged on an hourly or daily basis for convenient future lookups.

Z-Ordering
- Significant progress has been made on Z-Ordering and one of the current discussions is around its implementation with respect to the magnitude problem. Specifically, normalizing values may require metrics/stats on column distributions.
- In its current state, Apple’s implementation doesn’t include any stats on distribution but is valuable as an initial implementation to get in, to unblock the feature and get it out to the community while solving the magnitude problem in a subsequent release.

REST Catalog
- There’s a lot of interest in the REST catalog and there are at least a few areas of the community that are ready to use it immediately, i.e. Nessie and Apple
- Future talks about pushing more work into the server implementations:
  - Should planning be a part of the REST catalog API?
  - Pagination mechanism for accessing all of the snapshots in a table?

Relative Paths (design doc <https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#heading=h.hxmtkjthp8hm> )
- High priority for a few community members, primarily Apple.
- Many use cases for this feature across the community (surrounding data recovery)

Views
- LinkedIn and Netflix are interested in this effort and picking it back up
- A question is how we’d be able to integrate this back into Spark after it’s added on the Iceberg side

Secondary Indexes
- A priority for the Athena team, but if other members consider this a high priority, please reach out and share details around your use cases.
- Might be useful for cases where you can’t order the data as it’s written to the table.
  Also useful in cases where you need to index by an additional column that’s not included in your sort order.

Handling Wide Tables (many columns)
- With wide tables, the metadata files can get very large given that we store metrics for each column. The MetricsConfig is key for optimizing this.
  - See table configuration <https://iceberg.apache.org/configuration/> for more details
  - In particular, setting `write.metadata.metrics.default` to `none` or tuning this at the column level using `write.metadata.metrics.column.col1`

Thank you all for another great meeting!
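
P.S. For the wide-table metrics settings mentioned above, here is a minimal sketch of how those properties might be assembled into a Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` statement. The property keys are the ones named in the minutes. The table name `db.wide_table`, the column `event_ts`, and the helper `metrics_properties_sql` are hypothetical. The mode values used here (`none`, `counts`, `full`) should be checked against the table configuration page linked above.

```python
def metrics_properties_sql(table, default_mode="none", column_modes=None):
    """Build an ALTER TABLE statement that sets Iceberg metrics properties.

    `table` and the column names are placeholders for illustration only.
    """
    # Table-wide default: "none" turns off per-column metrics collection.
    props = {"write.metadata.metrics.default": default_mode}
    # Per-column overrides for the few columns whose metrics are still wanted.
    for column, mode in (column_modes or {}).items():
        props[f"write.metadata.metrics.column.{column}"] = mode
    assignments = ",\n  ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {assignments}\n)"


# Hypothetical usage: disable metrics by default, keep them for two columns.
sql = metrics_properties_sql(
    "db.wide_table",
    default_mode="none",
    column_modes={"col1": "full", "event_ts": "counts"},
)
print(sql)
```

The resulting string could be passed to `spark.sql(...)` in a Spark session with an Iceberg catalog configured. Turning the default off and re-enabling metrics only for columns used in filters keeps metadata files small without losing file pruning on those columns.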