Iceberg Community Meetings are open to everyone. To receive an invitation to the next meeting, please join the iceberg-sync@googlegroups.com <https://groups.google.com/g/iceberg-sync> list. Notes from previous meetings, along with a running agenda for the next meeting, are available here: https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?pli=1#heading=h.z3dncl7gr8m1
21 July 2021
- Releases
  - 0.12 release status
    - Currently blocked on "Handle the case that RewriteFiles and RowDelta commit the transaction at the same time" #2308 <https://github.com/apache/iceberg/issues/2308>. Ryan is working on a fix.
  - Consider dropping support for Spark 3.0 and 3.1 after 0.12, once Spark 3.2 is available
    - Spark 3.2 is set to include many changes to DSv2 that we can leverage to simplify our code, such as eliminating the need to provide our own distribution and sort-ordering utilities for Spark, and the ability to work with Spark expressions directly instead of through Iceberg wrapper code.
    - Should we drop support for 3.0 and 3.1 and target only 3.2 in the next release, to avoid a three-way version split? That split currently looks like it would require an additional, 3.2-specific Spark module.
    - [Anton] This is not just about the tech debt added by shims. It's also about not being able to use certain Spark APIs introduced in newer versions. For example, 3.1 adds the purge flag and structured streaming APIs related to limit support, and 3.2 adds distribution and ordering support. I'm in favor of keeping it simple: release 0.12 with support for all Spark versions, then migrate to Spark 3.2 in the next version of Iceberg.
    - [Ryan] To recap, the main issue is that we would need to bump the Spark version to 3.2 to pull in the new interfaces; then, when you roll back and use that same module on 3.1, the interfaces are missing, so we can't load the classes that implement them. I think we may be able to solve this by not loading the interface until it is actually needed: have a method on the object that copies it and mixes in the 3.2 interface at that point. You can sometimes get away with having an extra class on the classpath as long as the part that depends on the missing interface is never loaded. I'll do some testing and see if I can get this working between Spark 3.2 and 3.1. (A sketch of this pattern follows this meeting's notes.)
    - Conclusion: keep this discussion open a bit longer while Ryan explores whether his approach is viable.
- Slack community
  - [Ryan] At the last meeting we discussed ways of making it easier for community members to join the Iceberg channel on the ASF's Slack workspace. The discussion was tabled when it became known that there's a self-invite link. Unfortunately, it turns out the link regularly breaks, and this time the ASF INFRA team has declined to fix it because of an influx of spammers. Carl created a separate Slack workspace dedicated to Apache Iceberg. I think we should migrate to this workspace, since making it easy for everyone to join and enter the discussion is more important than leveraging the existing ASF infrastructure. Since I'm seeing lots of +1s for this in the chat, I think the next step is to raise the issue on the dev list. (related thread <https://lists.apache.org/thread.html/r4a23572882f421944ed545f5d7dd798b3580c120e7e246a3f604cfcf%40%3Cdev.iceberg.apache.org%3E>, Slack invite link <https://join.slack.com/t/apache-iceberg/shared_invite/zt-tlv0zjz6-jGJEkHfb1~heMCJA3Uycrg>)
  - Addendum: On the dev list thread we decided to move to the apache-iceberg Slack workspace.
- Bucketing with Unicode characters (#2837 <https://github.com/apache/iceberg/issues/2837>)
  - Mateusz Gajewski at Starburst discovered that Iceberg's bucket hash function for Strings generates values that don't adhere to the Iceberg spec when the input String contains Unicode surrogate pairs <https://docs.microsoft.com/en-us/globalization/encoding/surrogate-pairs#:~:text=With%20surrogate%20pairs%2C%20a%20Unicode,over%20one%20million%20additional%20characters.>. The root cause is a bug in Guava's Hashing.murmur3_32().hashString method <https://github.com/google/guava/issues/5648>.
  - The issue is easy to work around in Iceberg by using murmur3_32().hashBytes in place of murmur3_32().hashString (see the sketch after these notes), but what do we need to do to help users who may already have data stored this way?
  - Two approaches were discussed: (1) provide a compatibility mode that would produce both bucket hash values or fall back to the old behavior, and (2) provide a Spark action that users can run to fix their data. People felt (1) was risky on account of the many potential corner cases, and Ryan noted that (2) is something we need to invest in anyway to help people migrate from one partitioning scheme to another.
  - Conclusion: (1) document how to correct the data using MERGE INTO, (2) fix the bucket function, and (3) add a Spark action for correcting the data.
- Z-Ordering
  - Bhavyam Kamal presented his proposal <https://docs.google.com/document/d/1UfGxaB7qlrGzzMk9pBm03oKPOkm-jk-NQVQQvHP-0Bc/edit> for adding Z-Ordering to Iceberg and demoed his prototype implementation. Z-Ordering is a technique for clustering data in multiple dimensions to create mutually exclusive data files, which results in more efficient file pruning when applying predicates. (A minimal bit-interleaving sketch follows these notes.)
  - Conclusion: the plan going forward is to split the work into two phases:
    - 1) Implement merge-sort-based compaction and allow compaction/rewrite of data files using a sort based on a space-filling curve. No planning changes or persisted metrics.
    - 2) Support Transforms with multiple arguments and possible parameterization, store metrics for curve values in data file metrics along with the transform used when writing the file, and modify query planning to use these metrics.
- We ran out of time before getting to the following topics:
  - APIs deprecated in 0.11 and scheduled for removal in 0.12
  - Relative paths in the metadata (design doc <https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0>)
    - JSON metadata location
    - Source of truth for table roots
    - Is there an alternative that supports these use cases better?
  - Sort Ordering
  - Secondary Indexes
  - Commit message format and PR description template
  - Manifest V2 discuss thread
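The following is a minimal sketch of the lazy-loading pattern Ryan described for the Spark 3.1/3.2 split. All names here (SupportsNewApi, BaseScan, WithNewApi) are hypothetical stand-ins, not actual Spark or Iceberg classes; the point is only that the JVM resolves an interface when a class implementing it is first loaded, so a class that mixes in a 3.2-only interface is harmless on 3.1 as long as it is never instantiated:

```java
public class LazyMixinSketch {

  // Stand-in for an interface that exists only in newer Spark versions.
  // On Spark 3.1, this type would be absent from the classpath.
  interface SupportsNewApi {
    String newApiCall();
  }

  // Base implementation that is safe to load on every Spark version.
  static class BaseScan {
    String describe() {
      return "base scan";
    }

    // Returns a copy of this object with the new interface mixed in. HotSpot
    // resolves WithNewApi (and therefore SupportsNewApi) lazily, when this
    // method first runs -- not when BaseScan itself is loaded.
    BaseScan withNewApi() {
      return new WithNewApi();
    }
  }

  // The "copy of the object with the interface mixed in"; only loaded on
  // demand, so a missing interface never breaks Spark 3.1 code paths.
  static class WithNewApi extends BaseScan implements SupportsNewApi {
    @Override
    public String newApiCall() {
      return "behavior backed by the 3.2-only API";
    }
  }

  public static void main(String[] args) {
    BaseScan scan = new BaseScan();
    System.out.println(scan.describe()); // safe on every version
    // Only this call would fail on a runtime that lacks SupportsNewApi:
    System.out.println(((SupportsNewApi) scan.withNewApi()).newApiCall());
  }
}
```

In a real build, BaseScan would carry the scan's state and withNewApi() would copy that state into the subclass; whether class resolution stays lazy enough in practice is exactly what Ryan planned to test.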
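A small sketch of the bucketing workaround discussed above: Guava's murmur3_32().hashString can disagree with hashing the UTF-8 bytes directly when the input contains surrogate pairs, and the byte-based form is the one that matches the Iceberg spec (32-bit murmur3 over UTF-8 bytes). Class and variable names are illustrative:

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.Hashing;

public class BucketHashSketch {
  public static void main(String[] args) {
    // U+10437 lies outside the Basic Multilingual Plane, so Java represents
    // it as the surrogate pair \uD801\uDC37.
    String value = "iceberg\uD801\uDC37";

    // Affected path: hashString mis-encodes surrogate pairs internally
    // (https://github.com/google/guava/issues/5648).
    int viaHashString =
        Hashing.murmur3_32().hashString(value, StandardCharsets.UTF_8).asInt();

    // Workaround from the discussion: encode to UTF-8 explicitly, then hash
    // the bytes. This yields the spec-compliant value.
    int viaHashBytes =
        Hashing.murmur3_32().hashBytes(value.getBytes(StandardCharsets.UTF_8)).asInt();

    // On Guava versions with the bug, these two values differ.
    System.out.println("hashString: " + viaHashString);
    System.out.println("hashBytes:  " + viaHashBytes);
  }
}
```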
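For background on the Z-Ordering discussion, here is a minimal, illustrative bit-interleaving sketch; it is not code from the proposal or the prototype. A Z-order value interleaves the bits of each dimension so that rows close in every dimension sort near each other:

```java
public class ZOrderSketch {

  // Interleave the bits of two 32-bit values into one 64-bit Z-order value:
  // bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
  static long interleave(int x, int y) {
    long z = 0L;
    for (int i = 0; i < 32; i++) {
      z |= ((long) (x >>> i) & 1L) << (2 * i);
      z |= ((long) (y >>> i) & 1L) << (2 * i + 1);
    }
    return z;
  }

  public static void main(String[] args) {
    // x = 1 (binary 01) and y = 2 (binary 10) interleave to 0b1001 = 9.
    System.out.println(interleave(1, 2)); // prints 9
  }
}
```

Sorting rows by the interleaved value before writing clusters similar rows into the same data files, so min/max metrics stay tight on both columns and query planning can skip files for predicates on either one.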