Hi everyone,

Here is a doc for upcoming agendas and notes from the community sync: https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=sharing
Everyone should be able to comment using that link and a Google account. If you'd like to add agenda items, please request editor access and I'll add you. As discussed in the sync, we'll try to use that doc as a running log. The notes may be a bit incomplete, since I accidentally closed the text editor where I was taking notes without saving, so this summary is from memory. Feel free to comment and fix it! I'm also going to copy the notes below:

27 May 2020

- Agenda doc
  - Dan: Is there an agenda for this meeting?
  - Ryan: There is one in the invite, but it was empty.
  - Dan: We should use a doc for notes and the next agenda.
- Guava is now shaded in iceberg-bundled-guava. Please let us know if you hit problems.
- Update the build from gradle-consistent-versions
  - Ryan: We currently use gradle-consistent-versions. That's great in some ways, but it doesn't allow us to have multiple versions of the same dependency in different modules, which blocks adding a Spark 3 module to master. It is also blocking the MR/Hive work because Hive uses a different Guava version. Even though we shade Guava to avoid the conflict, the plugin can only support one version.
  - Ryan: I've gone through the options and the best one looks like the Nebula plugins (maintained by Netflix). There is a PR open to move to these, #1067 <https://github.com/apache/iceberg/pull/1067>. Please have a look at the PR and comment!
  - Ratandeep: What are the drawbacks of the locking built into Gradle?
  - Ryan: It doesn't use versions.props, so changes are larger; use is awkward because locking is a side effect of other tasks; and it is much newer and requires bumping to a new major version of Gradle. Also, I can ask the Nebula team for support if we use their plugins.
- Bloom filters for GDPR use cases
  - Miao: For GDPR data requests, we need to scan tables for specific IDs. Even batching the requests together to minimize the number of scans, this is very expensive for large tables.
    Matching records are usually stored in just a few files, so keeping bloom filters for ID columns reduces cost significantly.
  - Ryan: Why doesn't using partitioning help? We normally recommend bucketing or sorting by ID columns.
  - Miao: ID columns change between user schemas, and requests may use a secondary ID that is not part of the table layout.
  - Owen: Why do this at the table level, if bloom filters are already supported in ORC and Parquet? Doesn't that duplicate work?
  - Miao: We didn't want to be tied to a specific format.
  - Owen: Are bloom filters the right solution? It's easy to misconfigure them.
  - Ryan: I agree they are easy to get wrong, but I think there is a good argument for this as a secondary index that is independent of the file format. Bloom filters are hard to get right, especially if you're trying to build them while minimizing memory consumption for a columnar file format. Usually, parameters are chosen up front and might be wrong. At the table level, this work could be moved offline, so indexes are maintained by a service that can do a better job choosing the right tuning parameters for each file's bloom filter.
  - Ryan: I think we should think of this as a secondary index. We might have other techniques for secondary indexes; this one happens to use a bloom filter. We've had other people interested in secondary indexes, so maybe this is a good opportunity to add a way to track and maintain them.
  - Miao agreed to write up their use case and approach to start the discussion on secondary indexes.
- Update on row-level deletes
  - Ryan: If you want to get involved, we've updated the milestone with tasks. Tasks like writing a row filter using a set of equality values should be good candidates because they can be written independently and tested in isolation before the other work is done.
  - Ryan: Another area that could use help is updating tests for sequence numbers. We're running all of the operations tests that extend TableTestBase on both v1 and v2 tables.
    I've added a way to make assertions for v1 and v2, using V1Assert.assertEquals or V2Assert.assertEquals, so we can go back and add assertions to all of the existing tests that exercise lots of different cases. It would be great to have more help adding those sequence number assertions!
  - Ryan: In the last few weeks, we've added content types to metadata that track whether files in the metadata tree contain deletes or data. There's currently an open PR, #1064 <https://github.com/apache/iceberg/pull/1064>, that adds DeleteFile and extends readers and writers to work with it. We decided last sync to store either delete files or data files in a manifest, but not both; using separate interfaces enforces this in Java. I'm also working on a branch that separates the manifests in a snapshot into delete manifests and data manifests, which will help us identify everything that needs to be updated to support delete manifests.
  - Ryan: If you'd like to help review, please speak up and we'll tag you on issues. (Gautham Kowshik and Ryan Murray volunteered.)

-- Ryan Blue
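To make the per-file bloom-filter pruning that Miao and Ryan discussed concrete, here is a toy Java sketch. Everything in it (TinyBloom, candidateFiles, the file names, and the filter parameters) is made up for illustration and is not Iceberg code or any proposed API: an ID lookup consults a small bloom filter kept per data file and only scans files whose filter says the ID might be present.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-file bloom-filter secondary index.
public class BloomIndexSketch {

  // A deliberately simple bloom filter: m bits, k hash probes per value.
  static class TinyBloom {
    private final long[] bits;
    private final int m; // number of bits
    private final int k; // number of probes per value

    TinyBloom(int m, int k) {
      this.m = m;
      this.k = k;
      this.bits = new long[(m + 63) / 64];
    }

    private int probe(String value, int i) {
      int h1 = value.hashCode();
      int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9; // second, derived hash
      return Math.floorMod(h1 + i * h2, m);
    }

    void add(String value) {
      for (int i = 0; i < k; i++) {
        int b = probe(value, i);
        bits[b >> 6] |= 1L << (b & 63);
      }
    }

    // May return false positives, but never false negatives.
    boolean mightContain(String value) {
      for (int i = 0; i < k; i++) {
        int b = probe(value, i);
        if ((bits[b >> 6] & (1L << (b & 63))) == 0) {
          return false;
        }
      }
      return true;
    }
  }

  // Prune the file list for an ID lookup: only files whose filter
  // might contain the ID need to be scanned at all.
  static List<String> candidateFiles(Map<String, TinyBloom> filtersByFile, String id) {
    List<String> candidates = new ArrayList<>();
    for (Map.Entry<String, TinyBloom> e : filtersByFile.entrySet()) {
      if (e.getValue().mightContain(id)) {
        candidates.add(e.getKey());
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    Map<String, TinyBloom> filtersByFile = new LinkedHashMap<>();
    TinyBloom f1 = new TinyBloom(1 << 16, 5);
    f1.add("user-123");
    TinyBloom f2 = new TinyBloom(1 << 16, 5);
    f2.add("user-456");
    filtersByFile.put("data-00001.parquet", f1);
    filtersByFile.put("data-00002.parquet", f2);

    // Only data-00001.parquet needs to be scanned for user-123
    // (barring an astronomically unlikely false positive).
    System.out.println(candidateFiles(filtersByFile, "user-123"));
  }
}
```

The point Ryan made above is why the m and k parameters are hard-coded here only for brevity: a table-level service maintaining these filters offline could choose them per file based on observed cardinality, instead of fixing them up front at write time.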