Here are my notes from yesterday’s sync. As usual, feel free to add to this if I missed something.
There were a couple of questions raised during the sync that we’d like to open up to anyone who wasn’t able to attend:

- Should we wait for the parallel metadata rewrite action before cutting 0.8.0 candidates?
- Should we wait for ORC metrics before cutting 0.8.0 candidates?

In the sync, we thought that it would be good to wait and get these in. Please reply to this if you agree or disagree. Thanks!

*Attendees*:

- Ryan Blue
- Dan Weeks
- Anjali Norwood
- Jun Ma
- Ratandeep Ratti
- Pavan
- Christine Mathiesen
- Gautam Kowshik
- Mass Dosage
- Filip
- Ryan Murray

*Topics*:

- 0.8.0 release blockers: actions, ORC metrics
- Row-level delete update
- Parquet vectorized read update
- InputFormats and Hive support
- Netflix branch

*Discussion*:

- 0.8.0 release
  - Ryan: We planned to get a candidate out this week, but I think we may want to wait on two things that are about ready
  - First: Anton is contributing an action to rewrite manifests in parallel that is close to ready. Anyone interested? (Gautam is interested)
  - Second: ORC is passing correctness tests, but doesn’t have column-level metrics. Should we wait for this?
  - Ratandeep: ORC also lacks predicate push-down support
  - Ryan: I think metrics are more important than PPD because PPD is task-side, while metrics help reduce the number of tasks. If we wait on one, I’d prefer to wait on metrics
  - Ratandeep will look into whether he or Shardul can work on this
  - General consensus was to wait for these features before getting a candidate out
- Row-level deletes
  - Good progress in several PRs on adding the parallel v2 write path, as Owen suggested last sync
  - Junjie contributed an update to the spec for file/position delete files
- Parquet vectorized read
  - Dan: Flat schema reads are primarily waiting on reviews
  - Dan: Is anyone interested in complex type support?
  - Gautam needs struct and map support; Arrow 0.14.0 doesn’t support maps
  - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but not maps of structs
  - Ryan (Blue): Because we have a translation layer in Iceberg to pass data off to Spark, we don’t actually need support in Arrow. We are currently stuck on Arrow 0.14.0 because of changes that prevent us from avoiding a null check (see this comment <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
- InputFormat and Hive support
  - Ratandeep: Generic (mapreduce) InputFormat is in, with hooks for Pig and Hive; need to start working on the SerDe side and building a Hive StorageHandler; DDL support is still missing
  - Ryan: What DDL support?
  - Ratandeep: Statements like ADD PARTITION
  - Ryan: How would all of this work in Hive? It isn’t clear what components are needed right now: StorageHandler? RawStore? HiveMetaHook?
  - Ratandeep: Currently working on only the read path, which requires a StorageHandler. The write path would be more difficult.
  - Mass Dosage: Working on a (mapred) InputFormat for Hive in iceberg-mr; started working on a SerDe in iceberg-hive. Interested in writes, but not in the short or medium term
  - Mass Dosage: The main problem is dependency conflicts between Hive and Iceberg, mainly Guava
  - Ryan: Does anyone know a good replacement for Guava collections?
  - Ryan: In Avro, we have a module that shades Guava <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml> and has a class that references the Guava classes we use <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>, so shade minimization keeps only the referenced classes. We could do that here (a rough sketch of the pattern is below)
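For anyone who wasn’t in the sync, here is a rough sketch of that Avro-style pattern adapted to Iceberg. The package, class name, and the particular Guava classes listed are placeholders, not a decided layout; the idea is just that one class references everything we use from Guava so that shade minimization keeps those classes in the shaded jar.

    // Sketch only: a single class that statically references the Guava classes we
    // use, mirroring Avro's GuavaClasses. With the shade/shadow plugin's minimize
    // option enabled, these references keep the needed classes in the shaded jar.
    // The package and the specific classes below are placeholders.
    package org.apache.iceberg.guava;

    import com.google.common.base.Preconditions;
    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.Iterables;
    import com.google.common.collect.Lists;
    import com.google.common.collect.Maps;
    import com.google.common.collect.Sets;

    @SuppressWarnings("unused")
    public class GuavaClasses {
      static {
        // Class references are enough for the minimizer's reachability analysis.
        Preconditions.class.getName();
        ImmutableList.class.getName();
        ImmutableMap.class.getName();
        Iterables.class.getName();
        Lists.class.getName();
        Maps.class.getName();
        Sets.class.getName();
      }
    }

In Avro, the module’s pom then relocates com.google.common and enables minimizeJar; the Gradle equivalent for us would presumably be the shadow plugin’s relocate plus minimize().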
  - Ryan: Is Jackson also a problem?
  - Mass Dosage: Yes, and Calcite
  - Ryan: Calcite probably isn’t referenced directly, so we can hopefully avoid the consistent-versions problem by excluding it
- Netflix branch of Iceberg (with non-Iceberg additions)
  - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for Spark 2.4 with DSv2 backported <https://github.com/Netflix/spark>
  - The purpose of this is to share non-Iceberg work that we use to complement Iceberg, like views, catalogs, and DSv2 tables
  - Views are SQL views <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view> that are stored and versioned like Iceberg metadata. This is how we are tracking views for Presto and Spark (Coral integration would be nice!). We are contributing the Spark DSv2 ViewCatalog to upstream Spark
  - Metacat is an open-source metastore project from Netflix. The metacat package contains our metastore integration <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat> for it
  - The batch package contains Spark and Hive table implementations for Spark’s DSv2 <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>, which we use for multi-catalog support
  - Gautam: How will migration to Iceberg’s v2 format work for those of us running v1 in production?
  - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be supported in parallel. Using v1 until everything is updated with v2 support takes care of forward-compatibility issues. This can be done on a per-table basis
  - Gautam: Does migration require rewriting metadata?
  - Ryan: No, the v2 format is backward compatible with v1, so the update is metadata-only until writers start using new metadata (deletes) that v1 readers would ignore and that a v1 writer would incorrectly modify if it wrote to a v2 table
  - Ryan: Also, Iceberg already has a forward-compatibility check that will prevent v1 readers from loading a v2 table.

-- 
Ryan Blue
Software Engineer
Netflix