Here are my notes from yesterday’s sync. As usual, feel free to add to this if I missed something.
There were a couple of questions raised during the sync that we’d like to open up to anyone who wasn’t able to attend:

- Should we wait for the parallel metadata rewrite action before cutting 0.8.0 candidates?
- Should we wait for ORC metrics before cutting 0.8.0 candidates?

In the sync, we thought that it would be good to wait and get these in. Please reply to this if you agree or disagree. Thanks!

*Attendees*:

- Ryan Blue
- Dan Weeks
- Anjali Norwood
- Jun Ma
- Ratandeep Ratti
- Pavan
- Christine Mathiesen
- Gautam Kowshik
- Mass Dosage
- Filip
- Ryan Murray

*Topics*:

- 0.8.0 release blockers: actions, ORC metrics
- Row-level delete update
- Parquet vectorized read update
- InputFormats and Hive support
- Netflix branch

*Discussion*:

- 0.8.0 release
  - Ryan: We planned to get a candidate out this week, but I think we may want to wait on two things that are about ready
  - First: Anton is contributing an action to rewrite manifests in parallel that is close to ready. Anyone interested? (Gautam is interested)
  - Second: ORC is passing correctness tests, but doesn’t have column-level metrics. Should we wait for this?
  - Ratandeep: ORC also lacks predicate push-down support
  - Ryan: I think metrics are more important than PPD because PPD is task-side, while metrics help reduce the number of tasks. If we wait on one, I’d prefer to wait on metrics
  - Ratandeep will look into whether he or Shardul can work on this
  - General consensus was to wait for these features before getting a candidate out
- Row-level deletes
  - Good progress in several PRs on adding the parallel v2 write path, as Owen suggested last sync
  - Junjie contributed an update to the spec for file/position delete files
- Parquet vectorized read
  - Dan: Flat schema reads are primarily waiting on reviews
  - Dan: Is anyone interested in complex type support?
  - Gautam needs struct and map support; Arrow 0.14.0 doesn’t support maps
  - Ryan (Murray): Arrow 0.17.0 will have lists, structs, and maps, but not maps of structs
  - Ryan (Blue): Because we have a translation layer in Iceberg to pass data off to Spark, we don’t actually need support in Arrow. We are currently stuck on Arrow 0.14.0 because of changes that prevent us from avoiding a null check (see this comment <https://github.com/apache/incubator-iceberg/pull/723/files#r367667500>)
- InputFormat and Hive support
  - Ratandeep: Generic (mapreduce) InputFormat is in, with hooks for Pig and Hive; need to start working on the SerDe side and building a Hive StorageHandler; DDL support is still missing
  - Ryan: What DDL support?
  - Ratandeep: Statements like ADD PARTITION
  - Ryan: How would all of this work in Hive? It isn’t clear what components are needed right now: StorageHandler? RawStore? HiveMetaHook?
  - Ratandeep: Currently working on only the read path, which requires a StorageHandler. The write path would be more difficult.
  - Mass Dosage: Working on a (mapred) InputFormat for Hive in iceberg-mr; started working on a SerDe in iceberg-hive. Interested in writes, but not in the short or medium term
  - Mass Dosage: The main problem is dependency conflicts between Hive and Iceberg, mainly Guava
  - Ryan: Does anyone know a good replacement for Guava collections?
  - Ryan: In Avro, we have a module that shades Guava <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/pom.xml> and has a class that references the Guava classes we use <https://github.com/apache/avro/blob/release-1.8.2/lang/java/guava/src/main/java/org/apache/avro/GuavaClasses.java>, so shade minimization keeps only the referenced classes. We could do that here (a rough sketch of the pattern is below)
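For anyone who wasn’t in the sync, here is a rough sketch of that Avro-style pattern adapted to Iceberg. The package, class name, and the particular Guava classes listed are placeholders, not a decided layout; the idea is just that one class references everything we use from Guava so that shade minimization keeps those classes in the shaded jar.

    // Sketch only: a single class that statically references the Guava classes we
    // use, mirroring Avro's GuavaClasses. With the shade/shadow plugin's minimize
    // option enabled, these references keep the needed classes in the shaded jar.
    // The package and the specific classes below are placeholders.
    package org.apache.iceberg.guava;

    import com.google.common.base.Preconditions;
    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.Iterables;
    import com.google.common.collect.Lists;
    import com.google.common.collect.Maps;
    import com.google.common.collect.Sets;

    @SuppressWarnings("unused")
    public class GuavaClasses {
      static {
        // Class references are enough for the minimizer's reachability analysis.
        Preconditions.class.getName();
        ImmutableList.class.getName();
        ImmutableMap.class.getName();
        Iterables.class.getName();
        Lists.class.getName();
        Maps.class.getName();
        Sets.class.getName();
      }
    }

In Avro, the module’s pom then relocates com.google.common and enables minimizeJar; the Gradle equivalent for us would presumably be the shadow plugin’s relocate plus minimize().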
  - Ryan: Is Jackson also a problem?
  - Mass Dosage: Yes, and Calcite
  - Ryan: Calcite probably isn’t referenced directly, so we can hopefully avoid the consistent-versions problem by excluding it
- Netflix branch of Iceberg (with non-Iceberg additions)
  - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based branch <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4> for Spark 2.4 with DSv2 backported <https://github.com/Netflix/spark>
  - The purpose of this is to share non-Iceberg work that we use to complement Iceberg, like views, catalogs, and DSv2 tables
  - Views are SQL views <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/view/src/main/java/com/netflix/bdp/view> that are stored and versioned like Iceberg metadata. This is how we are tracking views for Presto and Spark (Coral integration would be nice!). We are contributing the Spark DSv2 ViewCatalog to upstream Spark
  - Metacat is an open-source metastore project from Netflix. The metacat package contains our metastore integration <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/metacat> for it
  - The batch package contains Spark and Hive table implementations for Spark’s DSv2 <https://github.com/Netflix/iceberg/tree/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch>, which we use for multi-catalog support
  - Gautam: How will migration to Iceberg’s v2 format work for those of us running v1 in production?
  - Ryan: Tables are explicitly updated to v2, and both v1 and v2 will be supported in parallel. Using v1 until everything is updated with v2 support takes care of forward-compatibility issues. This can be done on a per-table basis
  - Gautam: Does migration require rewriting metadata?
  - Ryan: No, the v2 format is backward compatible with v1, so the update is metadata-only until writers start using new metadata (deletes) that v1 readers would ignore and that a v1 writer would incorrectly modify if it wrote to a v2 table
  - Ryan: Also, Iceberg already has a forward-compatibility check that will prevent v1 readers from loading a v2 table.

-- 
Ryan Blue
Software Engineer
Netflix