Owen or Carl,

Do you have any thoughts on this approach? We had previously discussed this, but now that we've looked into it more closely there are a few areas that are unclear.
HiveMetaHook looks like a good entry point for DDL (though as Adrien pointed out, it doesn't cover all operations). However, I'm not clear on where to hook in for DML operations. Is there a similar hook for commits to a table? It seems like hijacking the MoveTask would just be working around the commit problem. Also, I'm not clear on whether this approach removes the intermediate copies of task output made for task/job commit, which we probably want to remove. I assume this could be done with a custom OutputFormat and a custom OutputCommitter.
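To make that a bit more concrete, here's a rough, untested sketch of the committer lifecycle I have in mind. Everything other than the Hadoop and Iceberg APIs (the class name, the "pre-manifest" helpers, loadTable, etc.) is a placeholder for plumbing that would still need to be designed:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.JobStatus;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

// Rough sketch only: IcebergOutputCommitter and the abstract helpers at the
// bottom are placeholders, not existing Hive or Iceberg classes.
public abstract class IcebergOutputCommitter extends OutputCommitter {

  @Override
  public void setupJob(JobContext context) throws IOException {
    // Nothing to stage: tasks write data files directly to their final
    // location under the table, so there is no temporary output directory.
  }

  @Override
  public void setupTask(TaskAttemptContext context) throws IOException {
    // No per-task staging directory either.
  }

  @Override
  public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
    // Task "commit" is only bookkeeping, so it is cheap and always runs.
    return true;
  }

  @Override
  public void commitTask(TaskAttemptContext context) throws IOException {
    // Instead of renaming files, persist the DataFile metadata this task's
    // writer produced (paths, partitions, metrics) somewhere the driver can
    // read it, e.g. a small "pre-manifest" file on HDFS.
    writePreManifest(context, pendingFiles(context));
  }

  @Override
  public void abortTask(TaskAttemptContext context) throws IOException {
    // Data files from a failed attempt were never committed to the table,
    // so they are plain orphans and can simply be deleted.
    deleteFiles(pendingFiles(context));
  }

  @Override
  public void commitJob(JobContext context) throws IOException {
    // The only real commit: append every collected DataFile to the table in
    // one atomic snapshot. No copies or moves of task output are needed.
    Table table = loadTable(context.getConfiguration());
    AppendFiles append = table.newAppend();
    for (DataFile file : readAllPreManifests(context)) {
      append.appendFile(file);
    }
    append.commit();
  }

  @Override
  public void abortJob(JobContext context, JobStatus.State state) throws IOException {
    // Clean up data files and pre-manifests from all tasks; the table's
    // current snapshot never referenced them.
    deleteFiles(readAllPreManifests(context));
  }

  // Plumbing a real implementation would have to define.
  abstract List<DataFile> pendingFiles(TaskAttemptContext context);
  abstract void writePreManifest(TaskAttemptContext context, List<DataFile> files) throws IOException;
  abstract List<DataFile> readAllPreManifests(JobContext context) throws IOException;
  abstract void deleteFiles(List<DataFile> files) throws IOException;
  abstract Table loadTable(Configuration conf);
}

The point being that tasks write data files straight to their final location, only metadata moves between tasks and the driver, and the single real commit happens in commitJob (or in MoveTask, if we follow Adrien's approach below).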
-Dan

On Thu, Jul 25, 2019 at 3:37 PM RD <rdsr...@gmail.com> wrote:

> Hi Adrien,
>   We at LinkedIn went through a similar thought process, but given our
> Hive deployment is not that large, we are in the process of considering
> deprecating Hive and asking our users to move to Spark [since Spark
> supports Hive ql].
>
> I'm guessing we'd have to invest in Spark's catalog AFAICT, but we haven't
> investigated this yet.
>
> -Best.
>
> On Wed, Jul 24, 2019 at 1:53 PM Adrien Guillo
> <adrien.gui...@airbnb.com.invalid> wrote:
>
>> Hi Iceberg folks,
>>
>> In the last few months, we (the data infrastructure team at Airbnb) have
>> been closely following the project. We are currently evaluating potential
>> strategies to migrate our data warehouse to Iceberg. However, we have a
>> very large Hive deployment, which means we can’t really do so without
>> support for Iceberg tables in Hive.
>>
>> We have been thinking about implementation strategies. Here are some
>> thoughts that we would like to share with you:
>>
>> – Implementing a new `RawStore`
>>
>> This is something that has been mentioned several times on the mailing
>> list and seems to indicate that adding support for Iceberg tables in Hive
>> could be achieved without client-side modifications. Does that mean that
>> the Metastore is the only process manipulating Iceberg metadata (snapshots,
>> manifests)? Does that mean that, for instance, the `listPartition*` calls
>> to the Metastore return the DataFiles associated with each partition? Per
>> our understanding, it seems that supporting Iceberg tables in Hive with
>> this strategy will most likely require updating the RawStore interface AND
>> will require at least some client-side changes. In addition, with this
>> strategy the Metastore bears new responsibilities, which contradicts one of
>> the Iceberg design goals: offloading more work to jobs and removing the
>> Metastore as a bottleneck. In the Iceberg world, not much is needed from
>> the Metastore: it just keeps track of the metadata location and provides a
>> mechanism for atomically updating this location (basically, what is done in
>> the `HiveTableOperations` class). We would like to design a solution that
>> relies as little as possible on the Metastore so that in the future we have
>> the option to replace our fleet of Metastores with a simpler system.
>>
>> – Implementing a new `HiveStorageHandler`
>>
>> We are working on implementing custom `InputFormat` and `OutputFormat`
>> classes for Iceberg (more on that in the next paragraph) and they would fit
>> in nicely with the `HiveStorageHandler` and `HiveStoragePredicateHandler`
>> interfaces. However, the `HiveMetaHook` interface does not seem rich enough
>> to accommodate all the workflows; for instance, no hooks run on `ALTER ...`
>> or `INSERT ...` commands.
>>
>> – Proof of concept
>>
>> We set out to write a proof of concept that would allow us to learn and
>> experiment. We based our work on the 2.3 branch. Here’s the state of the
>> project and the paths we explored:
>>
>> DDL commands
>> We support some commands such as `CREATE TABLE ...`, `DESC ...`, `SHOW
>> PARTITIONS`. They are all implemented in the client and mostly rely on the
>> `HiveCatalog` class to do the work.
>>
>> Read path
>> We are in the process of implementing a custom `FileInputFormat` that
>> receives an Iceberg table identifier and a serialized expression
>> `ExprNodeDesc` as input. This is similar in a lot of ways to what you can
>> find in the `PigParquetReader` class in the `iceberg-pig` package or in the
>> `HiveHBaseTableInputFormat` class in Hive.
>>
>> Write path
>> We have made less progress on that front, but we see a path forward by
>> implementing a custom `OutputFormat` that would keep track of the files
>> that are being written and gather statistics. Then, each task can dump this
>> information on HDFS. From there, the final Hive `MoveTask` can merge those
>> “pre-manifest” files to create a new snapshot and commit the new version of
>> a table.
>>
>> We hope that our observations will start a healthy conversation about
>> supporting Iceberg tables in Hive :)
>>
>> Cheers,
>> Adrien
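One more note on the read path described above: if the job conf carries the table identifier and the pushed-down filter, split planning can be driven entirely by Iceberg metadata, with no listPartition* calls to the Metastore and no directory listings. Here's a rough, untested sketch of what getSplits could look like. The conf keys, toSplit() and convertToIceberg() are placeholders, translating an ExprNodeDesc into an Iceberg Expression is the real work, and Hive actually goes through the mapred interfaces rather than the mapreduce ones used here for brevity:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.hive.HiveCatalog;
import org.apache.iceberg.io.CloseableIterable;

// Rough sketch only: the conf keys, toSplit() and convertToIceberg() are
// placeholders, not existing Hive or Iceberg APIs.
public abstract class IcebergInputFormatSketch extends InputFormat<Void, Object> {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    Configuration conf = context.getConfiguration();

    // The job conf carries the table identifier and the filter pushed down
    // by Hive (a serialized ExprNodeDesc).
    Table table = new HiveCatalog(conf)
        .loadTable(TableIdentifier.parse(conf.get("iceberg.mr.table.identifier")));

    TableScan scan = table.newScan();
    Expression filter = convertToIceberg(conf.get("iceberg.mr.filter.expression"));
    if (filter != null) {
      scan = scan.filter(filter);
    }

    // Iceberg plans the scan from its own manifests, so no Metastore
    // partition calls and no file listing are needed here.
    List<InputSplit> splits = new ArrayList<>();
    try (CloseableIterable<CombinedScanTask> tasks = scan.planTasks()) {
      for (CombinedScanTask task : tasks) {
        splits.add(toSplit(task));
      }
    }
    return splits;
  }

  // Translating Hive's serialized ExprNodeDesc into an Iceberg Expression is
  // the hard part and is left out here.
  abstract Expression convertToIceberg(String serializedHiveFilter);

  // Wrap a CombinedScanTask in a serializable InputSplit implementation.
  abstract InputSplit toSplit(CombinedScanTask task);
}

If that shape works for the read side, it also fits naturally behind the HiveStorageHandler and HiveStoragePredicateHandler interfaces that Adrien mentions.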