Hello all -- This thread is old, but I wanted to give an update with
newer information rather than spam the dev forum with too much information.

To recap: my previous discussion proposed read-only transaction support
for Hive external tables. This could be supported today using insert-only
tables managed by Hive, which requires a one-time reingestion.
However, the Delta Lake initiative <https://delta.io/> from Databricks
has recently been gaining traction, and it could serve as a neutral
standard for big data tools (not only Hive) to support ACID transactions
natively on the cloud for external tables. Has the community considered
supporting this option? And what would it take to have Hive support this?
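
For context, here is a minimal sketch of the current workaround: a
one-time reingestion into a Hive-managed insert-only table (the table and
column names are placeholders):

  -- Managed insert-only ACID table; Hive coordinates the transactions,
  -- but UPDATE and DELETE are not allowed on it.
  CREATE TABLE events_managed (
    id BIGINT,
    payload STRING
  )
  STORED AS ORC
  TBLPROPERTIES (
    'transactional' = 'true',
    'transactional_properties' = 'insert_only'
  );

  -- The one-time reingestion from the existing external table.
  INSERT INTO TABLE events_managed
  SELECT id, payload FROM events_external;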

This has several advantages over the current approach:

1. ACID transaction support becomes possible on external tables, across
different tools.
2. The metadata can be externalized and is thus more scalable, supporting
millions of partitions (currently, my company's existing metastores,
backed by RDS, struggle at a few hundred thousand to a million external
partitions per table).

It would be great if Hive could support this project for the
aforementioned reasons, and support for the ORC format would be very
welcome as well, since Delta Lake only supports Parquet as of today.
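
To make the ask concrete, here is a hypothetical sketch of what
registering a Delta Lake table in Hive could look like through a storage
handler. The handler class name and the location are illustrative
assumptions on my part, not an existing Hive feature:

  -- Hypothetical: an external Hive table whose ACID semantics come from
  -- the Delta Lake transaction log in the table directory.
  CREATE EXTERNAL TABLE events_delta (
    id BIGINT,
    payload STRING
  )
  STORED BY 'io.delta.hive.DeltaStorageHandler'
  LOCATION 's3://my-bucket/warehouse/events_delta';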

Thai

On Fri, Apr 26, 2019 at 11:31 AM Thai Bui <blquyt...@gmail.com> wrote:

> My suggestion does require a change to your ETL process, but it doesn't
>> require you to copy the data into HDFS or to create storage clusters.
>> Hive managed tables can reside in S3 with no problem.
>
>
> Thanks for pointing this out. I totally forgot that managed tables could
> have an externally specified location. I think we can cope with this
> approach in the short term, but in the long term a more ETL-less approach
> is much preferable, with read-only transactional support for external
> tables, mainly to avoid duplicate copies of data.
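>
> As a minimal sketch (bucket and table names are made up), a managed ACID
> table whose data resides in S3 could be declared like this:
>
>   -- Managed transactional table with an explicit S3 location.
>   CREATE TABLE events_acid (
>     id BIGINT,
>     payload STRING
>   )
>   STORED AS ORC
>   LOCATION 's3a://my-bucket/warehouse/events_acid'
>   TBLPROPERTIES ('transactional' = 'true');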
>
> This is actually a common ask when it comes to OnPrem -> Cloud REPL
>> streams, to avoid diverging.
>> The replicated data having its own updates is very problematic for CDC
>> style ACID replication into the cloud.
>
>
> It's a common problem when the pattern is to replicate data everywhere and
> the users (such as analysts) don't know its full implications, which is
> what we are trying to avoid in the first place. But sometimes it's
> unavoidable if you are going on-prem -> cloud. With ACID support for
> read-only tables, though, we'd give the users an option to "try it out"
> before fully committing to an ETL process to copy/optimize the data.
>
> On Thu, Apr 25, 2019 at 4:54 PM Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> >    reuse the transactional_properties and add 'read_only' as a new
>> >    value. With read-only tables, all INSERT, UPDATE, DELETE statements
>> >    will fail at Hive front-end.
>>
>> This is actually a common ask when it comes to OnPrem -> Cloud REPL
>> streams, to avoid diverging.
>>
>> The replicated data having its own updates is very problematic for CDC
>> style ACID replication into the cloud.
>>
>> Ranger authorization works great for this, though it is all-or-nothing
>> right now.
>>
>> At some point in the future, I wish I could lock up specific fields from
>> being updated in ACID.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
> --
> Thai
>


-- 
Thai
