Re: A proposal for read-only external table for cloud-native Hive deployment

Alan Gates Wed, 24 Apr 2019 16:06:01 -0700

Would a workflow like the following work then:
1. Non-Hive tool produces data
2. Do a Hive load into a managed table.  This effectively takes a snapshot
of the data.
3. Now you still have the data for Non-Hive tools to operate on, and in
Hive you get all the Hive 3 goodness.
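
As a rough sketch of steps 1-3 (the table names, columns and bucket path
below are hypothetical, and INSERT ... SELECT is only one way to do the
load, assumed here because it leaves the original files in place):

```sql
-- Step 1: data written by a non-Hive tool, exposed through an external table.
-- ext_events and its columns are made-up names for illustration.
CREATE EXTERNAL TABLE ext_events (id BIGINT, payload STRING)
STORED AS ORC
LOCATION 's3://some-bucket/some-dir';

-- Step 2: a managed, insert-only ACID table that Hive controls.
CREATE TABLE managed_events (id BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

-- Copy the current contents of the external table; this is the snapshot.
INSERT INTO TABLE managed_events SELECT * FROM ext_events;

-- Step 3: non-Hive tools keep using s3://some-bucket/some-dir, while Hive
-- queries go against managed_events.
```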


This would introduce an additional copy of the data.  It would be
interesting to look at adding copy-on-write semantics to a partition to
avoid this copy, but you don't need that to get going.

I'm not opposed to what you're suggesting; I'm just wondering if there are
other ways that would save you work and keep Hive simpler.

Alan.

On Wed, Apr 24, 2019 at 2:07 PM Thai Bui <[email protected]> wrote:

> As I understand it, insert-only ACID tables only work if your table is a
> managed table (so you'll have to create your table with CREATE TABLE
> .. TBLPROPERTIES ('transactional_properties'='insert_only') ) and Hive will
> control the data layout.
>
> Unfortunately, in my case, I'm concerned with external tables where data is
> written by other tools such as Spark, PySpark, Sqoop or older Hive clusters
> and Hadoop-based systems to cloud storage such as S3. My wish is to have
> materialized views and query result caching work directly on those data if
> and only if the table is registered as an external, read-only table in Hive
> 3 via the same ACID mechanism.
>
> On Wed, Apr 24, 2019 at 3:35 PM Alan Gates <[email protected]> wrote:
>
> > Have you looked at the insert only ACID tables in Hive 3 (
> > https://issues.apache.org/jira/browse/HIVE-14535 )?  These were designed
> > specifically with the cloud in mind, since the way Hive traditionally adds
> > new data doesn't work well in the cloud.  And they do not require ORC, they
> > work with any file format.
> >
> > Alan.
> >
> > On Wed, Apr 24, 2019 at 12:04 PM Thai Bui <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > Hive 3 has brought significant changes to the community with the support
> > > for ACID tables as default managed tables. With ACID tables, we can use
> > > features such as materialized views, query result caching for BI tools and
> > > more. But for non-ACID tables, such as external tables, Hive doesn't
> > > support any of these advanced features, which makes a majority of
> > > cloud-native users like me sad :(.
> > >
> > > I propose we should support a more limited version of read-only external
> > > tables such that materialized views and query result caching would work.
> > > For example:
> > >
> > > CREATE EXTERNAL TABLE table_name (..) STORED AS ORC
> > > LOCATION 's3://some-bucket/some-dir'
> > > TBLPROPERTIES ('read-only'='true');
> > >
> > > In such tables, any data modification operations such as INSERT and UPDATE
> > > would fail, while DDL operations that "add" or "remove" partitions, such as
> > > "ALTER TABLE ... ADD PARTITION", would succeed. This would make it possible
> > > for Hive to invalidate the cache and materialized views even when the
> > > table is an external table.
> > >
> > > Let me know what you think, and maybe I can start writing a wiki
> > > document describing the approach in greater detail.
> > >
> > > Thanks,
> > > Thai
> > >
> >
>
>
> --
> Thai
>
