Re: Read Hive ACID tables in Spark or Pig

Nicolas Paris Sat, 09 Mar 2019 16:45:14 -0800

Thanks Alan for the clarifications.

Hive has improved so much that it has lost its old friends in the
process. I hope that one day all the friends can speak together again:
Pig, Spark, and Presto reading and writing ACID tables together.
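
To make the idea quoted below concrete (an outside reader takes the valid
transaction list and decides which delta records count), here is a rough
Python sketch of the ACID v2 layout as Alan describes it: inserts in one
place, deletes in another, and an update stored as a delete plus an insert.
This is not Hive's implementation. The record fields, the in-memory lists
standing in for delta files, and the function name visible_rows are all
assumptions made for illustration. A real reader would have to parse the
actual delta files and fetch the valid transaction list from the metastore.

```python
from typing import Iterable, List, NamedTuple, Set


class InsertEvent(NamedTuple):
    """Hypothetical stand-in for one record of an insert delta."""
    txn: int        # transaction that wrote this row
    bucket: int
    row_id: int
    payload: str


class DeleteEvent(NamedTuple):
    """Hypothetical stand-in for one record of a delete delta."""
    target_txn: int  # transaction that originally wrote the deleted row
    bucket: int
    row_id: int
    txn: int         # transaction that performed the delete


def visible_rows(inserts: Iterable[InsertEvent],
                 deletes: Iterable[DeleteEvent],
                 valid_txns: Set[int]) -> List[str]:
    """Return payloads a reader should see, given the valid transactions."""
    # Only deletes written by valid transactions take effect.
    deleted = {
        (d.target_txn, d.bucket, d.row_id)
        for d in deletes
        if d.txn in valid_txns
    }
    # Keep inserts from valid transactions that no valid delete removed.
    return [
        i.payload
        for i in inserts
        if i.txn in valid_txns
        and (i.txn, i.bucket, i.row_id) not in deleted
    ]


# Transaction 1 inserts rows "a" and "b"; transaction 2 updates "b" to "b2"
# (a delete of the old row plus an insert); transaction 3 inserts "c".
sample_inserts = [
    InsertEvent(1, 0, 0, "a"),
    InsertEvent(1, 0, 1, "b"),
    InsertEvent(2, 0, 0, "b2"),
    InsertEvent(3, 0, 0, "c"),
]
sample_deletes = [DeleteEvent(1, 0, 1, 2)]

# Treat transaction 3 as not valid (for example still open or aborted).
print(visible_rows(sample_inserts, sample_deletes, {1, 2}))
```

A reader that skips this filtering would return "b", "b2", and "c" together,
which is why the tables currently have to be compacted before Spark or Pig
can read them safely.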


On Sat, Mar 09, 2019 at 02:23:48PM -0800, Alan Gates wrote:
> There's only been one significant change in ACID that requires different
> implementations.  In ACID v1 delta files contained inserts, updates, and
> deletes.  In ACID v2 delta files are split so that inserts are placed in one
> file, deletes in another, and updates are an insert plus a delete.  This
> change was put into Hive 3, so you have to upgrade your ACID tables when
> upgrading from Hive 2 to 3.
> 
> You can see info on ACID v1 at
> https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
> 
> You can get a start on understanding ACID v2 with
> https://issues.apache.org/jira/browse/HIVE-14035, which has design
> documents.  I don't guarantee the implementation completely matches the
> design, but you can at least get an idea of the intent and follow the JIRA
> stream from there to see what was implemented.
> 
> Alan.
> 
> On Sat, Mar 9, 2019 at 3:25 AM Nicolas Paris <[email protected]> wrote:
> 
>     Hi,
> 
>     > The issue is that outside readers don't understand which records in
>     > the delta files are valid and which are not. Theoretically all this
>     > is possible, as outside clients could get the valid transaction list
>     > from the metastore and then read the files, but no one has done this
>     > work.
> 
>     I guess each Hive version (1, 2, 3) differs in how it manages delta
>     files, doesn't it? This means Pig or Spark would need to implement
>     three different ways of dealing with Hive.
> 
>     Is there any documentation that would help a developer implement
>     those specific connectors?
> 
>     Thanks
> 
> 
>     On Wed, Mar 06, 2019 at 09:51:51AM -0800, Alan Gates wrote:
>     > Pig is in the same place as Spark, that the tables need to be compacted
>     > first.  The issue is that outside readers don't understand which
>     > records in the delta files are valid and which are not.
>     >
>     > Theoretically all this is possible, as outside clients could get the
>     > valid transaction list from the metastore and then read the files, but
>     > no one has done this work.
>     >
>     > Alan.
>     >
>     > On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta <[email protected]> wrote:
>     >
>     >     Hi,
>     >
>     >     Can Hive ACID tables in Hive version 1.2 be read into Apache Pig
>     >     using HCatLoader, or into Spark using SQLContext?
>     >     For Spark, it seems it is only possible to read ACID tables if the
>     >     table is fully compacted, i.e. no delta folders exist in any
>     >     partition. Details are in the following JIRA:
>     >
>     >     https://issues.apache.org/jira/browse/SPARK-15348
>     >
>     >     However, I wanted to know whether reading Hive ACID tables is
>     >     supported at all in Apache Pig.
>     >
> 
>     --
>     nicolas
> 

-- 
nicolas
