Hi,

I've just implemented a pipeline to synchronize data between MySQL and Hive (transactional + bucketed tables) on an HDP cluster. I used ORC files, but without ACID properties. We then created external tables over the HDFS directories that contain these delta ORC files, and MERGE INTO queries are executed periodically to merge the data into the Hive target table. It works pretty well, but we want to avoid these MERGE queries.

It's not entirely clear to me yet, but thanks for your links; I'm going to delve into that point. To summarize: if I want to avoid these queries, I have to get the valid transaction list for each table from the Hive Metastore and then read all the related files. Is that correct?
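For reference, here is the kind of thing I have in mind, pieced together from the Hive source. This is only a sketch: the table path is made up, and I believe the getAcidState signature changed in Hive 3 to work with per-table write ids rather than global transaction ids, so it may not compile against every version.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.common.ValidTxnList;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.ql.io.AcidUtils;

    public class ListAcidFiles {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            try {
                // 1. Ask the metastore which transactions are visible
                //    right now (the snapshot an outside reader must honor).
                ValidTxnList validTxns = client.getValidTxns();

                // 2. Resolve the base + delta directories that belong to
                //    that snapshot. (Example path -- not a real table.)
                Path table = new Path("/warehouse/mydb.db/mytable");
                AcidUtils.Directory dir =
                    AcidUtils.getAcidState(table, conf, validTxns);

                System.out.println("base:  " + dir.getBaseDirectory());
                for (AcidUtils.ParsedDelta delta : dir.getCurrentDirectories()) {
                    // 3. These are the delta files a reader would have to
                    //    merge, filtering rows by the same snapshot.
                    System.out.println("delta: " + delta.getPath());
                }
            } finally {
                client.close();
            }
        }
    }

If I read the code correctly, this is roughly what OrcInputFormat does internally before merging base and delta rows for Hive's own readers.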
Thanks,
David

On Sun, Mar 10, 2019 at 01:45, Nicolas Paris <nicolas.pa...@riseup.net> wrote:

> Thanks Alan for the clarifications.
>
> Hive has made such improvements it has lost its old friends in the
> process. Hope one day all the friends speak together again: Pig, Spark,
> Presto reading/writing ACID together.
>
> On Sat, Mar 09, 2019 at 02:23:48PM -0800, Alan Gates wrote:
> > There's only been one significant change in ACID that requires
> > different implementations. In ACID v1, delta files contained inserts,
> > updates, and deletes. In ACID v2, delta files are split so that
> > inserts are placed in one file, deletes in another, and updates are an
> > insert plus a delete. This change was put into Hive 3, so you have to
> > upgrade your ACID tables when upgrading from Hive 2 to 3.
> >
> > You can see info on ACID v1 at
> > https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
> >
> > You can get a start understanding ACID v2 with
> > https://issues.apache.org/jira/browse/HIVE-14035, which has design
> > documents. I don't guarantee the implementation completely matches
> > the design, but you can at least get an idea of the intent and follow
> > the JIRA stream from there to see what was implemented.
> >
> > Alan.
> >
> > On Sat, Mar 9, 2019 at 3:25 AM Nicolas Paris <nicolas.pa...@riseup.net> wrote:
> >
> >     Hi,
> >
> >     > The issue is that outside readers don't understand which records
> >     > in the delta files are valid and which are not. Theoretically
> >     > all this is possible, as outside clients could get the valid
> >     > transaction list from the metastore and then read the files, but
> >     > no one has done this work.
> >
> >     I guess each Hive version (1, 2, 3) differs in how it manages
> >     delta files, doesn't it? This means Pig or Spark would need to
> >     implement three different ways of dealing with Hive.
> >
> >     Is there any documentation that would help a developer implement
> >     those specific connectors?
> >
> >     Thanks
> >
> >     On Wed, Mar 06, 2019 at 09:51:51AM -0800, Alan Gates wrote:
> >     > Pig is in the same place as Spark: the tables need to be
> >     > compacted first. The issue is that outside readers don't
> >     > understand which records in the delta files are valid and which
> >     > are not.
> >     >
> >     > Theoretically all this is possible, as outside clients could get
> >     > the valid transaction list from the metastore and then read the
> >     > files, but no one has done this work.
> >     >
> >     > Alan.
> >     >
> >     > On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta <abhila...@gmail.com> wrote:
> >     >
> >     >     Hi,
> >     >
> >     >     Do Hive ACID tables in Hive version 1.2 possess the
> >     >     capability of being read into Apache Pig using HCatLoader or
> >     >     Spark using SQLContext? For Spark, it seems it is only
> >     >     possible to read ACID tables if the table is fully
> >     >     compacted, i.e. no delta folders exist in any partition.
> >     >     Details in the following JIRA:
> >     >
> >     >     https://issues.apache.org/jira/browse/SPARK-15348
> >     >
> >     >     However, I wanted to know if it is supported at all in
> >     >     Apache Pig to read ACID tables in Hive.
> >
> > --
> > nicolas
>
> --
> nicolas
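PS: to check my understanding of the v1 vs v2 split Alan describes above, I picture the on-disk layout roughly like this (paths and transaction ids are made up for illustration):

    ACID v1 (Hive 1.x/2.x): a single kind of delta directory; each row
    carries an operation code (insert/update/delete):

        /warehouse/mydb.db/mytable/delta_0000005_0000005/bucket_00000

    ACID v2 (Hive 3.x): inserts and deletes are split into separate
    directories, and an update is written as a delete plus an insert:

        /warehouse/mydb.db/mytable/delta_0000005_0000005/bucket_00000
        /warehouse/mydb.db/mytable/delete_delta_0000005_0000005/bucket_00000

So a reader on a v2 table has to subtract the rows referenced in the delete_delta directories from the base and delta rows, using the valid transaction list as the cut-off. Please correct me if I've got that wrong.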