[ https://issues.apache.org/jira/browse/HIVE-15189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662845#comment-15662845 ]

Alan Gates commented on HIVE-15189:
-----------------------------------

This is by design.  Work needs to be done in Spark so that it understands the 
delta files (including which ones to read).  Note also that ACID tables can have 
multiple base files simultaneously (since ACID works by keeping multiple versions 
of the data and using the metadata to determine which version a given reader 
should see), and I assume this would confuse Spark readers as well.

You can work around the situation by forcing a major compaction before you read 
the data, using "ALTER TABLE ... COMPACT" (see 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionCompact).
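A minimal sketch of that workaround, assuming a hypothetical partitioned table 
named acid_table (compaction runs asynchronously in the metastore, so poll SHOW 
COMPACTIONS and wait for the request to finish before reading):

{noformat}
-- table name and partition spec are placeholders for illustration
ALTER TABLE acid_table PARTITION (ds='2016-11-10') COMPACT 'major';
-- check progress; read the data once the compaction is no longer initiated/working
SHOW COMPACTIONS;
{noformat}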

> No base file for ACID table
> ---------------------------
>
>                 Key: HIVE-15189
>                 URL: https://issues.apache.org/jira/browse/HIVE-15189
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 1.2.1
>         Environment: HDP 2.4, HDP2.5
>            Reporter: Benjamin BONNET
>
> Hi,
> When one creates a new ACID table and inserts data into it using INSERT INTO, 
> Hive does not write a 'base' file: it only creates a delta. That may lead to at 
> least two issues:
> # when you try to read it, you might get a 'serious problem' error like this:
> {noformat}
> java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
>         at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>         at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>         at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:635)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:64)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: delta_0000000_0000000 does not start with base_
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
>         ... 44 more
> {noformat}
> I do not always get this error, but when I do, I need to drop and recreate the 
> table.
> # Spark SQL does not see the data as long as there is no base file. So until a 
> compaction has occurred, no data can be read using Spark SQL. See 
> https://issues.apache.org/jira/browse/SPARK-16996
> I know Hive always creates a base file when you INSERT OVERWRITE, but you 
> cannot always use OVERWRITE instead of INTO: in my use case, I use a statement 
> that writes data into partitions that already exist and partitions that do not 
> exist yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
