I am currently using Spark 1.1.0, compiled against Hadoop 2.3. Our cluster runs
CDH 5.1.2, which includes Hive 0.12. I have two external Hive tables that point
to Parquet data (compressed with Snappy); the data was converted over from Avro,
if that matters.
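For context, the table definitions look roughly like the following. This is a
paraphrased sketch, not the exact DDL: the column list is abbreviated, I am
assuming coll_def_id is a partition column since I filter on it below, the
STORED AS clause may have been spelled out differently, and the location is a
placeholder path.

hive> CREATE EXTERNAL TABLE pqt_segcust_snappy (
          customer_id STRING
          -- remaining data columns elided, 21 in total
      )
      PARTITIONED BY (coll_def_id STRING)
      STORED AS PARQUET
      LOCATION '/path/to/pqt_segcust_snappy';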
I am trying to perform a join with these two Hive tables, but am encountering
an exception. In a nutshell, I launch a spark shell, create my HiveContext
(pointing to the correct metastore on our cluster), and then proceed to do the
following:
scala> import org.apache.spark.sql.hive.HiveContext
scala> val hc = new HiveContext(sc)
scala> val txn = hc.sql("select * from pqt_rdt_snappy where transdate >= 1325376000000 and transdate <= 1340063999999")
scala> val segcust = hc.sql("select * from pqt_segcust_snappy where coll_def_id='abcd'")
scala> txn.registerAsTable("segTxns")
scala> segcust.registerAsTable("segCusts")
scala> val joined = hc.sql("select t.transid, c.customer_id from segTxns t join segCusts c on t.customerid=c.customer_id")
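In case the plan itself is useful, here is roughly how I have been peeking at
what Spark intends to execute for the join, from the same shell. This is just a
diagnostic sketch (queryExecution is a developer-level API in 1.1.0, if I am
reading it right), and I have omitted the output:

scala> joined.queryExecution.executedPlan   // Spark SQL physical plan for the join
scala> joined.toDebugString                 // lineage of the underlying RDD, for completeness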
Straightforward enough, but I get the following exception:
14/10/13 14:37:12 ERROR Executor: Exception in task 1.0 in stage 18.0 (TID 51)
java.lang.IndexOutOfBoundsException: Index: 21, Size: 21
        at java.util.ArrayList.rangeCheck(ArrayList.java:635)
        at java.util.ArrayList.get(ArrayList.java:411)
        at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:94)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:206)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
        at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:67)
        at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:51)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:197)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:188)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:97)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
My table pqt_segcust_snappy has 21 columns and two partitions defined, which
seems to line up with the Index: 21, Size: 21 in the exception. Does this error
look familiar to anyone? Could my usage of Spark SQL with Hive be incorrect, or
is support for Hive/Parquet/partitioning still buggy at this point in Spark
1.1.0?
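For reference, this is roughly how I double-checked the column count, assuming
I am reading the 1.1.0 API correctly (SchemaRDD exposes a schema accessor);
output omitted:

scala> hc.sql("describe pqt_segcust_snappy").collect().foreach(println)   // columns and partition info from the metastore
scala> segcust.schema.fields.size   // column count Spark sees for the select * above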
Thanks,
-Terry