I do understand that Snappy as such is not splittable, but ORC files are. In ORC, the blocks are compressed with Snappy independently, so there should be no problem with it.
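The point above can be illustrated with a small sketch. This uses zlib from the Python standard library in place of Snappy (the codec differs, but the splittability argument is the same: blocks compressed independently can be read independently):

```python
import zlib

# ORC-style layout: each block is compressed independently, so a reader
# can start at any block boundary without seeing the earlier bytes.
# zlib stands in for Snappy here; the principle is identical.
blocks = [(b"rows for block %d " % i) * 100 for i in range(4)]
compressed_blocks = [zlib.compress(b) for b in blocks]

# A task assigned only block 2 can decompress it in isolation:
assert zlib.decompress(compressed_blocks[2]) == blocks[2]

# By contrast, one compressed stream over the whole file has no such
# entry points, so only a single task can read it:
whole_stream = zlib.compress(b"".join(blocks))
try:
    zlib.decompress(whole_stream[len(whole_stream) // 2:])
except zlib.error:
    print("cannot decompress starting mid-stream")
```

This is why the container format (ORC, with its internal block boundaries) matters more than whether the codec inside it is splittable on its own.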
Anyway, ZLIB (the default in both ORC and Parquet) is also not splittable, but it works perfectly fine.

2015-12-30 16:26 GMT+01:00 Chris Fregly <ch...@fregly.com>:

> Reminder that Snappy is not a splittable format.
>
> I've had success with Hive + LZF (splittable) and bzip2 (also splittable).
>
> Gzip is also not splittable, so you won't be utilizing your cluster to
> process this data in parallel, as only 1 task can read and process
> unsplittable data - versus many tasks spread across the cluster.
>
> On Wed, Dec 30, 2015 at 6:45 AM, Dawid Wysakowicz <
> wysakowicz.da...@gmail.com> wrote:
>
>> Hasn't anyone used Spark with ORC and Snappy compression?
>>
>> 2015-12-29 18:25 GMT+01:00 Dawid Wysakowicz <wysakowicz.da...@gmail.com>:
>>
>>> Hi,
>>>
>>> I have a table in Hive stored as ORC with compression = snappy. I try to
>>> execute a query on that table and it fails (I previously ran the same query
>>> on the table in ORC-ZLIB format and in Parquet, so it is not a matter of
>>> the query itself).
>>>
>>> I managed to execute this query with Hive on Tez on those tables.
>>>
>>> The exception I get is as follows:
>>>
>>> 15/12/29 17:16:46 WARN scheduler.DAGScheduler: Creating new stage failed due to exception - job: 3
>>> java.lang.RuntimeException: serious problem
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>>>     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>     at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.getPartitions(MapPartitionsWithPreparationRDD.scala:40)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>     at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
>>>     at org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>>>     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>>>     at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>>>     at scala.Option.getOrElse(Option.scala:120)
>>>     at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>>>     at org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:388)
>>>     at org.apache.spark.scheduler.DAGScheduler.getAncestorShuffleDependencies(DAGScheduler.scala:405)
>>>     at org.apache.spark.scheduler.DAGScheduler.registerShuffleDependencies(DAGScheduler.scala:370)
>>>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:253)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:354)
>>>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:351)
>>>     at scala.collection.immutable.List.foreach(List.scala:318)
>>>     at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:351)
>>>     at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:363)
>>>     at org.apache.spark.scheduler.DAGScheduler.getParentStagesAndId(DAGScheduler.scala:266)
>>>     at org.apache.spark.scheduler.DAGScheduler.newResultStage(DAGScheduler.scala:300)
>>>     at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:734)
>>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1466)
>>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>>>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>>>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> Caused by: java.util.concurrent.ExecutionException: java.lang.IndexOutOfBoundsException: Index: 0
>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1016)
>>>     ... 48 more
>>> Caused by: java.lang.IndexOutOfBoundsException: Index: 0
>>>     at java.util.Collections$EmptyList.get(Collections.java:4454)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:927)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:836)
>>>     at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:702)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>     at java.lang.Thread.run(Thread.java:745)
>>>
>>> I will be glad for any help on this matter.
>>>
>>> Regards
>>> Dawid Wysakowicz
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com