Greetings! We're reading input files with newAPIHadoopFile, configured with a multiline split. Everything is fine except for https://issues.apache.org/jira/browse/MAPREDUCE-6549. The issue appears to be fixed, but only as of Hadoop 2.7.2, which means we have to download the Hadoop-free Spark build and provide a custom Hadoop version ourselves. We currently use spark-1.6.1.
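For context, this is roughly how we wire the "without Hadoop" Spark build to the custom Hadoop, following the documented SPARK_DIST_CLASSPATH mechanism; the Hadoop install path below is just an example, not our real layout:

```shell
# conf/spark-env.sh -- sketch of pointing a Hadoop-free Spark build at a
# custom Hadoop 2.7.2 install; /opt/hadoop-2.7.2 is an example path.
export HADOOP_HOME=/opt/hadoop-2.7.2
# Put the custom Hadoop's jars on Spark's classpath via `hadoop classpath`.
export SPARK_DIST_CLASSPATH=$("${HADOOP_HOME}/bin/hadoop" classpath)
```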
That mostly works: there is documentation on how to configure it, and Spark starts, but when I actually use it I get a nasty exception saying Snappy cannot be initialized. I tried a few things - updating the Snappy version inside Hadoop, packaging snappy-java into my own application jar - but the only thing that works is literally copying the snappy-java.jar classes into spark-assembly-1.6.1-hadoop2.2.0.jar. That works for now, but I dislike this approach, because I simply cannot know what else will break tomorrow. It looks like I could just turn Snappy off, but I want to keep it: compressing the data that gets shuffled and stored around makes sense to me. Could you suggest any way besides copying these classes into the assembled Spark jar?

The Snappy exception:

Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 69, icomputer.petersburg.epam.com): java.io.IOException: java.lang.reflect.InvocationTargetException
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.GeneratedConstructorAccessor9.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
	at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
	at org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
	at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
	at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
	... 11 more
Caused by: java.lang.IllegalArgumentException: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
	at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:171)
	at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version$lzycompute(CompressionCodec.scala:168)
	at org.apache.spark.io.SnappyCompressionCodec$.org$apache$spark$io$SnappyCompressionCodec$$version(CompressionCodec.scala:168)
	at org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:152)
	... 19 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
	at org.apache.spark.io.SnappyCompressionCodec$.liftedTree1$1(CompressionCodec.scala:169)
	... 22 more

--
Be well!
Jean Morozov