Hello all,

I am facing a FileNotFoundException for a shuffle index file when running a job with large data. The same job runs fine with smaller datasets. These are my cluster specifications:

No. of nodes - 19
Total cores - 380
Memory per executor - 32G
Spark version - 1.6 (MapR distribution)
spark.shuffle.service.enabled - false

I am running the job with 28G of memory, 50 executors, and 1 core per executor. The job is failing at a stage with a DataFrame explode, where each row gets multiplied into 6 rows (a minimal sketch of this step is included below). Here are the exception details:

    Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /tmp/hadoop-mapr/nm-local-dir/usercache/sshinde/appcache/application_1480622725467_0071/blockmgr-3b2051f5-81c8-40a5-a332-9d32b4586a5d/38/shuffle_14_229_0.index (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:191)
        at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:291)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
        at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

I tried the configurations below, but nothing worked out (they are pulled together in the sketch after this list):

    conf.set("spark.io.compression.codec", "lz4")
    conf.set("spark.network.timeout", "1000s")
    conf.set("spark.sql.shuffle.partitions", "2500")

In addition, spark.yarn.executor.memoryOverhead should already be high given the 32g of executor memory (10% of 32g), and I increased the number of partitions up to 15000.
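For reference, here is everything I tried in one place, as a minimal sketch rather than my actual driver code. The app name is illustrative, and the explicit memoryOverhead value (3277, roughly 10% of 32g) is my own arithmetic, since spark.yarn.executor.memoryOverhead is given in megabytes:

    import org.apache.spark.SparkConf

    // Consolidated configuration overrides (sketch, not the real job).
    val conf = new SparkConf()
      .setAppName("large-data-job") // illustrative name
      .set("spark.io.compression.codec", "lz4")
      .set("spark.network.timeout", "1000s")
      .set("spark.sql.shuffle.partitions", "2500") // also tried values up to 15000
      .set("spark.yarn.executor.memoryOverhead", "3277") // ~10% of 32g, in MB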
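To make the failing step concrete, here is a minimal, self-contained sketch of the explode stage. The DataFrame, column names, and the 6-element arrays are illustrative, not my actual schema; the point is just that every input row becomes 6 output rows, which inflates the data shuffled at this stage:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.explode

    object ExplodeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("explode-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Illustrative data: one 6-element array per row.
        val df = sc.parallelize(Seq(
          (1, Seq("a", "b", "c", "d", "e", "f")),
          (2, Seq("u", "v", "w", "x", "y", "z"))
        )).toDF("id", "items")

        // Each input row is multiplied into 6 rows, one per array element.
        val exploded = df.select($"id", explode($"items").as("item"))
        exploded.show()
      }
    }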
I checked the YARN logs briefly and nothing stands out apart from the above exception. Please let me know if there is something I am missing, or if there are alternatives to make large-data jobs run.

Thank you.

Thanks,
Swapnil