Hi, I'm running Spark 0.9.1 on a Hadoop cluster (CDH 4.2.1) with YARN.
I have a job that performs a few transformations on a given file and then joins it with another file. The job itself finishes successfully, but some tasks fail and then succeed when they are rerun.

During development I've been experimenting with different settings and currently have the following in the code:

- additional Hadoop config:
    "fs.hdfs.impl.disable.cache", "true"
- Spark config set on the SparkContext:
    .set("spark.test.disableBlockManagerHeartBeat", "true")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", "1000")
    .set("spark.shuffle.netty.connect.timeout", "300000")
    .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")
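In case the context helps, this is roughly how the settings are wired up. The app name, input paths and the key extraction are placeholders; the real transformations are more involved, and the Hadoop setting may equally be applied to a Configuration passed to the input directly:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits needed for join() in 0.9.x

object RecoJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("reco-job") // placeholder name
      .set("spark.test.disableBlockManagerHeartBeat", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.default.parallelism", "1000")
      .set("spark.shuffle.netty.connect.timeout", "300000")
      .set("spark.storage.blockManagerSlaveTimeoutMs", "300000")

    val sc = new SparkContext(conf)
    // the additional Hadoop setting, applied to the Hadoop configuration the job uses
    sc.hadoopConfiguration.set("fs.hdfs.impl.disable.cache", "true")

    // rough shape of the job -- paths and keys are placeholders
    val left  = sc.textFile("hdfs:///path/to/input").map(l => (l.split("\t")(0), l))
    val right = sc.textFile("hdfs:///path/to/other").map(l => (l.split("\t")(0), l))
    val joined = left.join(right) // the shuffle for this stage is where the failed tasks show up

    joined.count()
    sc.stop()
  }
}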
When I look into the logs I see lots of error messages:

- This one looks like a problem with HA, but I checked the namenodes while the job was running and there was no failover between the active and standby namenode:

14/05/14 15:25:44 ERROR security.UserGroupInformation: PriviledgedActionException as:hc_client_reco_dev (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
14/05/14 15:25:44 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
14/05/14 15:25:44 ERROR security.UserGroupInformation: PriviledgedActionException as:hc_client_reco_dev (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby

- There are also multiple exceptions logged as INFO; I don't know if this is serious:

14/05/14 15:30:06 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/05/14 15:30:06 INFO network.ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2c34bc84
java.nio.channels.CancelledKeyException
        at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:341)
        at org.apache.spark.network.ConnectionManager$$anon$3.run(ConnectionManager.scala:98)

- I also see a few of these, which seems strange:

14/05/14 15:26:45 ERROR executor.Executor: Exception in task ID 2081
java.io.FileNotFoundException: /data/storage/1/yarn/local/usercache/hc_client_reco_dev/appcache/application_1398268932983_1221792/spark-local-20140514152006-9c62/38/shuffle_5_121_395 (No such file or directory)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:87)
        at org.apache.spark.storage.DiskStore.getValues(DiskStore.scala:105)
        at org.apache.spark.storage.BlockManager.getLocalFromDisk(BlockManager.scala:265)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:205)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator$$anonfun$getLocalBlocks$1.apply(BlockFetcherIterator.scala:204)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.getLocalBlocks(BlockFetcherIterator.scala:204)
        at org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.initialize(BlockFetcherIterator.scala:235)
        at org.apache.spark.storage.BlockManager.getMultiple(BlockManager.scala:452)
        at org.apache.spark.BlockStoreShuffleFetcher.fetch(BlockStoreShuffleFetcher.scala:77)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:125)
        at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:115)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

Could someone suggest any solutions to these?

Regards,
Marcin