Matei - I tried coalesce(numNodes, true), but it then seemed to run far too few SNAP tasks: only 2 or 3 when I had specified 46. The job failed, perhaps for unrelated reasons, with some odd exceptions in the log (pasted at the end of this message). But I really don't want to force data movement between nodes. The input data is in HDFS and should already be reasonably well balanced across the nodes; we've run this scenario with the plain "hadoop jar" runner and a custom input format jar that breaks the input into 8-line chunks (paired FASTQ). Ideally I'd like Spark to do the minimum data movement needed to balance the work, feeding each task mostly from data local to its node.
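For concreteness, something along these lines is what I have in mind on the Spark side. This is only a sketch: the paths, the node count, the runSnap wrapper, and the use of NLineInputFormat as a stand-in for our custom 8-line format jar are all placeholders, not our actual code.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object SnapOnSpark {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("snap-align"))

        // Stand-in for our custom format jar: NLineInputFormat makes 8-line
        // splits, i.e. one paired-FASTQ read pair (2 x 4 lines) per split.
        val conf = new Configuration(sc.hadoopConfiguration)
        conf.setInt("mapreduce.input.lineinputformat.linespermap", 8)

        val reads = sc.newAPIHadoopFile(
          "hdfs:///data/reads/*.fastq",   // placeholder input path
          classOf[NLineInputFormat],
          classOf[LongWritable],
          classOf[Text],
          conf)

        // shuffle = false merges parent partitions without a shuffle, grouping
        // them by locality where it can; shuffle = true evens out partition
        // sizes but repartitions the data over the network.
        val numNodes = 46
        val perNode = reads.coalesce(numNodes, shuffle = false)

        perNode
          .mapPartitions(chunk => runSnap(chunk))
          .saveAsTextFile("hdfs:///data/aligned")  // placeholder output path
      }

      // Placeholder: feed one partition's records to a SNAP process on the
      // local node and return its output lines.
      def runSnap(records: Iterator[(LongWritable, Text)]): Iterator[String] = ???
    }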
Daniel - that's a good thought: I could invoke a small stub for each task that talks to a single local daemon process over a socket, which would serialize all the tasks on a given machine. A rough sketch of what I'm picturing is at the end of this message, after the log excerpt.

Thanks,
Ravi

P.S. Log exceptions:

14/07/15 17:02:00 WARN yarn.ApplicationMaster: Unable to retrieve SparkContext in spite of waiting for 100000, maxNumTries = 10
Exception in thread "main" java.lang.NullPointerException
        at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:233)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:110)

...and later...

14/07/15 17:11:07 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
14/07/15 17:11:07 INFO yarn.ApplicationMaster: AppMaster received a signal.
14/07/15 17:11:07 WARN rdd.NewHadoopRDD: Exception in RecordReader.close()
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
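Here is roughly the shape of the per-task stub I'm picturing for Daniel's suggestion. The port, the "EOF" framing, and the assumption that the daemon handles one connection at a time are all made up; the real protocol would be whatever the daemon I'd still have to write defines.

    import java.io.{BufferedReader, InputStreamReader, PrintWriter}
    import java.net.Socket

    object SnapDaemonStub {
      // Each Spark task connects to a daemon already resident on its node.
      // If the daemon accepts one request at a time, tasks that land on the
      // same machine are effectively serialized through the single local
      // SNAP process, which is the behavior I'm after.
      def alignViaLocalDaemon(records: Iterator[String], port: Int = 9999): Iterator[String] = {
        val socket = new Socket("localhost", port)
        try {
          val out = new PrintWriter(socket.getOutputStream, true)
          val in  = new BufferedReader(new InputStreamReader(socket.getInputStream))
          records.foreach(line => out.println(line))  // stream this partition's FASTQ lines
          out.println("EOF")                          // mark the end of the request
          // Materialize the daemon's response before the socket is closed.
          Iterator.continually(in.readLine()).takeWhile(_ != null).toVector.iterator
        } finally {
          socket.close()
        }
      }
    }

On the driver side I'd then call something like perNode.mapPartitions(p => SnapDaemonStub.alignViaLocalDaemon(p.map(_._2.toString))), so each partition becomes one request to the daemon on whatever node the task runs on.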