Re: Memory & compute-intensive tasks

2014-08-04 Thread rpandya
This one turned out to be another problem with my app configuration, not with Spark. The compute task depended on the local filesystem, and config errors on 8 of the 10 nodes made it fail early there. The Spark wrapper was not checking the process exit value, so the failed tasks appeared to have succeeded.
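A minimal sketch of the exit-code check that was missing, assuming the external tool is launched with scala.sys.process (the wrapper, command path, and input path here are hypothetical):

    import scala.sys.process._

    // Hypothetical wrapper around the external compute task: `!` runs the
    // command, blocks until it exits, and returns the process exit code.
    val cmd = Seq("/usr/local/bin/compute-task", "--input", "/data/part-00000")
    val exitCode = cmd.!
    // Fail the Spark task loudly instead of treating every launch as a success.
    if (exitCode != 0)
      throw new RuntimeException(s"compute-task exited with code $exitCode")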

Re: Memory & compute-intensive tasks

2014-07-29 Thread Matei Zaharia
Is data being cached? It might be that those two nodes started first and did the first pass over the data, so it's all on them. It's kind of ugly, but you can add a Thread.sleep when your program starts to wait for nodes to come up. Also, have you checked the application web UI at http://<driver>:4040 while the job is running?
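A sketch of the wait-for-executors workaround Matei describes, assuming a plain SparkContext; the sleep length is an arbitrary guess:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("compute-job"))
    // Crude workaround: sleep before the first pass over the data so that
    // late-starting nodes can register and share in the caching.
    Thread.sleep(30000)  // 30 seconds; tune for your cluster
    // ... build RDDs and run the job here ...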

Re: Memory & compute-intensive tasks

2014-07-29 Thread rpandya
OK, I did figure this out. I was running the app (avocado) with spark-submit, when it was actually designed to take command-line arguments to connect to a Spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor, and so of course all the work ran on a single node.
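One way to avoid the nested-cluster mistake is to leave the master out of the application code entirely and let spark-submit supply it; a sketch (the app name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // No setMaster() here: when launched via spark-submit, the --master flag
    // (e.g. yarn) supplies the master, so the job runs on the real cluster
    // instead of starting a nested local one inside an executor.
    val conf = new SparkConf().setAppName("avocado-run")
    val sc = new SparkContext(conf)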

Re: Memory & compute-intensive tasks

2014-07-18 Thread rpandya
I also tried increasing --num-executors to numNodes * coresPerNode and using coalesce(numNodes*10, true), and it still ran all the tasks on one node. It seems to be placing all the executors on one node (though not always the same node, which indicates it is aware of more than one!). I'm using…
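For reference, the --num-executors flag being discussed goes on the submit command; a hypothetical invocation (class name, jar, and counts are made up):

    # e.g. 10 nodes * 4 cores per node
    spark-submit --master yarn --num-executors 40 --class MyApp myapp.jar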

Re: Memory & compute-intensive tasks

2014-07-18 Thread rpandya
Hi Matei - Changing to coalesce(numNodes, true) still runs all partitions on a single node, which I verified by printing the hostname before I exec the external process.
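The hostname check mentioned above can be done with something like this sketch, assuming `rdd` is the input RDD and `numNodes` is the cluster size (both hypothetical here):

    import java.net.InetAddress

    val placed = rdd.coalesce(numNodes, shuffle = true)
    placed.foreachPartition { _ =>
      // Printed to each executor's stdout, visible in the container logs.
      println(s"partition running on ${InetAddress.getLocalHost.getHostName}")
    }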

Re: Memory & compute-intensive tasks

2014-07-16 Thread Liquan Pei
Hi Ravi, I have seen a similar issue before. You can try setting fs.hdfs.impl.disable.cache to true in your Hadoop configuration. For example, if your Hadoop configuration object is hadoopConf, you can use hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true). Let me know if that helps. Best, Liquan
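In code, Liquan's suggestion looks roughly like this (assuming a fresh Hadoop Configuration; you could equally apply it to sc.hadoopConfiguration):

    import org.apache.hadoop.conf.Configuration

    val hadoopConf = new Configuration()
    // Give each caller its own FileSystem instance for hdfs:// URIs instead
    // of the shared cached one, which can be closed out from under other tasks.
    hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)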

Re: Memory & compute-intensive tasks

2014-07-16 Thread rpandya
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too few SNAP tasks - only 2 or 3 when I had specified 46. The job failed, perhaps for unrelated reasons, with some odd exceptions in the log (at the end of this message). But I really don't want to force data movement between nodes…

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
Depending on how your C++ program is designed, maybe you can feed the data from multiple partitions into the same process? Getting the results back might be tricky, but that may be the only way to guarantee you're using only one invocation per node.
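One hedged sketch of Daniel's idea: coalesce down to one partition per node, then stream each whole partition through a single external process with RDD.pipe. Here `rdd` is assumed to be an RDD[String] of the records, `numNodes` the cluster size, and the binary path is hypothetical:

    // One partition per node (shuffle = true forces redistribution), then
    // pipe() writes each record of the partition to the process's stdin,
    // one line at a time, and yields its stdout lines as a new RDD.
    val results = rdd
      .coalesce(numNodes, shuffle = true)
      .pipe("/usr/local/bin/cpp-worker")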

Re: Memory & compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple tasks on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster, you…

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
I don't have a solution for you (sorry), but do note that rdd.coalesce(numNodes) keeps data on the nodes where it already is. If you set shuffle=true then it should repartition and redistribute the data, but it uses the hash partitioner according to the ScalaDoc - I don't know of any way to supply a custom partitioner.
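A sketch of the distinction, with hypothetical `rdd`, `pairRdd`, and `numNodes` names; partitionBy on a key/value RDD is the usual way to pick a partitioner explicitly:

    import org.apache.spark.HashPartitioner

    // coalesce(numNodes) only merges existing partitions in place;
    // repartition(numNodes) == coalesce(numNodes, shuffle = true) and
    // actually redistributes the data across the cluster.
    val redistributed = rdd.repartition(numNodes)

    // On a key/value RDD the partitioner can at least be chosen explicitly:
    val byKey = pairRdd.partitionBy(new HashPartitioner(numNodes))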