This one turned out to be another problem with my app configuration, not with
Spark. The compute task was dependent on the local filesystem, and config
errors on 8 of the 10 nodes made them fail early. The Spark wrapper was
not checking the process exit value, so it appeared as if they were
producing output.
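For what it's worth, the missing check was essentially the following (a sketch only; the command and helper name are placeholders, not the real wrapper):

    import scala.sys.process._

    // Run the external compute task and fail the Spark task if it exits
    // non-zero, instead of silently treating an early exit as success.
    def runExternal(cmd: Seq[String]): Unit = {
      val exitCode = Process(cmd).!   // blocks until the process exits
      if (exitCode != 0)
        throw new RuntimeException(s"external process failed with exit code $exitCode")
    }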
Is data being cached? It might be that those two nodes started first and did
the first pass of the data, so it's all on them. It's kind of ugly but you can
add a Thread.sleep when your program starts to wait for nodes to come up.
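Something along these lines (rough sketch; the 60-second figure, expectedExecutors, and the getExecutorMemoryStatus polling are just illustrative):

    // Crude workaround: give YARN time to register executors before the
    // first job, so the initial pass isn't scheduled on only the one or
    // two executors that came up first. The 60s figure is arbitrary.
    Thread.sleep(60 * 1000L)

    // A slightly less ugly variant: poll until enough executors have
    // registered (sc is the SparkContext; the driver counts as one entry).
    while (sc.getExecutorMemoryStatus.size < expectedExecutors + 1) {
      Thread.sleep(1000L)
    }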
Also, have you checked the application web UI at http://<driver>:4040 while the program is running?
OK, I did figure this out. I was running the app (avocado) using
spark-submit, when it was actually designed to take command line arguments
to connect to a Spark cluster. Since I didn't provide any such arguments, it
started a nested local Spark cluster *inside* the YARN Spark executor, and so
of course all of the work ran on a single node.
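The general way to avoid this nested-cluster trap is to let spark-submit supply the master rather than having the app default to a local one; roughly (a sketch, flag values illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Don't call setMaster("local[*]") in the app itself; leave the master
    // to spark-submit, otherwise the job spins up a nested local "cluster"
    // inside a single executor and every task runs on that one node.
    val conf = new SparkConf().setAppName("my-compute-job")
    val sc   = new SparkContext(conf)

    // submitted with something like:
    //   spark-submit --master yarn-client --num-executors 10 --class MyApp my-app.jar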
I also tried increasing --num-executors to numNodes * coresPerNode and using
coalesce(numNodes*10,true), and it still ran all the tasks on one node. It
seems like it is placing all the executors on one node (though not always
the same node, which indicates it is aware of more than one!). I'm using ...
Hi Matei-
Changing to coalesce(numNodes, true) still runs all partitions on a single
node, which I verified by printing the hostname before I exec the external
process.
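The check itself is trivial; something like this (sketch, with the real work elided):

    import java.net.InetAddress

    // Log which host each partition actually runs on before invoking the
    // external process, to see whether work is spread across the cluster.
    rdd.foreachPartition { records =>
      val host = InetAddress.getLocalHost.getHostName
      println(s"partition running on $host")
      // ... exec the external process on this partition's data ...
    }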
Hi Ravi,
I have seen a similar issue before. You can try to set
fs.hdfs.impl.disable.cache to true in your Hadoop configuration. For
example, if your Hadoop Configuration object is hadoopConf, you can use
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)
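If you are setting it from inside a Spark program, one convenient place is the SparkContext's Hadoop configuration, e.g. (assuming sc is your SparkContext):

    // Disable the HDFS FileSystem cache so each task gets a fresh
    // FileSystem instance instead of a shared cached one.
    sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)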
Let me know if that helps.
Best,
Matei - I tried using coalesce(numNodes, true), but it then seemed to run too
few SNAP tasks - only 2 or 3 when I had specified 46. The job failed,
perhaps for unrelated reasons, with some odd exceptions in the log (at the
end of this message). But I really don't want to force data movement between nodes ...
Depending on how your C++ program is designed, maybe you can feed the data
from multiple partitions into the same process? Getting the results back
might be tricky. But that may be the only way to guarantee you're only
using one invocation per node.
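Roughly along these lines (only a sketch; the program path is a placeholder, and getting structured results back depends on what the C++ program writes to stdout):

    import java.io.ByteArrayInputStream
    import scala.sys.process._

    // One external process per partition: stream the partition's records
    // to the program's stdin and return its stdout split into lines.
    // Combined with coalesce(numNodes, shuffle = true) this gives one
    // invocation per node.
    val results = rdd.mapPartitions { records =>
      val input  = new ByteArrayInputStream(records.mkString("\n").getBytes("UTF-8"))
      val output = ("/path/to/cpp-program" #< input).!!   // throws if exit code != 0
      output.split("\n").iterator
    }

RDD.pipe does much the same thing when plain line-oriented stdin/stdout is enough.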
On Mon, Jul 14, 2014 at 5:12 PM, Matei Zaharia wrote:
I think coalesce with shuffle=true will force it to have one task per node.
Without that, it might be that due to data locality it decides to launch
multiple ones on the same node even though the total # of tasks is equal to the
# of nodes.
If this is the *only* thing you run on the cluster, you ...
I don't have a solution for you (sorry), but do note that
rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you
set shuffle=true then it should repartition and redistribute the data. But
it uses the hash partitioner according to the ScalaDoc - I don't know of
any way to supply a custom partitioner.
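For reference, the difference between the two forms (sketch; rdd and numNodes stand in for your own values):

    // coalesce without a shuffle just merges partitions in place, keeping
    // the data on the nodes that already hold it:
    val merged = rdd.coalesce(numNodes)

    // with shuffle = true the data is redistributed across the cluster
    // (hash-partitioned); repartition(n) is shorthand for exactly this:
    val redistributed = rdd.coalesce(numNodes, shuffle = true)
    val sameThing     = rdd.repartition(numNodes)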