Is data being cached? It might be that those two nodes started first and did the first pass over the data, so it's all on them. It's kind of ugly, but you can add a Thread.sleep when your program starts to wait for the other nodes to come up.
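A minimal sketch of that workaround, assuming a standalone Scala driver (the object name, input path, and sleep duration below are hypothetical placeholders, not part of the original message):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch: pause after startup so all executors register
    // before the first action; otherwise cached partitions can pile up on
    // whichever executors happened to come up first.
    object WarmupExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("warmup-example"))

        // Crude but effective: give the rest of the cluster time to come up.
        Thread.sleep(30000) // 30 seconds; tune to your cluster's startup time

        val reads = sc.textFile("hdfs:///data/reads") // hypothetical input path
        reads.cache()
        reads.count() // the first action materializes the cache across executors
      }
    }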
Also, have you checked the application web UI at http://<driver node>:4040 while the app is running? It shows where each task ran and where each partition of data is, which might reveal that, e.g., some tasks take much longer than others due to data skew, or stuff like that.

Matei

On July 29, 2014 at 10:13:14 AM, rpandya (r...@iecommerce.com) wrote:

OK, I did figure this out. I was running the app (avocado) using spark-submit, when it was actually designed to take command-line arguments to connect to a Spark cluster. Since I didn't provide any such arguments, it started a nested local Spark cluster *inside* the YARN Spark executor, and so of course everything ran on one node. If I spin up a Spark cluster manually and provide the Spark master URI to avocado, it works fine.

Now I've tried running a reasonable-sized job (400GB of data on 10 HDFS/Spark nodes), and the partitioning is strange. Eight nodes get almost nothing, and the other two nodes each get half the work. This happens whether I use coalesce with shuffle=true or false before the work stage. (Though if I use shuffle=true, it creates 3000 tasks to do the shuffle, and still ends up with this skewed distribution!) Any suggestions on how to figure out what's going on?

Thanks,

Ravi

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-compute-intensive-tasks-tp9643p10868.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
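For reference, a minimal sketch of the setup Ravi describes, assuming a Scala driver connecting to a standalone cluster (the master URI, app name, input path, and partition count are hypothetical placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point the app at the cluster's master explicitly, rather than letting
    // it fall back to a nested local master inside the YARN executor.
    val conf = new SparkConf()
      .setAppName("avocado-job")              // hypothetical app name
      .setMaster("spark://master-host:7077")  // hypothetical master URI
    val sc = new SparkContext(conf)

    val input = sc.textFile("hdfs:///data/reads") // hypothetical input path

    // coalesce with shuffle = true triggers a full shuffle, which spreads the
    // data across all partitions; shuffle = false only merges existing
    // partitions and can preserve a skewed placement.
    val balanced = input.coalesce(100, shuffle = true) // partition count is a guess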