Hi,

I am using PySpark and trying to support both Spark 1.0.2 and 1.1.0 in my
app, which will run in yarn-client mode.  However, when I use 'map' to run a
Python lambda function over an RDD, the function appears to execute on
different machines depending on the version, and this is causing problems.

In both cases, I am using a Hadoop cluster that runs Linux on all of its
nodes.  I am submitting my jobs from a machine running Mac OS X 10.9.  As a
reproducer, here is my script:

import platform
# Report the OS of whichever machine actually executes the lambda.
print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0:
'Linux'

The answer in Spark 1.0.2:
'Darwin'

In other experiments I changed the size of the list being parallelized,
thinking that maybe 1.0.2 just runs jobs on the driver node if they are small
enough.  I got the same answer even with 1 million numbers, roughly as shown
below.

This is a troubling difference.  I would expect every function mapped over an
RDD to be executed on the worker nodes in the Hadoop cluster, but this is
clearly not the case for 1.0.2.  Why does this difference exist?  How can I
reliably detect which machine a given job will run on?
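For reference, the kind of check I have in mind is a sketch like the one
below, which collects the hostname seen inside each partition (I have not
verified that it gives the same picture on both versions):

import socket
# Collect the distinct hostnames that execute the tasks, to tell whether the
# work ran on my Mac (the driver) or on the YARN worker nodes.
hosts = sc.parallelize(range(100), 4) \
          .mapPartitions(lambda it: [socket.gethostname()]) \
          .distinct() \
          .collect()
print hosts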

Thank you,

Evan



