Hi, I am using PySpark and trying to support both Spark 1.0.2 and 1.1.0 with my app, which will run in yarn-client mode. However, when I use 'map' to run a Python lambda function over an RDD, the lambda appears to run on different machines depending on the Spark version, and this is causing problems.
In both cases, I am using a Hadoop cluster that runs Linux on all of its nodes, and I am submitting my jobs from a machine running Mac OS X 10.9. As a reproducer, here is my script:

    import platform
    print sc.parallelize([1]).map(lambda x: platform.system()).take(1)[0]

The answer in Spark 1.1.0: 'Linux'
The answer in Spark 1.0.2: 'Darwin'

In other experiments I changed the size of the list that gets parallelized, thinking maybe 1.0.2 only runs jobs on the driver node when they are small enough, but I got the same answer even with 1 million numbers.

This is a troubling difference. I would expect every function run over an RDD to execute on my worker nodes in the Hadoop cluster, but that is clearly not the case in 1.0.2. Why does this difference exist? How can I accurately detect which jobs will run where?

Thank you,
Evan
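
P.S. In case it helps, here is the rough probe I have been using to see which hosts the lambdas actually land on. Comparing executor hostnames against the driver's hostname with socket.gethostname() is just my own approach, not something from the docs:

    import socket

    # hostnames seen inside the lambdas (executors, or the driver if the
    # work somehow runs locally)
    print sc.parallelize(range(8), 4).map(lambda x: socket.gethostname()).distinct().collect()

    # hostname of the driver, for comparison
    print socket.gethostname()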