In my answer I assumed you run your program with the "pyspark" command (e.g. "pyspark mymainscript.py"; pyspark should be on your path). In this case the workflow is as follows:
1. You create a SparkConf object that simply contains your app's options.
2. You create a SparkContext, which initializes your application. At this point the application connects to the master and asks for resources.
3. You modify the SparkContext object to include everything you want to make available to mappers on other hosts, e.g. other "*.py" files (via "sc.addPyFile").
4. You create an RDD (e.g. with "sc.textFile") and run the actual commands ("map", "filter", etc.). The SparkContext knows about your additional files, so these commands are aware of your library code.

(A minimal driver sketch of these four steps is at the end of this message, below the quoted question.)

So, yes, in this setup you need to create "sc" (the SparkContext object) beforehand and make the "*.py" files available on the application's host. With the pyspark shell you already have an "sc" object initialized for you (try running "pyspark" and typing "sc" + Enter - the shell will print the Spark context details). You can also use spark-submit [1], which will initialize the SparkContext from command-line options.

But essentially the idea is always the same: there is a driver application running on one host that creates the SparkContext, collects dependencies, controls the program flow, etc., and there are workers - applications on the slave hosts - that use the created SparkContext and all serialized data to perform the driver's commands. The driver should know about everything and let the workers know what they need to know (e.g. your library code).

[1]: http://spark.apache.org/docs/latest/submitting-applications.html

On Thu, Jun 5, 2014 at 8:10 PM, mrm <ma...@skimlinks.com> wrote:
> Hi Andrei,
>
> Thank you for your help! Just to make sure I understand, when I run this
> command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged
> into the master node and have my python files somewhere, is that correct?
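
P.S. For reference, here is a minimal driver sketch of the four steps above. The file names ("mymainscript.py", "mylib.py"), the master URL and the function "mylib.parse" are placeholders of my own, not something from this thread - substitute your own paths, cluster address and library code:

    # mymainscript.py -- minimal PySpark driver sketch (all paths/URLs are placeholders)
    from pyspark import SparkConf, SparkContext

    # 1. SparkConf simply holds the application's options.
    conf = SparkConf().setAppName("MyApp").setMaster("spark://master:7077")

    # 2. SparkContext initializes the application and asks the master for resources.
    sc = SparkContext(conf=conf)

    # 3. Ship library code so workers can import it inside map/filter functions.
    sc.addPyFile("/path/to/mylib.py")

    def parse_line(line):
        # Imported inside the function so it also resolves on workers,
        # where mylib.py was shipped via sc.addPyFile above.
        import mylib                  # hypothetical library module
        return mylib.parse(line)      # assumed function in that module

    # 4. Create an RDD and run the actual commands.
    lines = sc.textFile("hdfs:///path/to/input.txt")
    result = lines.map(parse_line).filter(lambda rec: rec is not None).count()
    print(result)

    sc.stop()

You would then run it with "pyspark mymainscript.py", or hand it to spark-submit ("spark-submit mymainscript.py"), as described in [1].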