In my answer I assumed you run your program with the "pyspark" command (e.g.
"pyspark mymainscript.py"; "pyspark" should be on your path). In that case
the workflow is as follows:

1. You create a SparkConf object that simply holds your app's options.
2. You create a SparkContext, which initializes your application. At this
point the application connects to the master and asks for resources.
3. You modify the SparkContext object to include everything you want to make
available to mappers on other hosts, e.g. other "*.py" files.
4. You create an RDD (e.g. with "sc.textFile") and run the actual commands
("map", "filter", etc.). The SparkContext knows about your additional files,
so these commands are aware of your library code.
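Put together, the steps above might look like this in a driver script (a
minimal sketch - the file paths, the master URL, and the helper module
"mylib" are hypothetical placeholders, and running it requires a working
Spark installation):

```python
# mymainscript.py - driver script, run with: pyspark mymainscript.py
from pyspark import SparkConf, SparkContext

# 1. SparkConf simply holds the app's options.
conf = SparkConf().setAppName("MyApp").setMaster("spark://master:7077")

# 2. SparkContext initializes the app: connects to the master, asks for resources.
sc = SparkContext(conf=conf)

# 3. Ship extra "*.py" files so mappers on other hosts can import them.
sc.addPyFile("/path/to/mylib.py")

import mylib  # now importable on the workers as well

# 4. Create an RDD and run the actual commands; workers see mylib too.
rdd = sc.textFile("hdfs:///data/input.txt")
result = rdd.map(mylib.parse_line).filter(lambda x: x is not None).collect()
```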

So, yes, in this setup you need to create "sc" (the SparkContext object)
beforehand and make the "*.py" files available on the application's host.

With the pyspark shell, the "sc" object is already initialized for you (try
running "pyspark" and typing "sc" + Enter - the shell will print the Spark
context details). You can also use spark-submit [1], which will initialize
the SparkContext from command-line options. But the idea is essentially
always the same: there is a driver application running on one host that
creates the SparkContext, collects dependencies, controls program flow,
etc., and there are workers - applications on slave hosts - that use the
created SparkContext and all the serialized data to perform the driver's
commands. The driver should know about everything and let the workers know
what they need to know (e.g. your library code).


[1]: http://spark.apache.org/docs/latest/submitting-applications.html
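For the spark-submit route, the equivalent invocation might look like this
(a sketch - the script name, library path, and master URL are placeholders):

```shell
# spark-submit creates the SparkContext from these command-line options;
# --py-files ships your library code to the workers.
spark-submit \
  --master spark://master:7077 \
  --py-files /path/to/mylib.py \
  mymainscript.py
```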





On Thu, Jun 5, 2014 at 8:10 PM, mrm <ma...@skimlinks.com> wrote:

> Hi Andrei,
>
> Thank you for your help! Just to make sure I understand, when I run this
> command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged
> into the master node and have my python files somewhere, is that correct?
>
