All - would appreciate some insight regarding how to set PYSPARK_PYTHON
correctly.
I have created a virtual environment at the same path on all 3 of my
cluster hosts: 2 of them run slaves and one runs the master. I also
run an RPC server on the master host so that users from the office (the
cluster is hosted elsewhere) can submit work.
For the master and slaves, I created $SPARK_HOME/conf/spark-env.sh and
set PYSPARK_PYTHON to the Python executable of my virtualenv. To be
safe, I also made spark-env.sh executable by all, as the docs suggest,
even though it appears the file is sourced rather than executed.
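Concretely, the relevant line in spark-env.sh looks like this (the path
shown is a placeholder for my real virtualenv path):

    # $SPARK_HOME/conf/spark-env.sh, same on the master and both slaves
    export PYSPARK_PYTHON=/path/to/virtualenv/bin/python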
I then started the cluster with start-master.sh and start-slave.sh
accordingly, and inspected the environment variables of each process
under /proc/<pid> to confirm PYSPARK_PYTHON was set correctly, which it
was. I then submitted the first batch of work, only to get exceptions
logged in the driver program (the RPC server) because the slaves were
unable to import my modules when unpickling data.
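For reference, this is roughly the check I did, written out in Python;
the pid 12345 is a placeholder for the actual master/worker process id:

    def proc_env(pid):
        # read the null-separated environment of a running process
        with open('/proc/%d/environ' % pid, 'rb') as f:
            entries = f.read().split(b'\0')
        return dict(e.split(b'=', 1) for e in entries if b'=' in e)

    # 12345 is a placeholder for the real master/worker pid
    print(proc_env(12345).get(b'PYSPARK_PYTHON'))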
After several hours of reading docs and pulling my hair out, I tried
setting PYSPARK_PYTHON directly in the environment from the code of the
RPC server / driver program as follows, based on this mailing list query:
https://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3CCAG-p0g2L=z9H1H4ZY1XdLOGnGyPEKqi8+=tpieqvdwtvwwa...@mail.gmail.com%3E
os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'
To my surprise, that worked. I don't understand why it should: I can't
find any mention of the driver program's environment overriding the
environment on the workers, and that environment variable was
previously completely unset in the driver program anyway.
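For context, this is roughly what the workaround looks like in the
driver code. It's a minimal sketch: the master URL and app name are
placeholders, the rest of the RPC server is omitted, and I'm assuming
the variable has to be set before the SparkContext is created:

    import os
    # set the interpreter before creating the SparkContext; path is a placeholder
    os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'

    from pyspark import SparkConf, SparkContext
    conf = (SparkConf()
            .setMaster('spark://master-host:7077')  # placeholder master URL
            .setAppName('rpc-driver'))              # placeholder app name
    sc = SparkContext(conf=conf)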
Is there an explanation for this that would help me understand how to
do things properly? We run Spark 1.6.0 on Ubuntu 14.04.
Thanks
Kostas