All - I would appreciate some insight into how to set PYSPARK_PYTHON correctly.

I have created a virtual environment at the same path on all 3 of my cluster hosts: 2 of them run slaves and one runs the master. I also run an RPC server on the master host so that users from the office (the cluster is hosted elsewhere) can send in work.

On the master and both slaves I created $SPARK_HOME/conf/spark-env.sh and set PYSPARK_PYTHON to the Python executable of my virtualenv. To be safe, I also made spark-env.sh executable by all as the docs suggest, even though the file appears to be sourced rather than executed.
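For reference, the relevant line looks roughly like this (the path below is a placeholder for the actual virtualenv location):

# $SPARK_HOME/conf/spark-env.sh on the master and both slaves
export PYSPARK_PYTHON=/path/to/virtualenv/bin/python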

I then started the cluster with start-master.sh and start-slave.sh and inspected the environment of each process under /proc/<pid> to confirm that PYSPARK_PYTHON was set correctly, which it was. When I sent the first batch of work, all I got were exceptions logged in the driver program (the RPC server): the slaves were unable to import my modules when unpickling the data.
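For completeness, the start-up and the /proc check described above were roughly the following (host name and PID are placeholders):

sbin/start-master.sh                               # on the master host
sbin/start-slave.sh spark://<master-host>:7077     # on each slave host

# /proc environ entries are NUL-separated
tr '\0' '\n' < /proc/<worker-pid>/environ | grep PYSPARK_PYTHON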

After several hours of reading docs and pulling my hair out, I tried setting PYSPARK_PYTHON in the environment from within the code of the RPC server / driver program as follows, based on this mailing list query:

https://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3CCAG-p0g2L=z9H1H4ZY1XdLOGnGyPEKqi8+=tpieqvdwtvwwa...@mail.gmail.com%3E

os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'

To my surprise, that worked. I don't understand why it makes sense: I can't find any mention of the driver program's environment overriding the environment on the workers, and that variable was previously completely unset in the driver program anyway.
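For context, here is roughly where that line now sits in the driver; this is only a sketch, with the app name, master URL and path as placeholders rather than our actual code:

import os
from pyspark import SparkConf, SparkContext

# Placeholder path; points at the virtualenv present on all cluster hosts
os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'

# The variable is set before the SparkContext is created
conf = SparkConf().setAppName('rpc-driver').setMaster('spark://<master-host>:7077')
sc = SparkContext(conf=conf)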

Is there an explanation for this that would help me understand how to do things properly? We run Spark 1.6.0 on Ubuntu 14.04.

Thanks

Kostas
