All - would appreciate some insight regarding how to set PYSPARK_PYTHON
correctly.
I have created a virtual environment at the same path on all 3 of my
cluster hosts: 2 of them run slaves and one runs the master. I also
run an RPC server on the master host so that users from the office (the
cluster is hosted elsewhere) can submit work.
For the master and slaves, I created $SPARK_HOME/conf/spark-env.sh and
set PYSPARK_PYTHON to the Python executable of my virtualenv. To be
safe, I also made spark-env.sh executable by all, as the docs suggest,
even though it appears the file is sourced rather than executed.
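Concretely, the relevant line in spark-env.sh looks like this (the path
shown is a placeholder for my real virtualenv path):

    # $SPARK_HOME/conf/spark-env.sh, same on the master and both slaves
    export PYSPARK_PYTHON=/path/to/virtualenv/bin/python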
I then started the cluster with start-master.sh and start-slave.sh
accordingly, and inspected the environment variables of each process
under /proc/<pid> to confirm PYSPARK_PYTHON was set correctly, which it
was. I then submitted the first batch of work, only to get exceptions
logged in the driver program (the RPC server) because the slaves were
unable to import my modules when unpickling data.
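For reference, this is roughly the check I did, written out in Python;
the pid 12345 is a placeholder for the actual master/worker process id:

    def proc_env(pid):
        # read the null-separated environment of a running process
        with open('/proc/%d/environ' % pid, 'rb') as f:
            entries = f.read().split(b'\0')
        return dict(e.split(b'=', 1) for e in entries if b'=' in e)

    # 12345 is a placeholder for the real master/worker pid
    print(proc_env(12345).get(b'PYSPARK_PYTHON'))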
After several hours of reading docs and pulling my hair out, I tried
setting PYSPARK_PYTHON directly in the environment from the code of the
RPC server / driver program as follows, based on this mailing list query:
https://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3CCAG-p0g2L=z9H1H4ZY1XdLOGnGyPEKqi8+=tpieqvdwtvwwa...@mail.gmail.com%3E
os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'
To my surprise, that worked. I don't understand why it should: I can't
find any mention of the driver program's environment overriding the
environment on the workers, and that environment variable was
previously completely unset in the driver program anyway.
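For context, this is roughly what the workaround looks like in the
driver code. It's a minimal sketch: the master URL and app name are
placeholders, the rest of the RPC server is omitted, and I'm assuming
the variable has to be set before the SparkContext is created:

    import os
    # set the interpreter before creating the SparkContext; path is a placeholder
    os.environ['PYSPARK_PYTHON'] = '/path/to/virtualenv/bin/python'

    from pyspark import SparkConf, SparkContext
    conf = (SparkConf()
            .setMaster('spark://master-host:7077')  # placeholder master URL
            .setAppName('rpc-driver'))              # placeholder app name
    sc = SparkContext(conf=conf)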
Is there an explanation for this that would help me understand how to
do things properly? We run Spark 1.6.0 on Ubuntu 14.04.
Thanks
Kostas