Couldn't we have a pip-installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set at installation time. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need a strange incantation when starting IPython just to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to simply type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
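To make that concrete, here is a rough, purely hypothetical sketch of what such a shim could do: locate SPARK_HOME, put Spark's Python sources and its bundled Py4J zip on sys.path, and then let a plain `from pyspark import SparkContext` work in any notebook or script. The `init()` helper and its behavior are made up for illustration; nothing like this ships with Spark today.

    # Hypothetical shim: wire an existing Spark installation into sys.path
    # so that PySpark can be imported like any other Python package.
    import glob
    import os
    import sys


    def init(spark_home=None):
        """Add an existing Spark installation's Python sources to sys.path."""
        spark_home = spark_home or os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME is not set and no spark_home was given")

        python_dir = os.path.join(spark_home, "python")
        # Spark bundles Py4J as a zip under python/lib (the version varies by release).
        py4j_zips = glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip"))
        if not py4j_zips:
            raise RuntimeError("No bundled py4j zip found under %s" % python_dir)

        for path in [python_dir, py4j_zips[0]]:
            if path not in sys.path:
                sys.path.insert(0, path)


    if __name__ == "__main__":
        init()
        from pyspark import SparkContext
        sc = SparkContext("local[4]")
        print(sc.parallelize(range(10)).sum())
        sc.stop()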
I did a test, and it seems like PySpark's basic unit tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:

    PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py

-Jey

On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:

> This has been proposed before:
> https://issues.apache.org/jira/browse/SPARK-1267
>
> There's currently tighter coupling between the Python and Java halves of
> PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
> we'd run into tons of issues when users try to run a newer version of the
> Python half of PySpark against an older set of Java components, or
> vice versa.
>
> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Hi everyone,
>> Considering the Python API is just a front end that needs SPARK_HOME
>> defined anyway, I think it would be interesting to deploy the Python part
>> of Spark on PyPI in order to handle the dependencies of a Python project
>> needing PySpark via pip.
>>
>> For now I just symlink python/pyspark into my Python install's
>> site-packages/ dir so that PyCharm and other lint tools work properly.
>> I can do the setup.py work, or anything else that's needed.
>>
>> What do you think?
>>
>> Regards,
>>
>> Olivier.
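For reference, a bare-bones setup.py along the lines Olivier is offering might look roughly like the sketch below, assuming it is run from Spark's python/ directory. It is only a starting point: it sidesteps the hard parts Josh raises, namely shipping the bundled Py4J zip and keeping the Python and Java halves at matching versions.

    # Rough sketch of a minimal setup.py for packaging just the Python half of
    # PySpark, run from Spark's python/ directory. Treat it as a starting point,
    # not a proposal; version pinning against the Java side is the open problem.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="0.0.0",  # placeholder; would need to match the Spark release exactly
        description="Python bindings for Apache Spark (requires SPARK_HOME)",
        packages=find_packages(),  # picks up pyspark and its subpackages
        install_requires=["py4j"],  # or ship the Py4J zip that Spark bundles instead
    )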