Ok, I get it. Now, what can we do to improve the current situation? Because right now, if I want to set up a CI environment for PySpark, I have to:

1- download a pre-built Spark distribution and unzip it somewhere on every agent
2- define the SPARK_HOME env variable
3- symlink that distribution's python/pyspark directory into the Python install's site-packages/ directory

and if I rely on additional packages (like Databricks' spark-csv project), I also have to (unless I'm mistaken):

4- compile/assemble spark-csv and deploy the jar in a specific directory on every agent
5- add this jar-filled directory to the Spark distribution's extra classpath via conf/spark-defaults.conf

(a rough sketch of what this amounts to is below)
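To make that concrete, here is roughly what steps 1 to 3 look like on an agent, as a quick Python 2.7 sketch; the Spark version, mirror URL and paths are just placeholders, and steps 4-5 are only hinted at in comments:

    import os
    import site
    import tarfile
    import urllib2

    SPARK_VERSION = "1.3.1"  # placeholder version
    SPARK_DIST = "spark-%s-bin-hadoop2.6" % SPARK_VERSION
    SPARK_URL = "https://archive.apache.org/dist/spark/spark-%s/%s.tgz" % (SPARK_VERSION, SPARK_DIST)
    INSTALL_DIR = "/opt"
    SPARK_HOME = os.path.join(INSTALL_DIR, SPARK_DIST)

    # step 1: download and unpack a pre-built distribution
    if not os.path.isdir(SPARK_HOME):
        archive = os.path.join(INSTALL_DIR, SPARK_DIST + ".tgz")
        with open(archive, "wb") as out:
            out.write(urllib2.urlopen(SPARK_URL).read())
        tarfile.open(archive, "r:gz").extractall(INSTALL_DIR)

    # step 2: export SPARK_HOME for whatever this script launches next
    os.environ["SPARK_HOME"] = SPARK_HOME

    # step 3: symlink the pyspark package into site-packages so that a plain
    # `import pyspark` (and PyCharm/lint tools) works; py4j still has to be
    # importable separately, e.g. via `pip install py4j`
    link = os.path.join(site.getsitepackages()[0], "pyspark")
    if not os.path.exists(link):
        os.symlink(os.path.join(SPARK_HOME, "python", "pyspark"), link)

    # steps 4-5: copy the assembled spark-csv jar somewhere on the agent and
    # declare it in $SPARK_HOME/conf/spark-defaults.conf, for example:
    #   spark.driver.extraClassPath    /opt/ci-jars/spark-csv-assembly.jar
    #   spark.executor.extraClassPath  /opt/ci-jars/spark-csv-assembly.jar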
Then, finally, we can launch our unit/integration tests. Some of these issues are related to spark-packages, some to the lack of a Python-level dependency mechanism, and some to the way SparkContexts are launched when using PySpark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions (I didn't check), and considering that spark-shell downloads such dependencies automatically, I guess PySpark will eventually do the same if it doesn't already. For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300 MB Spark distribution on PyPI, maybe there's a better compromise? (A rough sketch of such a setup.py is at the bottom of this mail, after the quoted thread.)

Regards,

Olivier.

On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:

> Couldn't we have a pip installable "pyspark" package that just serves as a shim to an existing Spark installation? Or it could even download the latest Spark binary if SPARK_HOME isn't set during installation. Right now, Spark doesn't play very well with the usual Python ecosystem. For example, why do I need to use a strange incantation when booting up IPython if I want to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer to just type `from pyspark import SparkContext; sc = SparkContext("local[4]")` in my notebook.
>
> I did a test and it seems like PySpark's basic unit-tests do pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey
>
> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> This has been proposed before: https://issues.apache.org/jira/browse/SPARK-1267
>>
>> There's currently tighter coupling between the Python and Java halves of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet we'd run into tons of issues when users try to run a newer version of the Python half of PySpark against an older set of Java components or vice-versa.
>>
>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> Considering the python API as just a front needing the SPARK_HOME defined anyway, I think it would be interesting to deploy the Python part of Spark on PyPi in order to handle the dependencies in a Python project needing PySpark via pip.
>>>
>>> For now I just symlink the python/pyspark in my python install dir site-packages/ in order for PyCharm or other lint tools to work properly. I can do the setup.py work or anything.
>>>
>>> What do you think ?
>>>
>>> Regards,
>>>
>>> Olivier.
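PS: to make the setup.py idea above a bit more concrete, here is the kind of minimal sketch I have in mind for the distribution's python/ directory (the package name, version and py4j pin are my guesses, nothing official):

    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.0",  # placeholder, would track the Spark release
        description="Python API for Apache Spark "
                    "(still needs SPARK_HOME to point at a Spark distribution)",
        packages=find_packages(exclude=["*.tests"]),
        install_requires=["py4j==0.8.2.1"],
    )

That alone wouldn't solve the version-skew problem Josh mentions, but it would at least make pip installs and lint tools happy without manual symlinking.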