Ok, I get it. Now, what can we do to improve the current situation? Because
right now, if I want to set up a CI env for PySpark, I have to:
1- download a pre-built version of Spark and unzip it somewhere on every
agent
2- define the SPARK_HOME env variable
3- symlink this distribution's python/pyspark dir into the Python install's
site-packages/ directory
and if I rely on additional packages (like Databricks' spark-csv project),
I have to (unless I'm mistaken):
4- compile/assemble spark-csv and deploy the jar in a specific directory on
every agent
5- add this jar-filled directory to the Spark distribution's extra
classpath via the conf/spark-defaults.conf file
(the whole sequence is sketched right below)
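
A rough sketch of what this provisioning looks like on an agent today; the
download URL, version numbers, paths and jar names here are placeholders I
made up for the example, not taken from any doc:

# steps 1 & 2: fetch and unpack a pre-built Spark, point SPARK_HOME at it
curl -O http://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar xzf spark-1.3.1-bin-hadoop2.6.tgz -C /opt
export SPARK_HOME=/opt/spark-1.3.1-bin-hadoop2.6
# step 3: make pyspark importable by the agent's Python
# (the py4j zip bundled under $SPARK_HOME/python/lib also has to be
#  reachable, e.g. via PYTHONPATH, as Jey's command below shows)
ln -s $SPARK_HOME/python/pyspark /usr/lib/python2.7/site-packages/pyspark
# steps 4 & 5: deploy the assembled spark-csv jar and declare its directory
cp spark-csv-assembly-*.jar /opt/spark-extra-jars/
echo "spark.driver.extraClassPath /opt/spark-extra-jars/*" >> $SPARK_HOME/conf/spark-defaults.conf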

Then, finally, we can launch our unit/integration tests.
Some of these issues are related to spark-packages, some to the lack of
Python-based dependency management, and some to the way SparkContexts are
launched when using pyspark.
I think steps 1 and 2 are fair enough.
Steps 4 and 5 may already have solutions (I didn't check), and considering
that spark-shell downloads such dependencies automatically, I think that if
nothing has been done for pyspark yet, it eventually will be (I guess?).
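
For instance, if the same resolution mechanism spark-shell uses were usable
from a CI run, steps 4 and 5 would collapse into something like the lines
below. The --packages flag does exist on spark-submit/spark-shell; whether
it behaves the same through pyspark, and the exact spark-csv coordinates
(quoted from memory), are exactly the kind of thing I'd have to check:

spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 run_my_tests.py
# or, interactively:
pyspark --packages com.databricks:spark-csv_2.10:1.0.3

(run_my_tests.py standing in for whatever test entry point the CI agent uses)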

For step 3, maybe just adding a setup.py to the distribution would be
enough. I'm not exactly advocating distributing a full 300MB Spark
distribution on PyPI; maybe there's a better compromise?
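
To illustrate what I mean by "just adding a setup.py": this is only the
minimal shape I have in mind, not an actual Spark file, and the version and
py4j pin are made up for the example:

# python/setup.py (hypothetical sketch)
from setuptools import setup, find_packages

setup(
    name="pyspark",
    version="1.4.0",                     # would have to track the Spark release
    packages=find_packages(),            # picks up pyspark and its subpackages
    install_requires=["py4j==0.8.2.1"],  # instead of the bundled py4j zip
)

With that in place, step 3 would become a plain `pip install -e
$SPARK_HOME/python` on each agent instead of a manual symlink.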

Regards,

Olivier.

On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:

> Couldn't we have a pip installable "pyspark" package that just serves as a
> shim to an existing Spark installation? Or it could even download the
> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
> Spark doesn't play very well with the usual Python ecosystem. For example,
> why do I need to use a strange incantation when booting up IPython if I
> want to use PySpark in a notebook with MASTER="local[4]"? It would be much
> nicer to just type `from pyspark import SparkContext; sc =
> SparkContext("local[4]")` in my notebook.
>
> I did a test and it seems like PySpark's basic unit-tests do pass when
> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>
>
> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
> python $SPARK_HOME/python/pyspark/rdd.py
>
> -Jey
>
>
> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> This has been proposed before:
>> https://issues.apache.org/jira/browse/SPARK-1267
>>
>> There's currently tighter coupling between the Python and Java halves of
>> PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>> we'd run into tons of issues when users try to run a newer version of the
>> Python half of PySpark against an older set of Java components or
>> vice-versa.
>>
>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> Considering that the Python API is just a front-end that needs SPARK_HOME
>>> defined anyway, I think it would be interesting to deploy the Python part
>>> of Spark on PyPI in order to handle the dependencies of a Python project
>>> needing PySpark via pip.
>>>
>>> For now I just symlink python/pyspark into my Python install's
>>> site-packages/ directory so that PyCharm and other lint tools work
>>> properly. I can do the setup.py work, or anything else that's needed.
>>>
>>> What do you think?
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>
>>
>
