I agree with everything Justin just said. An additional advantage of
publishing PySpark's Python code in a standards-compliant way is that we'll
be able to declare its transitive dependencies (Pandas, Py4J) in a way that
pip can use. Contrast this with the current situation, where df.toPandas()
exists in the Spark API but doesn't actually work until you install Pandas
yourself.
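
For concreteness, here's a rough sketch of what the dependency declaration
could look like in a pyspark setup.py (the exact version pins are
illustrative, not a proposal):

from setuptools import setup, find_packages

setup(
    name="pyspark",
    version="1.4.1",  # illustrative; this would track the Spark release
    packages=find_packages(),
    install_requires=[
        "py4j==0.8.2.1",   # the JVM bridge the Python half needs
        "pandas>=0.13",    # so df.toPandas() works out of the box
    ],
)

With something like that, pip would pull in Py4J and Pandas automatically
when someone runs "pip install pyspark". (Whether Pandas should be a hard
requirement or an optional extra is a separate question.)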

Punya
On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com> wrote:

> // + *Davies* for his comments
> // + Punya for SA
>
> For development and CI, like Olivier mentioned, I think it would be hugely
> beneficial to publish pyspark (only the code in the python/ dir) on PyPI.
> If anyone wants to develop against the PySpark APIs, they need to download
> the distribution and do a lot of PYTHONPATH munging for all the tools
> (pylint, pytest, IDE code completion). Right now that involves adding
> python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever adds more
> dependencies, we would have to manually mirror all the PYTHONPATH munging
> done in the ./pyspark script. With a proper pyspark setup.py that declares
> its dependencies, and a published distribution, depending on pyspark would
> just mean adding pyspark to my own setup.py dependencies.
>
> Of course, if we actually want to run the parts of pyspark that are backed
> by Py4J calls, then we need the full Spark distribution with either
> ./pyspark or ./spark-submit, but for things like linting and development,
> the PYTHONPATH munging is very annoying.
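>
> To make the "munging" concrete, this is roughly the boilerplate I mean
> (e.g. in a conftest.py or a tool config; the paths are illustrative):
>
> import glob, os, sys
>
> spark_home = os.environ["SPARK_HOME"]
> sys.path.insert(0, os.path.join(spark_home, "python"))
> # the py4j zip name changes between Spark releases, hence the glob
> sys.path.insert(0, glob.glob(
>     os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])
>
> ...and every tool needs its own variant of it.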
>
> I don't think the version-mismatch issues are a compelling reason not to
> go ahead with PyPI publishing. At runtime, we should definitely enforce
> that the versions match exactly, which means there is no backcompat
> nightmare as suggested by Davies in
> https://issues.apache.org/jira/browse/SPARK-1267. That way, even if the
> user's pip-installed pyspark somehow got loaded before the pyspark
> provided by the Spark distribution, they would be alerted immediately.
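>
> The check itself could be as simple as something like this (just a sketch
> of the idea, not actual Spark code; it assumes the Python package starts
> shipping a pyspark.__version__):
>
> import pyspark
>
> def check_version(sc):
>     # fail fast if the pip-installed Python half doesn't match the JVM half
>     jvm_version = sc._jsc.version()  # same call that backs sc.version
>     if jvm_version != pyspark.__version__:
>         raise RuntimeError(
>             "PySpark %s cannot be used with Spark %s; the versions must "
>             "match exactly" % (pyspark.__version__, jvm_version))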
>
> *Davies*, if you buy this, should I or someone on my team pick up
> https://issues.apache.org/jira/browse/SPARK-1267 and
> https://github.com/apache/spark/pull/464?
>
> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <
> o.girar...@lateral-thoughts.com> wrote:
>
>> OK, I get it. Now what can we do to improve the current situation?
>> Because right now, if I want to set up a CI env for PySpark, I have to:
>> 1- download a pre-built version of pyspark and unzip it somewhere on
>> every agent
>> 2- define the SPARK_HOME env variable
>> 3- symlink this distribution's pyspark dir into the Python install's
>> site-packages/ directory
>> and if I rely on additional packages (like Databricks' spark-csv
>> project), I have to (unless I'm mistaken)
>> 4- compile/assemble spark-csv and deploy the jar in a specific directory
>> on every agent
>> 5- add this jar-filled directory to the Spark distribution's additional
>> classpath using the conf/spark-defaults.conf file (roughly as shown below)
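>>
>> For reference, the classpath part looks roughly like this in
>> conf/spark-defaults.conf (the directory and jar name are purely
>> illustrative):
>>
>> spark.driver.extraClassPath   /opt/ci/extra-jars/spark-csv-assembly.jar
>> spark.executor.extraClassPath /opt/ci/extra-jars/spark-csv-assembly.jar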
>>
>> Then, finally, we can launch our unit/integration tests.
>> Some of these issues are related to spark-packages, some to the lack of
>> Python-based dependency management, and some to the way SparkContexts are
>> launched when using pyspark.
>> I think steps 1 and 2 are fair enough.
>> Steps 4 and 5 may already have solutions; I didn't check, but considering
>> that spark-shell downloads such dependencies automatically, I expect this
>> will be handled eventually if it isn't already.
>>
>> For step 3, maybe just adding a setup.py to the distribution would be
>> enough. I'm not exactly advocating distributing a full 300 MB Spark
>> distribution on PyPI; maybe there's a better compromise?
>>
>> Regards,
>>
>> Olivier.
>>
>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
>>
>>> Couldn't we have a pip installable "pyspark" package that just serves as
>>> a shim to an existing Spark installation? Or it could even download the
>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>>> Spark doesn't play very well with the usual Python ecosystem. For example,
>>> why do I need to use a strange incantation when booting up IPython if I
>>> want to use PySpark in a notebook with MASTER="local[4]"? It would be much
>>> nicer to just type `from pyspark import SparkContext; sc =
>>> SparkContext("local[4]")` in my notebook.
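>>>
>>> Such a shim could be pretty small -- something along these lines (purely
>>> a sketch of a hypothetical package, assuming SPARK_HOME points at an
>>> existing installation):
>>>
>>> import glob, os, sys
>>>
>>> def init(spark_home=None):
>>>     """Put an existing Spark install's Python sources on sys.path."""
>>>     spark_home = spark_home or os.environ.get("SPARK_HOME")
>>>     if spark_home is None:
>>>         raise RuntimeError("SPARK_HOME is not set; set it or pass spark_home=")
>>>     sys.path.insert(0, os.path.join(spark_home, "python"))
>>>     sys.path.insert(0, glob.glob(os.path.join(
>>>         spark_home, "python", "lib", "py4j-*-src.zip"))[0])
>>>
>>> After calling init(), the plain `from pyspark import SparkContext; sc =
>>> SparkContext("local[4]")` above should work in the notebook.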
>>>
>>> I did a test and it seems like PySpark's basic unit-tests do pass when
>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>
>>>
>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>>>     python $SPARK_HOME/python/pyspark/rdd.py
>>>
>>> -Jey
>>>
>>>
>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenvi...@gmail.com>
>>> wrote:
>>>
>>>> This has been proposed before:
>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>>
>>>> There's currently tighter coupling between the Python and Java halves
>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>>>> we'd run into tons of issues when users try to run a newer version of the
>>>> Python half of PySpark against an older set of Java components or
>>>> vice-versa.
>>>>
>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>> Considering that the Python API is just a frontend that needs SPARK_HOME
>>>>> defined anyway, I think it would be interesting to publish the Python part
>>>>> of Spark on PyPI, so that a Python project needing PySpark could handle
>>>>> the dependency via pip.
>>>>>
>>>>> For now I just symlink python/pyspark into my Python install's
>>>>> site-packages/ directory so that PyCharm and other lint tools work
>>>>> properly. I'm happy to do the setup.py work or anything else.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>
>>>>
>>>
