Auberon, can you also post this to the Jupyter Google Group?

On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <auberon.lo...@gmail.com> wrote:
> Hi all,
>
> I've created an updated PR for this based off of the previous work of
> @prabinb:
> https://github.com/apache/spark/pull/8318
>
> I am not very familiar with Python packaging; feedback is appreciated.
>
> -Auberon
>
> On Mon, Aug 10, 2015 at 12:45 PM, MinRK <benjami...@gmail.com> wrote:
>>
>>
>> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <meawo...@gmail.com> wrote:
>>>
>>> I would tentatively suggest conda packaging as well.
>>
>>
>> A conda package has the advantage that it can be set up without
>> 'installing' the pyspark files while the PyPI packaging is still being
>> worked out; it can just add a pyspark.pth file pointing to the pyspark
>> and py4j locations. But I think packaging with conda is a really good
>> idea.
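>>
>> As a rough, untested sketch (assuming SPARK_HOME already points at an
>> existing Spark install, and that the first site-packages directory is
>> writable), a post-link step for such a shim package might look like:
>>
>>     # hypothetical post-link step for a conda pyspark shim package:
>>     # write a pyspark.pth so the interpreter picks up the existing install
>>     import glob
>>     import os
>>     import site
>>
>>     spark_home = os.environ["SPARK_HOME"]  # assumed to be set already
>>     py4j_zip = glob.glob(
>>         os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0]
>>
>>     pth_path = os.path.join(site.getsitepackages()[0], "pyspark.pth")
>>     with open(pth_path, "w") as f:
>>         # each line of a .pth file is appended to sys.path by the site module
>>         f.write(os.path.join(spark_home, "python") + "\n")
>>         f.write(py4j_zip + "\n")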
>>
>> -MinRK
>>
>>>
>>>
>>> http://conda.pydata.org/docs/
>>>
>>> --Matthew Goodman
>>>
>>> =====================
>>> Check Out My Website: http://craneium.net
>>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>>>
>>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <dav...@databricks.com>
>>> wrote:
>>>>
>>>> I think so; any contributions on this are welcome.
>>>>
>>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com>
>>>> wrote:
>>>> > Sorry, trying to follow the context here. Does it look like there is
>>>> > support for the idea of creating a setup.py file and PyPI package for
>>>> > pyspark?
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Brian
>>>> >
>>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com>
>>>> > wrote:
>>>> >> We could do that after 1.5 is released; it will have the same release
>>>> >> cycle as Spark going forward.
>>>> >>
>>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>>> >> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>> +1 (once again :) )
>>>> >>>
>>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
>>>> >>>>
>>>> >>>> // ping
>>>> >>>>
>>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>>>> >>>> publish to
>>>> >>>> PyPI?
>>>> >>>>
>>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>>>> >>>> <freeman.jer...@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of
>>>> >>>>> value in
>>>> >>>>> steps that make it easier to use PySpark as an ordinary python
>>>> >>>>> library.
>>>> >>>>>
>>>> >>>>> You might want to check out this
>>>> >>>>> (https://github.com/minrk/findspark),
>>>> >>>>> started by Jupyter project devs, that offers one way to facilitate
>>>> >>>>> this
>>>> >>>>> stuff. I’ve also cced them here to join the conversation.
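>>>> >>>>>
>>>> >>>>> For reference, basic findspark usage is roughly the following; it
>>>> >>>>> locates Spark via SPARK_HOME (or a few common install locations)
>>>> >>>>> and adds the bundled pyspark and py4j to sys.path:
>>>> >>>>>
>>>> >>>>>     import findspark
>>>> >>>>>     findspark.init()  # a Spark home path can also be passed explicitly
>>>> >>>>>
>>>> >>>>>     from pyspark import SparkContext
>>>> >>>>>     sc = SparkContext("local[4]")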
>>>> >>>>>
>>>> >>>>> Also, @Jey, I can confirm that at least in some scenarios (I’ve
>>>> >>>>> done it in an EC2 cluster in standalone mode) it’s possible to run
>>>> >>>>> PySpark jobs just using `from pyspark import SparkContext; sc =
>>>> >>>>> SparkContext(master=“X”)`, so long as the environment variables
>>>> >>>>> (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* the
>>>> >>>>> workers and the driver. That said, there’s definitely additional
>>>> >>>>> configuration / functionality that would require going through the
>>>> >>>>> proper submit scripts.
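>>>> >>>>>
>>>> >>>>> A minimal sketch of that standalone-mode setup (the master URL and
>>>> >>>>> py4j zip name are placeholders for whatever your cluster and Spark
>>>> >>>>> release actually use):
>>>> >>>>>
>>>> >>>>>     import os, sys
>>>> >>>>>
>>>> >>>>>     spark_home = os.environ["SPARK_HOME"]  # same build on driver and workers
>>>> >>>>>     sys.path.insert(0, os.path.join(spark_home, "python"))
>>>> >>>>>     sys.path.insert(0, os.path.join(
>>>> >>>>>         spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))
>>>> >>>>>     # PYSPARK_PYTHON should point at a matching interpreter on the workers too
>>>> >>>>>     os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
>>>> >>>>>
>>>> >>>>>     from pyspark import SparkContext
>>>> >>>>>     sc = SparkContext(master="spark://master-host:7077")  # placeholder URL
>>>> >>>>>     print(sc.parallelize(range(100)).sum())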
>>>> >>>>>
>>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>>>> >>>>> <punya.bis...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>
>>>> >>>>> I agree with everything Justin just said. An additional advantage
>>>> >>>>> of publishing PySpark's Python code in a standards-compliant way is
>>>> >>>>> that we'll be able to declare transitive dependencies (Pandas,
>>>> >>>>> Py4J) in a way that pip can use. Contrast this with the current
>>>> >>>>> situation, where df.toPandas() exists in the Spark API but doesn't
>>>> >>>>> actually work until you install Pandas.
>>>> >>>>>
>>>> >>>>> Punya
>>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang
>>>> >>>>> <justin.u...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> // + Davies for his comments
>>>> >>>>>> // + Punya for SA
>>>> >>>>>>
>>>> >>>>>> For development and CI, like Olivier mentioned, I think it would
>>>> >>>>>> be hugely beneficial to publish pyspark (only the code in the
>>>> >>>>>> python/ dir) on PyPI. If anyone wants to develop against PySpark
>>>> >>>>>> APIs, they need to download the distribution and do a lot of
>>>> >>>>>> PYTHONPATH munging for all the tools (pylint, pytest, IDE code
>>>> >>>>>> completion). Right now that involves adding python/ and
>>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more
>>>> >>>>>> dependencies, we would have to manually mirror all the PYTHONPATH
>>>> >>>>>> munging in the ./pyspark script. With a proper pyspark setup.py
>>>> >>>>>> that declares its dependencies, and a published distribution,
>>>> >>>>>> depending on pyspark would just mean adding it to my own setup.py
>>>> >>>>>> dependencies.
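>>>> >>>>>>
>>>> >>>>>> As a purely illustrative sketch (the layout, version pin, and
>>>> >>>>>> extras below are assumptions, not an agreed design), such a
>>>> >>>>>> setup.py might look like:
>>>> >>>>>>
>>>> >>>>>>     from setuptools import setup, find_packages
>>>> >>>>>>
>>>> >>>>>>     setup(
>>>> >>>>>>         name="pyspark",
>>>> >>>>>>         version="1.5.0",             # would have to track the Spark release exactly
>>>> >>>>>>         package_dir={"": "python"},  # the Python code lives under python/
>>>> >>>>>>         packages=find_packages(where="python"),
>>>> >>>>>>         install_requires=[
>>>> >>>>>>             "py4j==0.8.2.1",         # the Py4J version bundled with this release
>>>> >>>>>>         ],
>>>> >>>>>>         extras_require={
>>>> >>>>>>             "sql": ["pandas"],       # needed for df.toPandas(), optional otherwise
>>>> >>>>>>         },
>>>> >>>>>>     )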
>>>> >>>>>>
>>>> >>>>>> Of course, if we actually want to run the parts of pyspark that
>>>> >>>>>> are backed by Py4J calls, then we need the full Spark distribution
>>>> >>>>>> with either ./pyspark or ./spark-submit, but for things like
>>>> >>>>>> linting and development, the PYTHONPATH munging is very annoying.
>>>> >>>>>>
>>>> >>>>>> I don't think the version-mismatch issues are a compelling reason
>>>> >>>>>> not to go ahead with PyPI publishing. At runtime, we should
>>>> >>>>>> definitely enforce that the versions match exactly, which means
>>>> >>>>>> there is no backcompat nightmare of the kind Davies describes in
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267. It would also
>>>> >>>>>> mean that even if the user's pip-installed pyspark somehow got
>>>> >>>>>> loaded before the pyspark provided by the Spark distribution, the
>>>> >>>>>> user would be alerted immediately.
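>>>> >>>>>>
>>>> >>>>>> A sketch of the kind of runtime check meant here (the helper name
>>>> >>>>>> is hypothetical, and expected_version would be whatever version
>>>> >>>>>> string the packaged Python code carries):
>>>> >>>>>>
>>>> >>>>>>     # hypothetical guard run at SparkContext startup: fail fast if the
>>>> >>>>>>     # pip-installed Python code doesn't match the JVM-side Spark version
>>>> >>>>>>     def check_spark_version(sc, expected_version):
>>>> >>>>>>         jvm_version = sc.version  # version reported by the Java/Scala side
>>>> >>>>>>         if jvm_version != expected_version:
>>>> >>>>>>             raise RuntimeError(
>>>> >>>>>>                 "PySpark %s cannot be used with Spark %s; the versions "
>>>> >>>>>>                 "must match exactly." % (expected_version, jvm_version))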
>>>> >>>>>>
>>>> >>>>>> Davies, if you buy this, should I or someone on my team pick up
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>> >>>>>> https://github.com/apache/spark/pull/464?
>>>> >>>>>>
>>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>> >>>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>>>> >>>>>>> situation? Because right now, if I want to set up a CI env for
>>>> >>>>>>> PySpark, I have to:
>>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere
>>>> >>>>>>> on every agent
>>>> >>>>>>> 2- define the SPARK_HOME env variable
>>>> >>>>>>> 3- symlink this distribution's pyspark dir into the Python
>>>> >>>>>>> install's site-packages/ directory
>>>> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>>> >>>>>>> project), I also have to (unless I'm mistaken):
>>>> >>>>>>> 4- compile/assemble spark-csv and deploy the jar to a specific
>>>> >>>>>>> directory on every agent
>>>> >>>>>>> 5- add this jar directory to the Spark distribution's additional
>>>> >>>>>>> classpath via the conf/spark-defaults.conf file
>>>> >>>>>>>
>>>> >>>>>>> Then, finally, we can launch our unit/integration tests.
>>>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>>>> >>>>>>> Python-based dependency handling, and some to the way
>>>> >>>>>>> SparkContexts are launched when using pyspark.
>>>> >>>>>>> I think steps 1 and 2 are fair enough.
>>>> >>>>>>> Steps 4 and 5 may already have solutions (I didn't check), and
>>>> >>>>>>> considering that spark-shell downloads such dependencies
>>>> >>>>>>> automatically, I think that if nothing has been done yet, it will
>>>> >>>>>>> be (I guess?).
>>>> >>>>>>>
>>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution
>>>> >>>>>>> would be enough. I'm not exactly advocating distributing a full
>>>> >>>>>>> 300 MB Spark distribution on PyPI; maybe there's a better
>>>> >>>>>>> compromise?
>>>> >>>>>>>
>>>> >>>>>>> Regards,
>>>> >>>>>>>
>>>> >>>>>>> Olivier.
>>>> >>>>>>>
>>>> >>>>>>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <j...@cs.berkeley.edu>
>>>> >>>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Couldn't we have a pip-installable "pyspark" package that just
>>>> >>>>>>>> serves
>>>> >>>>>>>> as a shim to an existing Spark installation? Or it could even
>>>> >>>>>>>> download the
>>>> >>>>>>>> latest Spark binary if SPARK_HOME isn't set during
>>>> >>>>>>>> installation. Right now,
>>>> >>>>>>>> Spark doesn't play very well with the usual Python ecosystem.
>>>> >>>>>>>> For example,
>>>> >>>>>>>> why do I need to use a strange incantation when booting up
>>>> >>>>>>>> IPython if I want
>>>> >>>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would
>>>> >>>>>>>> be much nicer
>>>> >>>>>>>> to just type `from pyspark import SparkContext; sc =
>>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>>>> >>>>>>>>
>>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do
>>>> >>>>>>>> pass when
>>>> >>>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>     PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>>>> >>>>>>>>         python $SPARK_HOME/python/pyspark/rdd.py
>>>> >>>>>>>>
>>>> >>>>>>>> -Jey
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>>>> >>>>>>>> <rosenvi...@gmail.com>
>>>> >>>>>>>> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> This has been proposed before:
>>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>> >>>>>>>>>
>>>> >>>>>>>>> There's currently tighter coupling between the Python and Java
>>>> >>>>>>>>> halves
>>>> >>>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did
>>>> >>>>>>>>> this, I bet
>>>> >>>>>>>>> we'd run into tons of issues when users try to run a newer
>>>> >>>>>>>>> version of the
>>>> >>>>>>>>> Python half of PySpark against an older set of Java components
>>>> >>>>>>>>> or
>>>> >>>>>>>>> vice-versa.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>> >>>>>>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Hi everyone,
>>>> >>>>>>>>>> Considering that the Python API is just a front-end that
>>>> >>>>>>>>>> needs SPARK_HOME defined anyway, I think it would be
>>>> >>>>>>>>>> interesting to deploy the Python part of Spark on PyPI so that
>>>> >>>>>>>>>> a Python project needing PySpark can handle the dependency via
>>>> >>>>>>>>>> pip.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> For now I just symlink python/pyspark into my Python
>>>> >>>>>>>>>> install's site-packages/ directory so that PyCharm and other
>>>> >>>>>>>>>> lint tools work properly (a rough sketch of that step is
>>>> >>>>>>>>>> below).
>>>> >>>>>>>>>> I can do the setup.py work or anything else that's needed.
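>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Sketch of that symlink step only (assuming SPARK_HOME is set
>>>> >>>>>>>>>> and the first site-packages directory is writable):
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>     import os
>>>> >>>>>>>>>>     import site
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>     spark_home = os.environ["SPARK_HOME"]
>>>> >>>>>>>>>>     site_packages = site.getsitepackages()[0]
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>     # expose the bundled pyspark package to PyCharm / lint tools
>>>> >>>>>>>>>>     # without actually installing it
>>>> >>>>>>>>>>     os.symlink(os.path.join(spark_home, "python", "pyspark"),
>>>> >>>>>>>>>>                os.path.join(site_packages, "pyspark"))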
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> What do you think?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Regards,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Olivier.
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>
>>>> >>>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Brian E. Granger
>>>> > Cal Poly State University, San Luis Obispo
>>>> > @ellisonbg on Twitter and GitHub
>>>> > bgran...@calpoly.edu and elliso...@gmail.com
>>>>
>>>
>>
>



-- 
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgran...@calpoly.edu and elliso...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
