Auberon, can you also post this to the Jupyter Google Group?

On Wed, Aug 19, 2015 at 12:23 PM, Auberon Lopez <auberon.lo...@gmail.com> wrote:
> Hi all,
>
> I've created an updated PR for this based off of the previous work of
> @prabinb: https://github.com/apache/spark/pull/8318
>
> I am not very familiar with python packaging; feedback is appreciated.
>
> -Auberon
>
> On Mon, Aug 10, 2015 at 12:45 PM, MinRK <benjami...@gmail.com> wrote:
>>
>> On Mon, Aug 10, 2015 at 12:28 PM, Matt Goodman <meawo...@gmail.com> wrote:
>>>
>>> I would tentatively suggest also conda packaging.
>>
>> A conda package has the advantage that it can be set up without
>> 'installing' the pyspark files, while the PyPI packaging is still being
>> worked out. It can just add a pyspark.pth file pointing to the pyspark and
>> py4j locations. But I think it's a really good idea to package with conda.
>>
>> -MinRK
>>
>>> http://conda.pydata.org/docs/
>>>
>>> --Matthew Goodman
>>>
>>> =====================
>>> Check Out My Website: http://craneium.net
>>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>>>
>>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <dav...@databricks.com> wrote:
>>>>
>>>> I think so, any contributions on this are welcome.
>>>>
>>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <elliso...@gmail.com> wrote:
>>>> > Sorry, trying to follow the context here. Does it look like there is
>>>> > support for the idea of creating a setup.py file and pypi package for
>>>> > pyspark?
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Brian
>>>> >
>>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <dav...@databricks.com> wrote:
>>>> >> We could do that after 1.5 is released; it will have the same release
>>>> >> cycle as Spark in the future.
>>>> >>
>>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>>> >> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>> +1 (once again :) )
>>>> >>>
>>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.u...@gmail.com>:
>>>> >>>>
>>>> >>>> // ping
>>>> >>>>
>>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>>>> >>>> publish to PyPI?
>>>> >>>>
>>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman
>>>> >>>> <freeman.jer...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of
>>>> >>>>> value in steps that make it easier to use PySpark as an ordinary
>>>> >>>>> python library.
>>>> >>>>>
>>>> >>>>> You might want to check out findspark
>>>> >>>>> (https://github.com/minrk/findspark), started by Jupyter project
>>>> >>>>> devs, which offers one way to facilitate this stuff. I’ve also
>>>> >>>>> cced them here to join the conversation.
>>>> >>>>>
>>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>>>> >>>>> (I’ve done it in an EC2 cluster in standalone mode) it’s possible
>>>> >>>>> to run PySpark jobs just using `from pyspark import SparkContext;
>>>> >>>>> sc = SparkContext(master=“X”)` so long as the environment variables
>>>> >>>>> (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers
>>>> >>>>> and driver. That said, there’s definitely additional configuration /
>>>> >>>>> functionality that would require going through the proper submit
>>>> >>>>> scripts.
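A minimal sketch of the findspark approach Jeremy mentions, assuming SPARK_HOME points at an existing Spark distribution (the explicit path shown is only a placeholder):

    # Sketch only: findspark (https://github.com/minrk/findspark) prepends the
    # pyspark and py4j directories under SPARK_HOME to sys.path.
    import findspark
    findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

    from pyspark import SparkContext
    sc = SparkContext(master="local[4]")

After this, pyspark imports work in a plain IPython or notebook session without the PYTHONPATH incantation discussed further down the thread.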
>>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal
>>>> >>>>> <punya.bis...@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>> I agree with everything Justin just said. An additional advantage of
>>>> >>>>> publishing PySpark's Python code in a standards-compliant way is the
>>>> >>>>> fact that we'll be able to declare transitive dependencies (Pandas,
>>>> >>>>> Py4J) in a way that pip can use. Contrast this with the current
>>>> >>>>> situation, where df.toPandas() exists in the Spark API but doesn't
>>>> >>>>> actually work until you install Pandas.
>>>> >>>>>
>>>> >>>>> Punya
>>>> >>>>>
>>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.u...@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> // + Davies for his comments
>>>> >>>>>> // + Punya for SA
>>>> >>>>>>
>>>> >>>>>> For development and CI, like Olivier mentioned, I think it would be
>>>> >>>>>> hugely beneficial to publish pyspark (only the code in the python/
>>>> >>>>>> dir) on PyPI. If anyone wants to develop against PySpark APIs, they
>>>> >>>>>> need to download the distribution and do a lot of PYTHONPATH
>>>> >>>>>> munging for all the tools (pylint, pytest, IDE code completion).
>>>> >>>>>> Right now that involves adding python/ and
>>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add
>>>> >>>>>> more dependencies, we would have to manually mirror all the
>>>> >>>>>> PYTHONPATH munging in the ./pyspark script. With a proper pyspark
>>>> >>>>>> setup.py which declares its dependencies, and a published
>>>> >>>>>> distribution, depending on pyspark will just be adding pyspark to
>>>> >>>>>> my setup.py dependencies.
>>>> >>>>>>
>>>> >>>>>> Of course, if we actually want to run the parts of pyspark that are
>>>> >>>>>> backed by Py4J calls, then we need the full spark distribution with
>>>> >>>>>> either ./pyspark or ./spark-submit, but for things like linting and
>>>> >>>>>> development, the PYTHONPATH munging is very annoying.
>>>> >>>>>>
>>>> >>>>>> I don't think the version-mismatch issues are a compelling reason
>>>> >>>>>> not to go ahead with PyPI publishing. At runtime, we should
>>>> >>>>>> definitely enforce that the version has to be exact, which means
>>>> >>>>>> there is no backcompat nightmare as suggested by Davies in
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267. This would mean
>>>> >>>>>> that even if the user's pip-installed pyspark somehow got loaded
>>>> >>>>>> before the pyspark provided by the Spark distribution, the user
>>>> >>>>>> would be alerted immediately.
>>>> >>>>>>
>>>> >>>>>> Davies, if you buy this, should I or someone on my team pick up
>>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>> >>>>>> https://github.com/apache/spark/pull/464?
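To make the packaging idea concrete, here is a hedged sketch of the kind of setup.py Justin describes for the python/ directory; the package version and dependency pins are placeholders, not the actual packaging adopted by Spark:

    # Hypothetical setup.py for python/; names, versions and pins are illustrative.
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",          # must match the Spark distribution exactly
        packages=find_packages(),
        install_requires=[
            "py4j==0.8.2.1",      # the Py4J version bundled as py4j-0.8.2.1-src.zip
            "pandas",             # so df.toPandas() works out of the box
        ],
    )

With something like this published, pip would pull in Py4J and Pandas transitively, which is the advantage Punya points out above.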
>>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>> >>>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>> Ok, I get it. Now what can we do to improve the current situation?
>>>> >>>>>>> Because right now, if I want to set up a CI env for PySpark, I
>>>> >>>>>>> have to:
>>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it somewhere
>>>> >>>>>>> on every agent
>>>> >>>>>>> 2- define the SPARK_HOME env variable
>>>> >>>>>>> 3- symlink this distribution's pyspark dir inside the python
>>>> >>>>>>> install dir's site-packages/ directory
>>>> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>>> >>>>>>> project), I have to (except if I'm mistaken)
>>>> >>>>>>> 4- compile/assemble spark-csv and deploy the jar in a specific
>>>> >>>>>>> directory on every agent
>>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>>>> >>>>>>> additional classpath using the conf/spark-defaults.conf file
>>>> >>>>>>>
>>>> >>>>>>> Then finally we can launch our unit/integration tests.
>>>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>>>> >>>>>>> python-based dependency management, and some to the way
>>>> >>>>>>> SparkContexts are launched when using pyspark.
>>>> >>>>>>> I think steps 1 and 2 are fair enough.
>>>> >>>>>>> Steps 4 and 5 may already have solutions; I didn't check, and
>>>> >>>>>>> considering spark-shell downloads such dependencies automatically,
>>>> >>>>>>> I think if nothing's done yet it will be (I guess?).
>>>> >>>>>>>
>>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution would
>>>> >>>>>>> be enough. I'm not exactly advocating distributing a full 300MB
>>>> >>>>>>> spark distribution on PyPI; maybe there's a better compromise?
>>>> >>>>>>>
>>>> >>>>>>> Regards,
>>>> >>>>>>>
>>>> >>>>>>> Olivier.
>>>> >>>>>>>
>>>> >>>>>>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam
>>>> >>>>>>> <j...@cs.berkeley.edu> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that just
>>>> >>>>>>>> serves as a shim to an existing Spark installation? Or it could
>>>> >>>>>>>> even download the latest Spark binary if SPARK_HOME isn't set
>>>> >>>>>>>> during installation. Right now, Spark doesn't play very well with
>>>> >>>>>>>> the usual Python ecosystem. For example, why do I need to use a
>>>> >>>>>>>> strange incantation when booting up IPython if I want to use
>>>> >>>>>>>> PySpark in a notebook with MASTER="local[4]"? It would be much
>>>> >>>>>>>> nicer to just type `from pyspark import SparkContext; sc =
>>>> >>>>>>>> SparkContext("local[4]")` in my notebook.
>>>> >>>>>>>>
>>>> >>>>>>>> I did a test and it seems like PySpark's basic unit tests do pass
>>>> >>>>>>>> when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>> >>>>>>>>
>>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>>>> >>>>>>>>
>>>> >>>>>>>> -Jey
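A rough sketch of the shim Jey floats, assuming SPARK_HOME is set and the bundled Py4J zip sits in its usual python/lib/ location; the module name is hypothetical and this is not an existing package:

    # pyspark_shim.py (hypothetical): make the pyspark that ships with an
    # existing Spark install importable without manual PYTHONPATH munging.
    import glob
    import os
    import sys

    spark_home = os.environ["SPARK_HOME"]  # assumes the user has set SPARK_HOME
    sys.path.insert(0, os.path.join(spark_home, "python"))

    # Pick up whichever Py4J source zip this Spark version bundles.
    for zip_path in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
        sys.path.insert(0, zip_path)

    from pyspark import SparkContext  # now importable in IPython, pytest, pylint, etc.

This mirrors what the PYTHONPATH one-liner above does, just packaged so that pip can install it.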
>>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen
>>>> >>>>>>>> <rosenvi...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> This has been proposed before:
>>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>> >>>>>>>>>
>>>> >>>>>>>>> There's currently tighter coupling between the Python and Java
>>>> >>>>>>>>> halves of PySpark than just requiring SPARK_HOME to be set; if
>>>> >>>>>>>>> we did this, I bet we'd run into tons of issues when users try
>>>> >>>>>>>>> to run a newer version of the Python half of PySpark against an
>>>> >>>>>>>>> older set of Java components, or vice-versa.
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>> >>>>>>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Hi everyone,
>>>> >>>>>>>>>> Considering that the python API is just a front-end needing
>>>> >>>>>>>>>> SPARK_HOME defined anyway, I think it would be interesting to
>>>> >>>>>>>>>> deploy the Python part of Spark on PyPI in order to handle the
>>>> >>>>>>>>>> dependencies in a Python project needing PySpark via pip.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> For now I just symlink the python/pyspark dir into my python
>>>> >>>>>>>>>> install dir's site-packages/ in order for PyCharm or other
>>>> >>>>>>>>>> lint tools to work properly.
>>>> >>>>>>>>>> I can do the setup.py work or anything.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> What do you think?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Regards,
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Olivier.
>>>> >
>>>> > --
>>>> > Brian E. Granger
>>>> > Cal Poly State University, San Luis Obispo
>>>> > @ellisonbg on Twitter and GitHub
>>>> > bgran...@calpoly.edu and elliso...@gmail.com
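On the version-coupling concern, a sketch of the strict runtime check Justin proposes; it assumes the Python package exposes a __version__ and that the running JVM side reports its version through sc.version, neither of which is guaranteed by the code discussed in this thread:

    # Sketch only: fail fast if the pip-installed Python half does not match
    # the Spark runtime it is talking to.
    import pyspark

    def check_version(sc):
        jvm_version = sc.version  # version of the Spark jars actually running
        py_version = getattr(pyspark, "__version__", None)  # pip-installed Python half
        if py_version is not None and py_version != jvm_version:
            raise RuntimeError(
                "PySpark %s does not match the Spark runtime %s; "
                "install the matching pyspark release." % (py_version, jvm_version))

Combined with an exact pin in setup.py, a mismatched pip-installed pyspark would be reported immediately instead of surfacing as the subtle incompatibilities Josh describes.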
--
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
bgran...@calpoly.edu and elliso...@gmail.com

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org