I don't have any strong views, so just to highlight possible issues:

* Based on different issues I've seen, there is a substantial number of users who depend on system-wide Python installations. As far as I am aware, neither Py4j nor cloudpickle is present in the standard system repositories of Debian or Red Hat derivatives.
* Assuming that Spark is committed to supporting Python 2 beyond its end of life, we have to be sure that any external dependency has the same policy.
* Py4j is missing from the default Anaconda channel. Not a big issue, just a small annoyance.
* External dependencies with pinned versions add some overhead to development across versions - effectively we may need a separate environment for each major Spark release (see the sketch below). I've seen small inconsistencies in PySpark behavior with different Py4j versions, so this is not completely hypothetical.
* Adding possible version conflicts. It is probably not a big risk, but something to consider (for example in the combination Blaze + Dask + PySpark).
* Adding another party the user has to trust.
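Just to make the pinning overhead concrete, here is a rough sketch of what a pinned dependency block in setup.py could look like - the file layout and version numbers below are placeholders I made up for illustration, not a concrete proposal:

    # Hypothetical excerpt from a PySpark setup.py; versions are placeholders only.
    from setuptools import setup

    setup(
        name='pyspark',
        # ... other packaging metadata ...
        install_requires=[
            'py4j==0.10.4',        # example pin replacing the bundled zip
            'cloudpickle==0.2.2',  # example pin on the cloudpipe fork
        ],
    )

With pins like these, working against two Spark branches that pin different versions would indeed mean keeping a separate virtualenv or conda env per release, since pip cannot install two versions of the same package side by side.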
On 02/14/2017 12:22 AM, Holden Karau wrote:
> It's a good question. Py4J seems to have been updated 5 times in 2016,
> and updating it is a bit involved (from a review point of view, verifying
> the zip file contents is somewhat tedious).
>
> cloudpickle is a bit harder to assess, since we can have changes to
> cloudpickle which aren't correctly tagged as backporting changes from
> the fork (and this can take a while to review, since we don't always
> catch them right away as being backports).
>
> Another difficulty with looking at backports is that, since our review
> process for PySpark has historically been on the slow side, changes
> benefiting systems like dask or IPython parallel were not backported
> to Spark unless they caused serious errors.
>
> I think the key benefits are better test coverage of the forked
> version of cloudpickle, a more standardized packaging of dependencies,
> and simpler dependency updates, which reduce the friction in gaining
> benefits from other related projects' work - Python serialization
> really isn't our secret sauce.
>
> If I'm missing any substantial benefits or costs I'd love to know :)
>
> On Mon, Feb 13, 2017 at 3:03 PM, Reynold Xin <r...@databricks.com> wrote:
>
> With any dependency update (or refactoring of existing code), I
> always ask this question: what's the benefit? In this case it
> looks like the benefit is to reduce the effort spent on backports.
> Do you know how often we needed to do those?
>
> On Tue, Feb 14, 2017 at 12:01 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> Hi PySpark Developers,
>
> Cloudpickle is a core part of PySpark, originally copied from
> (and improved upon) PiCloud. Since then other projects have found
> cloudpickle useful, and a fork of cloudpickle
> <https://github.com/cloudpipe/cloudpickle> was created and is now
> maintained as its own library
> <https://pypi.python.org/pypi/cloudpickle> (with better test
> coverage and resulting bug fixes, I understand). We've had a few
> PRs backporting fixes from the cloudpickle project into Spark's
> local copy of cloudpickle - how would people feel about moving to
> taking an explicit (pinned) dependency on cloudpickle?
>
> We could add cloudpickle to setup.py and a requirements.txt file
> for users who prefer not to do a system installation of PySpark.
>
> Py4J is maybe an even simpler case: we currently have a zip of
> Py4J in our repo, but we could instead require a pinned version.
> While we do depend on a lot of Py4J internal APIs, version pinning
> should be sufficient to ensure functionality (and simplify the
> update process).
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau

--
Maciej Szymkiewicz