Hi Bryan,

Yup, I support matching the version. I have pushed this forward a few times before, syncing Spark's copy with https://github.com/cloudpipe/cloudpickle and also contributing a few fixes to cloudpickle itself. I believe our copy is closest to 0.4.1.
As for which version we should match, I have been trying to follow the changes in cloudpipe/cloudpickle, and I think we should match 0.4.2 first (I need to double check), because IMHO they have been making rather radical changes since 0.5.0, including a change to the default pickle protocol. Personally, I would eventually like to match the latest, because there have been some important changes - for example, see https://github.com/cloudpipe/cloudpickle/pull/138 (still pending review) - but 0.4.2 should be a good starting point. For the strategy, I think we can match it and then follow the 0.4.x line within Spark, as the conservative and safe choice with minimal cost.

I tried to leave a few explicit answers to your questions, Bryan:

> Spark is currently using a forked version and it seems like updates are made every now and then when
> needed, but it's not really clear where the current state is and how much it has diverged.

I am quite sure our cloudpickle copy is closest to 0.4.1, IIRC.

> Are there any known issues with recent changes from those that follow cloudpickle dev?

I am technically involved in cloudpickle dev, although less actively. They changed the default pickle protocol (https://github.com/cloudpipe/cloudpickle/pull/127), which I believe was introduced in 0.5.x. So, if we target 0.5.x+, we should double check the potential compatibility issues, or pin the protocol explicitly (a small sketch at the bottom of this mail illustrates what I mean).

2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>:

> Hi All,
>
> I've seen a couple issues lately related to cloudpickle, notably
> https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
> some feedback on updating the version in PySpark which should fix these
> issues and allow us to remove some workarounds. Spark is currently using a
> forked version and it seems like updates are made every now and then when
> needed, but it's not really clear where the current state is and how much
> it has diverged. This makes back-porting fixes difficult. There was a
> previous discussion on moving it to a dependency here
> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
> but given the status right now I think it would be best to do another
> update and bring things closer to upstream before we talk about completely
> moving it outside of Spark. Before starting another update, it might be
> good to discuss the strategy a little. Should the version in Spark be
> derived from a release or at least tied to a specific commit? It would
> also be good if we can document where it has diverged. Are there any known
> issues with recent changes from those that follow cloudpickle dev? Any
> other thoughts or concerns?
>
> Thanks,
> Bryan
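PS: to make the protocol point above a bit more concrete, here is a minimal sketch (not Spark's actual code; the protocol value is only an illustrative assumption) of passing the pickle protocol explicitly to cloudpickle, so serialization does not silently change if the bundled copy's default changes between versions:

    import pickle
    import cloudpickle

    def dumps_pinned(obj, protocol=2):
        # Passing the protocol explicitly means behaviour does not shift
        # under us if the library default changes between versions.
        # protocol=2 here is only an illustrative choice, not a recommendation.
        return cloudpickle.dumps(obj, protocol=protocol)

    payload = dumps_pinned(lambda x: x + 1)
    assert pickle.loads(payload)(1) == 2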