Re: Thoughts on Cloudpickle Update

Hyukjin Kwon Thu, 18 Jan 2018 22:28:49 -0800

> Is it an option to match the latest version of cloudpickle and still set
protocol level 2?


IMHO, I think this can be an option but I am not fully sure yet if we
should/could go ahead for it within Spark 2.X. I need some
investigations including things about Pyrolite.

Let's go ahead with matching it to 0.4.2 first. I am quite clear on
matching it to 0.4.2 at least.


> I agree that upgrading to try and match version 0.4.2 would be a good
starting point. Unless no one objects, I will open up a JIRA and try to do
this.

Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.


> Also lets try to keep track in our commit messages which version of
cloudpickle we end up upgrading to.

+1: PR description, commit message or any unit to identify each will be useful.
It should be easier once we have a matched version.



2018-01-19 12:55 GMT+09:00 Holden Karau <[email protected]>:

> So if there are different version of Python on the cluster machines I
> think that's already unsupported so I'm not worried about that.
>
> I'd suggest going to the highest released version since there appear to be
> some useful fixes between 0.4.2 & 0.5.2
>
> Also lets try to keep track in our commit messages which version of
> cloudpickle we end up upgrading to.
>
> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <[email protected]> wrote:
>
>> Thanks for all the details and background Hyukjin! Regarding the pickle
>> protocol change, if I understand correctly, it is currently at level 2 in
>> Spark which is good for backwards compatibility for all of Python 2.
>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>> above, will pick a level determined by your Python version. So is the
>> concern here for Spark if someone has different versions of Python in their
>> cluster, like 3.5 and 3.3, then different protocols will be used and
>> deserialization might fail?  Is it an option to match the latest version of
>> cloudpickle and still set protocol level 2?
>>
>> I agree that upgrading to try and match version 0.4.2 would be a good
>> starting point. Unless no one objects, I will open up a JIRA and try to do
>> this.
>>
>> Thanks,
>> Bryan
>>
>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <[email protected]>
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> Yup, I support to match the version. I pushed it forward before to match
>>> it with https://github.com/cloudpipe/cloudpickle
>>> before few times in Spark's copy and also cloudpickle itself with few
>>> fixes. I believe our copy is closest to 0.4.1.
>>>
>>> I have been trying to follow up the changes in cloudpipe/cloudpickle for
>>> which version we should match, I think we should match
>>> it with 0.4.2 first (I need to double check) because IMHO they have been
>>> adding rather radical changes from 0.5.0, including
>>> pickle protocol change (by default).
>>>
>>> Personally, I would like to match it with the latest because there have
>>> been some important changes. For
>>> example, see this too - https://github.com/cloudpipe
>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but
>>> 0.4.2 should be
>>> a good start point.
>>>
>>> For the strategy, I think we can match it and follow 0.4.x within Spark
>>> for the conservative and safe choice + minimal cost.
>>>
>>>
>>> I tried to leave few explicit answers to the questions from you, Bryan:
>>>
>>> > Spark is currently using a forked version and it seems like updates
>>> are made every now and then when
>>> > needed, but it's not really clear where the current state is and how
>>> much it has diverged.
>>>
>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>
>>>
>>> > Are there any known issues with recent changes from those that follow
>>> cloudpickle dev?
>>>
>>> I am technically involved in cloudpickle dev although less active.
>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
>>> the potential compatibility issue, or fix the protocol, which I believe
>>> is introduced from 0.5.x.
>>>
>>>
>>>
>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <[email protected]>:
>>>
>>>> Hi All,
>>>>
>>>> I've seen a couple issues lately related to cloudpickle, notably
>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to
>>>> get some feedback on updating the version in PySpark which should fix these
>>>> issues and allow us to remove some workarounds.  Spark is currently using a
>>>> forked version and it seems like updates are made every now and then when
>>>> needed, but it's not really clear where the current state is and how much
>>>> it has diverged.  This makes back-porting fixes difficult.  There was a
>>>> previous discussion on moving it to a dependency here
>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>>>> but given the status right now I think it would be best to do another
>>>> update and bring things closer to upstream before we talk about completely
>>>> moving it outside of Spark.  Before starting another update, it might be
>>>> good to discuss the strategy a little.  Should the version in Spark be
>>>> derived from a release or at least tied to a specific commit?  It would
>>>> also be good if we can document where it has diverged.  Are there any known
>>>> issues with recent changes from those that follow cloudpickle dev?  Any
>>>> other thoughts or concerns?
>>>>
>>>> Thanks,
>>>> Bryan
>>>>
>>>
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>

Re: Thoughts on Cloudpickle Update

Reply via email to