Re: Thoughts on Cloudpickle Update

Hyukjin Kwon Fri, 19 Jan 2018 00:31:33 -0800

> So given that it fixes some real world bugs, any particular reason why?
Would you be comfortable with doing it in 2.3.1?


Ah, I don't feel strongly about this but RC2 will be running on and
cloudpickle's quite core fix to PySpark. Just thought we might want to have
enough time with it.

One worry is, upgrading it includes a fix about namedtuple too where
PySpark has a custom fix.
I would like to check few things about this.

So, yea, it's vague. I wouldn't stay against if you'd prefer.




2018-01-19 16:42 GMT+09:00 Holden Karau <[email protected]>:

>
>
> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <[email protected]> wrote:
>
> > Is it an option to match the latest version of cloudpickle and still
> set protocol level 2?
>
> IMHO, I think this can be an option but I am not fully sure yet if we
> should/could go ahead for it within Spark 2.X. I need some
> investigations including things about Pyrolite.
>
> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
> matching it to 0.4.2 at least.
>
> So given that there is a follow up on which fixes a regression if we're
> not comfortable doing the latest version let's double-check that the
> version we do upgrade to doesn't have that regression.
>
>
>
> > I agree that upgrading to try and match version 0.4.2 would be a good
> starting point. Unless no one objects, I will open up a JIRA and try to do
> this.
>
> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>
> So given that it fixes some real world bugs, any particular reason why?
> Would you be comfortable with doing it in 2.3.1?
>
>
>
> > Also lets try to keep track in our commit messages which version of
> cloudpickle we end up upgrading to.
>
> +1: PR description, commit message or any unit to identify each will be 
> useful.
> It should be easier once we have a matched version.
>
>
>
> 2018-01-19 12:55 GMT+09:00 Holden Karau <[email protected]>:
>
>> So if there are different version of Python on the cluster machines I
>> think that's already unsupported so I'm not worried about that.
>>
>> I'd suggest going to the highest released version since there appear to
>> be some useful fixes between 0.4.2 & 0.5.2
>>
>> Also lets try to keep track in our commit messages which version of
>> cloudpickle we end up upgrading to.
>>
>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <[email protected]> wrote:
>>
>>> Thanks for all the details and background Hyukjin! Regarding the pickle
>>> protocol change, if I understand correctly, it is currently at level 2 in
>>> Spark which is good for backwards compatibility for all of Python 2.
>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>>> above, will pick a level determined by your Python version. So is the
>>> concern here for Spark if someone has different versions of Python in their
>>> cluster, like 3.5 and 3.3, then different protocols will be used and
>>> deserialization might fail?  Is it an option to match the latest version of
>>> cloudpickle and still set protocol level 2?
>>>
>>> I agree that upgrading to try and match version 0.4.2 would be a good
>>> starting point. Unless no one objects, I will open up a JIRA and try to do
>>> this.
>>>
>>> Thanks,
>>> Bryan
>>>
>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <[email protected]>
>>> wrote:
>>>
>>>> Hi Bryan,
>>>>
>>>> Yup, I support to match the version. I pushed it forward before to
>>>> match it with https://github.com/cloudpipe/cloudpickle
>>>> before few times in Spark's copy and also cloudpickle itself with few
>>>> fixes. I believe our copy is closest to 0.4.1.
>>>>
>>>> I have been trying to follow up the changes in cloudpipe/cloudpickle
>>>> for which version we should match, I think we should match
>>>> it with 0.4.2 first (I need to double check) because IMHO they have
>>>> been adding rather radical changes from 0.5.0, including
>>>> pickle protocol change (by default).
>>>>
>>>> Personally, I would like to match it with the latest because there have
>>>> been some important changes. For
>>>> example, see this too - https://github.com/cloudpipe
>>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but
>>>> 0.4.2 should be
>>>> a good start point.
>>>>
>>>> For the strategy, I think we can match it and follow 0.4.x within Spark
>>>> for the conservative and safe choice + minimal cost.
>>>>
>>>>
>>>> I tried to leave few explicit answers to the questions from you, Bryan:
>>>>
>>>> > Spark is currently using a forked version and it seems like updates
>>>> are made every now and then when
>>>> > needed, but it's not really clear where the current state is and how
>>>> much it has diverged.
>>>>
>>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>>
>>>>
>>>> > Are there any known issues with recent changes from those that follow
>>>> cloudpickle dev?
>>>>
>>>> I am technically involved in cloudpickle dev although less active.
>>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
>>>> the potential compatibility issue, or fix the protocol, which I believe
>>>> is introduced from 0.5.x.
>>>>
>>>>
>>>>
>>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <[email protected]>:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I've seen a couple issues lately related to cloudpickle, notably
>>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to
>>>>> get some feedback on updating the version in PySpark which should fix 
>>>>> these
>>>>> issues and allow us to remove some workarounds.  Spark is currently using 
>>>>> a
>>>>> forked version and it seems like updates are made every now and then when
>>>>> needed, but it's not really clear where the current state is and how much
>>>>> it has diverged.  This makes back-porting fixes difficult.  There was a
>>>>> previous discussion on moving it to a dependency here
>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>>>>> but given the status right now I think it would be best to do another
>>>>> update and bring things closer to upstream before we talk about completely
>>>>> moving it outside of Spark.  Before starting another update, it might be
>>>>> good to discuss the strategy a little.  Should the version in Spark be
>>>>> derived from a release or at least tied to a specific commit?  It would
>>>>> also be good if we can document where it has diverged.  Are there any 
>>>>> known
>>>>> issues with recent changes from those that follow cloudpickle dev?  Any
>>>>> other thoughts or concerns?
>>>>>
>>>>> Thanks,
>>>>> Bryan
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>

Re: Thoughts on Cloudpickle Update

Reply via email to