Re: Thoughts on Cloudpickle Update

Hyukjin Kwon Fri, 19 Jan 2018 01:33:08 -0800

Yea, that sounds good to me.

2018-01-19 18:29 GMT+09:00 Holden Karau <[email protected]>:


> So it is pretty core, but its one of the better indirectly tested
> components. I think probably the most reasonable path is to see what the
> diff ends up looking like and make a call at that point for if we want it
> to go to master or master & branch-2.3?
>
> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon <[email protected]>
> wrote:
>
>> > So given that it fixes some real world bugs, any particular reason
>> why? Would you be comfortable with doing it in 2.3.1?
>>
>> Ah, I don't feel strongly about this but RC2 will be running on and
>> cloudpickle's quite core fix to PySpark. Just thought we might want to have
>> enough time with it.
>>
>> One worry is, upgrading it includes a fix about namedtuple too where
>> PySpark has a custom fix.
>> I would like to check few things about this.
>>
>> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>>
>>
>>
>>
>> 2018-01-19 16:42 GMT+09:00 Holden Karau <[email protected]>:
>>
>>>
>>>
>>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <[email protected]> wrote:
>>>
>>> > Is it an option to match the latest version of cloudpickle and still
>>> set protocol level 2?
>>>
>>> IMHO, I think this can be an option but I am not fully sure yet if we
>>> should/could go ahead for it within Spark 2.X. I need some
>>> investigations including things about Pyrolite.
>>>
>>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
>>> matching it to 0.4.2 at least.
>>>
>>> So given that there is a follow up on which fixes a regression if we're
>>> not comfortable doing the latest version let's double-check that the
>>> version we do upgrade to doesn't have that regression.
>>>
>>>
>>>
>>> > I agree that upgrading to try and match version 0.4.2 would be a good
>>> starting point. Unless no one objects, I will open up a JIRA and try to do
>>> this.
>>>
>>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>>>
>>> So given that it fixes some real world bugs, any particular reason why?
>>> Would you be comfortable with doing it in 2.3.1?
>>>
>>>
>>>
>>> > Also lets try to keep track in our commit messages which version of
>>> cloudpickle we end up upgrading to.
>>>
>>> +1: PR description, commit message or any unit to identify each will be 
>>> useful.
>>> It should be easier once we have a matched version.
>>>
>>>
>>>
>>> 2018-01-19 12:55 GMT+09:00 Holden Karau <[email protected]>:
>>>
>>>> So if there are different version of Python on the cluster machines I
>>>> think that's already unsupported so I'm not worried about that.
>>>>
>>>> I'd suggest going to the highest released version since there appear to
>>>> be some useful fixes between 0.4.2 & 0.5.2
>>>>
>>>> Also lets try to keep track in our commit messages which version of
>>>> cloudpickle we end up upgrading to.
>>>>
>>>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for all the details and background Hyukjin! Regarding the
>>>>> pickle protocol change, if I understand correctly, it is currently at 
>>>>> level
>>>>> 2 in Spark which is good for backwards compatibility for all of Python 2.
>>>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>>>>> above, will pick a level determined by your Python version. So is the
>>>>> concern here for Spark if someone has different versions of Python in 
>>>>> their
>>>>> cluster, like 3.5 and 3.3, then different protocols will be used and
>>>>> deserialization might fail?  Is it an option to match the latest version 
>>>>> of
>>>>> cloudpickle and still set protocol level 2?
>>>>>
>>>>> I agree that upgrading to try and match version 0.4.2 would be a good
>>>>> starting point. Unless no one objects, I will open up a JIRA and try to do
>>>>> this.
>>>>>
>>>>> Thanks,
>>>>> Bryan
>>>>>
>>>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Bryan,
>>>>>>
>>>>>> Yup, I support to match the version. I pushed it forward before to
>>>>>> match it with https://github.com/cloudpipe/cloudpickle
>>>>>> before few times in Spark's copy and also cloudpickle itself with few
>>>>>> fixes. I believe our copy is closest to 0.4.1.
>>>>>>
>>>>>> I have been trying to follow up the changes in cloudpipe/cloudpickle
>>>>>> for which version we should match, I think we should match
>>>>>> it with 0.4.2 first (I need to double check) because IMHO they have
>>>>>> been adding rather radical changes from 0.5.0, including
>>>>>> pickle protocol change (by default).
>>>>>>
>>>>>> Personally, I would like to match it with the latest because there
>>>>>> have been some important changes. For
>>>>>> example, see this too - https://github.com/cloudpipe
>>>>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually
>>>>>> but 0.4.2 should be
>>>>>> a good start point.
>>>>>>
>>>>>> For the strategy, I think we can match it and follow 0.4.x within
>>>>>> Spark for the conservative and safe choice + minimal cost.
>>>>>>
>>>>>>
>>>>>> I tried to leave few explicit answers to the questions from you,
>>>>>> Bryan:
>>>>>>
>>>>>> > Spark is currently using a forked version and it seems like
>>>>>> updates are made every now and then when
>>>>>> > needed, but it's not really clear where the current state is and
>>>>>> how much it has diverged.
>>>>>>
>>>>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>>>>
>>>>>>
>>>>>> > Are there any known issues with recent changes from those that
>>>>>> follow cloudpickle dev?
>>>>>>
>>>>>> I am technically involved in cloudpickle dev although less active.
>>>>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>>>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double
>>>>>> check
>>>>>> the potential compatibility issue, or fix the protocol, which I
>>>>>> believe is introduced from 0.5.x.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <[email protected]>:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I've seen a couple issues lately related to cloudpickle, notably
>>>>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like
>>>>>>> to get some feedback on updating the version in PySpark which should fix
>>>>>>> these issues and allow us to remove some workarounds.  Spark is 
>>>>>>> currently
>>>>>>> using a forked version and it seems like updates are made every now and
>>>>>>> then when needed, but it's not really clear where the current state is 
>>>>>>> and
>>>>>>> how much it has diverged.  This makes back-porting fixes difficult.  
>>>>>>> There
>>>>>>> was a previous discussion on moving it to a dependency here
>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>>>>>>> but given the status right now I think it would be best to do another
>>>>>>> update and bring things closer to upstream before we talk about 
>>>>>>> completely
>>>>>>> moving it outside of Spark.  Before starting another update, it might be
>>>>>>> good to discuss the strategy a little.  Should the version in Spark be
>>>>>>> derived from a release or at least tied to a specific commit?  It would
>>>>>>> also be good if we can document where it has diverged.  Are there any 
>>>>>>> known
>>>>>>> issues with recent changes from those that follow cloudpickle dev?  Any
>>>>>>> other thoughts or concerns?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bryan
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>

Re: Thoughts on Cloudpickle Update

Reply via email to