Thanks for all the details and background, Hyukjin! Regarding the pickle
protocol change: if I understand correctly, Spark currently uses protocol
level 2, which is backwards compatible across all of Python 2. Choosing
HIGHEST_PROTOCOL, the default for cloudpickle 0.5.0 and above, picks a
level determined by the Python version doing the pickling. So is the
concern for Spark that if someone has different versions of Python in their
cluster, like 3.5 and 3.3, then different protocols will be used and
deserialization might fail?  Is it an option to match the latest version of
cloudpickle and still set protocol level 2?
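To illustrate what I mean (using the stdlib pickle module here rather than cloudpickle itself, just as a sketch): the protocol level is baked into the serialized bytes, so bytes produced with HIGHEST_PROTOCOL on a newer interpreter may not load on an older one, while protocol 2 stays readable everywhere.

```python
import pickle

# Protocol 2 is understood by every Python 2.3+ interpreter.
# pickle.HIGHEST_PROTOCOL depends on the interpreter doing the
# pickling (e.g. 4 on Python 3.4+), so a mixed-version cluster
# can produce bytes an older worker cannot deserialize.
data = {"a": [1, 2, 3]}

compat = pickle.dumps(data, protocol=2)
latest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Both round-trip on the interpreter that produced them.
assert pickle.loads(compat) == data
assert pickle.loads(latest) == data

# Protocol 2 streams start with the PROTO opcode b'\x80' followed
# by the protocol number 2, regardless of interpreter version.
assert compat[:2] == b"\x80\x02"
```

If I read the cloudpickle API correctly, pinning the level would just mean passing `protocol=2` through to its dumps call, but that's an assumption worth verifying against whichever release we match.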

I agree that upgrading to match version 0.4.2 would be a good starting
point. If no one objects, I will open a JIRA and try to do this.

Thanks,
Bryan

On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Bryan,
>
> Yup, I support matching the version. I have pushed this forward a few
> times before, both in Spark's copy (to match
> https://github.com/cloudpipe/cloudpickle) and in cloudpickle itself with
> a few fixes. I believe our copy is closest to 0.4.1.
>
> I have been trying to follow the changes in cloudpipe/cloudpickle to
> decide which version we should match. I think we should match 0.4.2
> first (I need to double check), because IMHO they have been making
> rather radical changes since 0.5.0, including the pickle protocol
> change (by default).
>
> Personally, I would like to match the latest eventually because there
> have been some important changes. For example, see
> https://github.com/cloudpipe/cloudpickle/pull/138 (still pending
> review). But 0.4.2 should be a good starting point.
>
> As for strategy, I think we can match and then follow the 0.4.x line
> within Spark as the conservative, safe choice with minimal cost.
>
>
> I tried to leave a few explicit answers to your questions, Bryan:
>
> > Spark is currently using a forked version and it seems like updates are
> made every now and then when
> > needed, but it's not really clear where the current state is and how
> much it has diverged.
>
> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>
>
> > Are there any known issues with recent changes from those that follow
> cloudpickle dev?
>
> I am involved in cloudpickle dev, although less actively these days.
> They changed the default pickle protocol
> (https://github.com/cloudpipe/cloudpickle/pull/127), which I believe
> was introduced in 0.5.x. So if we target 0.5.x+, we should double-check
> the potential compatibility issues, or pin the protocol explicitly.
>
>
>
> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>:
>
>> Hi All,
>>
>> I've seen a couple of issues lately related to cloudpickle, notably
>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
>> some feedback on updating the version in PySpark, which should fix these
>> issues and allow us to remove some workarounds.  Spark is currently using a
>> forked version and it seems like updates are made every now and then when
>> needed, but it's not really clear where the current state is and how much
>> it has diverged.  This makes back-porting fixes difficult.  There was a
>> previous discussion on moving it to a dependency here
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>> but given the status right now I think it would be best to do another
>> update and bring things closer to upstream before we talk about completely
>> moving it outside of Spark.  Before starting another update, it might be
>> good to discuss the strategy a little.  Should the version in Spark be
>> derived from a release or at least tied to a specific commit?  It would
>> also be good if we can document where it has diverged.  Are there any known
>> issues with recent changes from those that follow cloudpickle dev?  Any
>> other thoughts or concerns?
>>
>> Thanks,
>> Bryan
>>
>
>
