> Is it an option to match the latest version of cloudpickle and still set protocol level 2?
IMHO, I think this can be an option but I am not fully sure yet if we should/could go ahead for it within Spark 2.X. I need some investigations including things about Pyrolite. Let's go ahead with matching it to 0.4.2 first. I am quite clear on matching it to 0.4.2 at least. > I agree that upgrading to try and match version 0.4.2 would be a good starting point. Unless no one objects, I will open up a JIRA and try to do this. Yup but I think we shouldn't make this into Spark 2.3.0 to be clear. > Also lets try to keep track in our commit messages which version of cloudpickle we end up upgrading to. +1: PR description, commit message or any unit to identify each will be useful. It should be easier once we have a matched version. 2018-01-19 12:55 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>: > So if there are different version of Python on the cluster machines I > think that's already unsupported so I'm not worried about that. > > I'd suggest going to the highest released version since there appear to be > some useful fixes between 0.4.2 & 0.5.2 > > Also lets try to keep track in our commit messages which version of > cloudpickle we end up upgrading to. > > On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <cutl...@gmail.com> wrote: > >> Thanks for all the details and background Hyukjin! Regarding the pickle >> protocol change, if I understand correctly, it is currently at level 2 in >> Spark which is good for backwards compatibility for all of Python 2. >> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and >> above, will pick a level determined by your Python version. So is the >> concern here for Spark if someone has different versions of Python in their >> cluster, like 3.5 and 3.3, then different protocols will be used and >> deserialization might fail? Is it an option to match the latest version of >> cloudpickle and still set protocol level 2? >> >> I agree that upgrading to try and match version 0.4.2 would be a good >> starting point. Unless no one objects, I will open up a JIRA and try to do >> this. >> >> Thanks, >> Bryan >> >> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <gurwls...@gmail.com> >> wrote: >> >>> Hi Bryan, >>> >>> Yup, I support to match the version. I pushed it forward before to match >>> it with https://github.com/cloudpipe/cloudpickle >>> before few times in Spark's copy and also cloudpickle itself with few >>> fixes. I believe our copy is closest to 0.4.1. >>> >>> I have been trying to follow up the changes in cloudpipe/cloudpickle for >>> which version we should match, I think we should match >>> it with 0.4.2 first (I need to double check) because IMHO they have been >>> adding rather radical changes from 0.5.0, including >>> pickle protocol change (by default). >>> >>> Personally, I would like to match it with the latest because there have >>> been some important changes. For >>> example, see this too - https://github.com/cloudpipe >>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but >>> 0.4.2 should be >>> a good start point. >>> >>> For the strategy, I think we can match it and follow 0.4.x within Spark >>> for the conservative and safe choice + minimal cost. >>> >>> >>> I tried to leave few explicit answers to the questions from you, Bryan: >>> >>> > Spark is currently using a forked version and it seems like updates >>> are made every now and then when >>> > needed, but it's not really clear where the current state is and how >>> much it has diverged. >>> >>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC. >>> >>> >>> > Are there any known issues with recent changes from those that follow >>> cloudpickle dev? >>> >>> I am technically involved in cloudpickle dev although less active. >>> They changed default pickle protocol (https://github.com/cloudpipe/ >>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check >>> the potential compatibility issue, or fix the protocol, which I believe >>> is introduced from 0.5.x. >>> >>> >>> >>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>: >>> >>>> Hi All, >>>> >>>> I've seen a couple issues lately related to cloudpickle, notably >>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to >>>> get some feedback on updating the version in PySpark which should fix these >>>> issues and allow us to remove some workarounds. Spark is currently using a >>>> forked version and it seems like updates are made every now and then when >>>> needed, but it's not really clear where the current state is and how much >>>> it has diverged. This makes back-porting fixes difficult. There was a >>>> previous discussion on moving it to a dependency here >>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>, >>>> but given the status right now I think it would be best to do another >>>> update and bring things closer to upstream before we talk about completely >>>> moving it outside of Spark. Before starting another update, it might be >>>> good to discuss the strategy a little. Should the version in Spark be >>>> derived from a release or at least tied to a specific commit? It would >>>> also be good if we can document where it has diverged. Are there any known >>>> issues with recent changes from those that follow cloudpickle dev? Any >>>> other thoughts or concerns? >>>> >>>> Thanks, >>>> Bryan >>>> >>> >>> >> > > > -- > Twitter: https://twitter.com/holdenkarau >