Thanks Holden and Hyukjin. I agree, let's start doing the work first and see if it the changes are low risk enough, then we can evaluate how best to proceed. I made https://issues.apache.org/jira/browse/SPARK-23159 and will get started on the update and we can continue to discuss in the PR.
On Fri, Jan 19, 2018 at 1:32 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > Yea, that sounds good to me. > > 2018-01-19 18:29 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>: > >> So it is pretty core, but its one of the better indirectly tested >> components. I think probably the most reasonable path is to see what the >> diff ends up looking like and make a call at that point for if we want it >> to go to master or master & branch-2.3? >> >> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon <gurwls...@gmail.com> >> wrote: >> >>> > So given that it fixes some real world bugs, any particular reason >>> why? Would you be comfortable with doing it in 2.3.1? >>> >>> Ah, I don't feel strongly about this but RC2 will be running on and >>> cloudpickle's quite core fix to PySpark. Just thought we might want to have >>> enough time with it. >>> >>> One worry is, upgrading it includes a fix about namedtuple too where >>> PySpark has a custom fix. >>> I would like to check few things about this. >>> >>> So, yea, it's vague. I wouldn't stay against if you'd prefer. >>> >>> >>> >>> >>> 2018-01-19 16:42 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>: >>> >>>> >>>> >>>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <gurwls...@gmail.com> wrote: >>>> >>>> > Is it an option to match the latest version of cloudpickle and still >>>> set protocol level 2? >>>> >>>> IMHO, I think this can be an option but I am not fully sure yet if we >>>> should/could go ahead for it within Spark 2.X. I need some >>>> investigations including things about Pyrolite. >>>> >>>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on >>>> matching it to 0.4.2 at least. >>>> >>>> So given that there is a follow up on which fixes a regression if we're >>>> not comfortable doing the latest version let's double-check that the >>>> version we do upgrade to doesn't have that regression. >>>> >>>> >>>> >>>> > I agree that upgrading to try and match version 0.4.2 would be a >>>> good starting point. Unless no one objects, I will open up a JIRA and try >>>> to do this. >>>> >>>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear. >>>> >>>> So given that it fixes some real world bugs, any particular reason why? >>>> Would you be comfortable with doing it in 2.3.1? >>>> >>>> >>>> >>>> > Also lets try to keep track in our commit messages which version of >>>> cloudpickle we end up upgrading to. >>>> >>>> +1: PR description, commit message or any unit to identify each will be >>>> useful. >>>> It should be easier once we have a matched version. >>>> >>>> >>>> >>>> 2018-01-19 12:55 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>: >>>> >>>>> So if there are different version of Python on the cluster machines I >>>>> think that's already unsupported so I'm not worried about that. >>>>> >>>>> I'd suggest going to the highest released version since there appear >>>>> to be some useful fixes between 0.4.2 & 0.5.2 >>>>> >>>>> Also lets try to keep track in our commit messages which version of >>>>> cloudpickle we end up upgrading to. >>>>> >>>>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <cutl...@gmail.com> >>>>> wrote: >>>>> >>>>>> Thanks for all the details and background Hyukjin! Regarding the >>>>>> pickle protocol change, if I understand correctly, it is currently at >>>>>> level >>>>>> 2 in Spark which is good for backwards compatibility for all of Python 2. >>>>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and >>>>>> above, will pick a level determined by your Python version. So is the >>>>>> concern here for Spark if someone has different versions of Python in >>>>>> their >>>>>> cluster, like 3.5 and 3.3, then different protocols will be used and >>>>>> deserialization might fail? Is it an option to match the latest version >>>>>> of >>>>>> cloudpickle and still set protocol level 2? >>>>>> >>>>>> I agree that upgrading to try and match version 0.4.2 would be a good >>>>>> starting point. Unless no one objects, I will open up a JIRA and try to >>>>>> do >>>>>> this. >>>>>> >>>>>> Thanks, >>>>>> Bryan >>>>>> >>>>>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <gurwls...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Hi Bryan, >>>>>>> >>>>>>> Yup, I support to match the version. I pushed it forward before to >>>>>>> match it with https://github.com/cloudpipe/cloudpickle >>>>>>> before few times in Spark's copy and also cloudpickle itself with >>>>>>> few fixes. I believe our copy is closest to 0.4.1. >>>>>>> >>>>>>> I have been trying to follow up the changes in cloudpipe/cloudpickle >>>>>>> for which version we should match, I think we should match >>>>>>> it with 0.4.2 first (I need to double check) because IMHO they have >>>>>>> been adding rather radical changes from 0.5.0, including >>>>>>> pickle protocol change (by default). >>>>>>> >>>>>>> Personally, I would like to match it with the latest because there >>>>>>> have been some important changes. For >>>>>>> example, see this too - https://github.com/cloudpipe >>>>>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually >>>>>>> but 0.4.2 should be >>>>>>> a good start point. >>>>>>> >>>>>>> For the strategy, I think we can match it and follow 0.4.x within >>>>>>> Spark for the conservative and safe choice + minimal cost. >>>>>>> >>>>>>> >>>>>>> I tried to leave few explicit answers to the questions from you, >>>>>>> Bryan: >>>>>>> >>>>>>> > Spark is currently using a forked version and it seems like >>>>>>> updates are made every now and then when >>>>>>> > needed, but it's not really clear where the current state is and >>>>>>> how much it has diverged. >>>>>>> >>>>>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC. >>>>>>> >>>>>>> >>>>>>> > Are there any known issues with recent changes from those that >>>>>>> follow cloudpickle dev? >>>>>>> >>>>>>> I am technically involved in cloudpickle dev although less active. >>>>>>> They changed default pickle protocol (https://github.com/cloudpipe/ >>>>>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double >>>>>>> check >>>>>>> the potential compatibility issue, or fix the protocol, which I >>>>>>> believe is introduced from 0.5.x. >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>: >>>>>>> >>>>>>>> Hi All, >>>>>>>> >>>>>>>> I've seen a couple issues lately related to cloudpickle, notably >>>>>>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like >>>>>>>> to get some feedback on updating the version in PySpark which should >>>>>>>> fix >>>>>>>> these issues and allow us to remove some workarounds. Spark is >>>>>>>> currently >>>>>>>> using a forked version and it seems like updates are made every now and >>>>>>>> then when needed, but it's not really clear where the current state is >>>>>>>> and >>>>>>>> how much it has diverged. This makes back-porting fixes difficult. >>>>>>>> There >>>>>>>> was a previous discussion on moving it to a dependency here >>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>, >>>>>>>> but given the status right now I think it would be best to do another >>>>>>>> update and bring things closer to upstream before we talk about >>>>>>>> completely >>>>>>>> moving it outside of Spark. Before starting another update, it might >>>>>>>> be >>>>>>>> good to discuss the strategy a little. Should the version in Spark be >>>>>>>> derived from a release or at least tied to a specific commit? It would >>>>>>>> also be good if we can document where it has diverged. Are there any >>>>>>>> known >>>>>>>> issues with recent changes from those that follow cloudpickle dev? Any >>>>>>>> other thoughts or concerns? >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Bryan >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Twitter: https://twitter.com/holdenkarau >>>>> >>>> >>>> >>>> >>> >> >> >> -- >> Twitter: https://twitter.com/holdenkarau >> > >