Hi Bryan,

Yup, I support matching the version. I have pushed this forward a few times before, syncing Spark's copy with https://github.com/cloudpipe/cloudpickle and also contributing a few fixes to cloudpickle itself. I believe our copy is closest to 0.4.1.
As for which version we should match, I have been trying to follow the changes in cloudpipe/cloudpickle, and I think we should match 0.4.2 first (I need to double check), because IMHO they have been making rather radical changes since 0.5.0, including a change to the default pickle protocol. Personally, I would eventually like to match the latest, because there have been some important changes - for example, see https://github.com/cloudpipe/cloudpickle/pull/138 (still pending review) - but 0.4.2 should be a good starting point. For the strategy, I think we can match it and then follow the 0.4.x line within Spark, as the conservative and safe choice with minimal cost.

I tried to leave a few explicit answers to your questions, Bryan:

> Spark is currently using a forked version and it seems like updates are made every now and then when
> needed, but it's not really clear where the current state is and how much it has diverged.

I am quite sure our cloudpickle copy is closest to 0.4.1, IIRC.

> Are there any known issues with recent changes from those that follow cloudpickle dev?

I am technically involved in cloudpickle dev, although less actively. They changed the default pickle protocol (https://github.com/cloudpipe/cloudpickle/pull/127), which I believe was introduced in 0.5.x. So, if we target 0.5.x+, we should double check the potential compatibility issues, or pin the protocol explicitly (a small sketch at the bottom of this mail illustrates what I mean).

2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>:

> Hi All,
>
> I've seen a couple issues lately related to cloudpickle, notably
> https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
> some feedback on updating the version in PySpark which should fix these
> issues and allow us to remove some workarounds. Spark is currently using a
> forked version and it seems like updates are made every now and then when
> needed, but it's not really clear where the current state is and how much
> it has diverged. This makes back-porting fixes difficult. There was a
> previous discussion on moving it to a dependency here
> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
> but given the status right now I think it would be best to do another
> update and bring things closer to upstream before we talk about completely
> moving it outside of Spark. Before starting another update, it might be
> good to discuss the strategy a little. Should the version in Spark be
> derived from a release or at least tied to a specific commit? It would
> also be good if we can document where it has diverged. Are there any known
> issues with recent changes from those that follow cloudpickle dev? Any
> other thoughts or concerns?
>
> Thanks,
> Bryan
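PS: to make the protocol point above a bit more concrete, here is a minimal sketch (not Spark's actual code; the protocol value is only an illustrative assumption) of passing the pickle protocol explicitly to cloudpickle, so serialization does not silently change if the bundled copy's default changes between versions:

    import pickle
    import cloudpickle

    def dumps_pinned(obj, protocol=2):
        # Passing the protocol explicitly means behaviour does not shift
        # under us if the library default changes between versions.
        # protocol=2 here is only an illustrative choice, not a recommendation.
        return cloudpickle.dumps(obj, protocol=protocol)

    payload = dumps_pinned(lambda x: x + 1)
    assert pickle.loads(payload)(1) == 2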