Hi All,

I've seen a couple of issues lately related to cloudpickle, notably
https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
some feedback on updating the version in PySpark, which should fix these
issues and allow us to remove some workarounds.  Spark is currently using a
forked version that gets updated every now and then as needed, but it's not
really clear what state it is in or how far it has diverged, which makes
back-porting fixes difficult.  There was a previous discussion about moving
it to a dependency here
<http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
but given the current status I think it would be best to do another update
and bring things closer to upstream before we talk about moving it outside
of Spark completely.

Before starting another update, it might be good to discuss the strategy a
little.  Should the version in Spark be derived from a release, or at least
tied to a specific commit?  It would also be good to document where it has
diverged.  For those who follow cloudpickle development, are there any known
issues with recent changes?  Any other thoughts or concerns?
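
For anyone less familiar with the context, here is a minimal sketch of the
role cloudpickle plays when PySpark ships a function to executors.  It uses
the upstream cloudpickle package rather than the vendored copy in
pyspark.cloudpickle, and the namedtuple-based function is just a hypothetical
illustration, not a reproduction of the JIRA above:

    import collections
    import pickle
    import cloudpickle

    Point = collections.namedtuple("Point", ["x", "y"])

    def shift(p):
        # References a module-level class; cloudpickle has to capture
        # enough of the definition for the function to be rebuilt remotely.
        return Point(p.x + 1, p.y + 1)

    payload = cloudpickle.dumps(shift)   # roughly what the driver does for a UDF
    restored = pickle.loads(payload)     # roughly what an executor does
    print(restored(Point(1, 2)))         # Point(x=2, y=3)

Bugs in this serialization path are exactly the kind of thing that is hard to
fix today, since it isn't obvious which upstream commit our forked copy
corresponds to.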

Thanks,
Bryan
