Do Spark and Hive use the same SQL parser: ANTLR?

2018-01-18 Thread Pralabh Kumar
Hi


Do Hive and Spark use the same SQL parser provided by ANTLR? Do they
generate the same logical plan?

Please help on the same.


Regards
Pralabh Kumar
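
For readers who want to check what Spark's own parser and analyzer produce for
a given query, here is a minimal PySpark sketch (standard Spark APIs only; it
does not show Hive's side). As far as I can tell, Spark 2.x ships its own
ANTLR4 grammar (SqlBase.g4) under sql/catalyst rather than reusing Hive's
parser, so the output below is Spark's Catalyst plan:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("plan-inspect")
         .getOrCreate())

# Register a small temp view so the SQL below has something to resolve against.
spark.range(10).createOrReplaceTempView("t")

# explain(True) prints the parsed, analyzed, and optimized logical plans plus
# the physical plan, i.e. exactly what Spark's ANTLR-based parser and Catalyst
# produced for this statement.
spark.sql("SELECT id % 3 AS k, count(*) AS c FROM t GROUP BY id % 3").explain(True)

spark.stop()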


Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-18 Thread shane knapp
this doesn't have anything to do w/ the git timeouts...  those time out
the build 10 mins after starting (failing on the initial fetch call).

On Wed, Jan 17, 2018 at 9:51 PM, Sameer Agarwal  wrote:

> FYI, I ended up bumping the build timeouts from 255 to 275 minutes. All
> successful 2.3 (hadoop-2.7) builds last week were already taking 245-250
> mins and had started timing out earlier today (towards the very end, while
> making consistent progress throughout). Increasing the timeout resolves
> the issue.
>
> NB: This might be either due to additional tests that were recently added
> or due to the git delays that Shane reported; we haven't investigated the
> root cause yet.
>
> On 12 January 2018 at 16:37, Dongjoon Hyun 
> wrote:
>
>> For this issue, during SPARK-23028, Shane shared that the server limit is
>> already higher.
>>
>> 1. Xiao Li increased the timeout of Spark test script for `master` branch
>> first in the following commit.
>>
>> [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT
>> 
>>
>> 2. Marco Gaido reported a flaky test suite, and it turned out that the
>> test suite hangs; see SPARK-23055.
>> 
>>
>> 3. Sameer Agarwal swiftly reverted it.
>>
>> Thank you all!
>>
>> Let's wait and see the dashboard.
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Fri, Jan 12, 2018 at 3:22 PM, Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>>> FYI, we reverted a commit in
>>> https://github.com/apache/spark/commit/55dbfbca37ce4c05f83180777ba3d4fe2d96a02e
>>> to fix the issue.
>>>
>>> On Fri, Jan 12, 2018 at 11:45 AM, Xin Lu  wrote:
>>>
 seems like someone should investigate what caused the build time to go
 up an hour and if it's expected or not.

 On Thu, Jan 11, 2018 at 7:37 PM, Dongjoon Hyun wrote:

> Hi, All and Shane.
>
> Can we increase the build time for `branch-2.3` during 2.3 RC period?
>
> There are two known test issues, but the Jenkins on branch-2.3 with
> hadoop-2.7 fails with build timeout. So, it's difficult to monitor whether
> the branch is healthy or not.
>
> Build timed out (after 255 minutes). Marking the build as aborted.
> Build was aborted
> ...
> Finished: ABORTED
>
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console
>
> Bests,
> Dongjoon.
>


>>>
>>
>


Re: [build system] currently experiencing git timeouts when building

2018-01-18 Thread shane knapp
quick update:

it looks like the timeouts have stopped.  github finally got back to me
about this, but (again) after they stopped happening.

i'll be keeping an eye on this for the next few days and will re-escalate
if we start having them again.

shane

On Tue, Jan 16, 2018 at 1:18 PM, shane knapp  wrote:

> update:  we're seeing about 3% of builds timing out, but today is (so far)
> a bad one:
>
> i've reached out to github about this and am waiting to hear back.
>
> $ get_timeouts.py 10
> Timeouts by project:
>   17  spark-branch-2.3-lint
>   10  spark-branch-2.3-test-maven-hadoop-2.6
>    7  spark-branch-2.3-compile-maven-hadoop-2.7
>   41  spark-master-lint
>   23  spark-master-compile-maven-hadoop-2.6
>   32  spark-master-compile-maven-hadoop-2.7
>    5  spark-branch-2.2-compile-maven-hadoop-2.6
>    3  spark-branch-2.2-test-maven-hadoop-2.7
>    6  spark-branch-2.2-compile-maven-scala-2.10
>    6  spark-master-test-maven-hadoop-2.6
>    3  spark-master-test-maven-hadoop-2.7
>   14  spark-branch-2.3-compile-maven-hadoop-2.6
>    3  spark-branch-2.2-lint
>    1  spark-branch-2.3-test-maven-hadoop-2.7
>
> Timeouts by day:
> 2018-01-09   4
> 2018-01-10  13
> 2018-01-11  27
> 2018-01-12  74
> 2018-01-13   9
> 2018-01-14   2
> 2018-01-15   8
> 2018-01-16  34
>
> Total builds: 4112
> Total timeouts: 171
> Percentage of all builds timing out: 4.15856031128
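
(For context on how totals like these are derived: get_timeouts.py itself
isn't posted on the list, but a minimal sketch of the same kind of
aggregation, over hypothetical (job, date, timed_out) build records, would
look roughly like this.)

from collections import Counter

# Hypothetical build records: (jenkins_job, date, timed_out); the real
# script presumably pulls these from the Jenkins API or build logs.
builds = [
    ("spark-master-lint", "2018-01-12", True),
    ("spark-master-lint", "2018-01-12", False),
    ("spark-branch-2.3-test-maven-hadoop-2.7", "2018-01-16", True),
    # ... one tuple per build in the reporting window ...
]

timeouts_by_project = Counter(job for job, _, timed_out in builds if timed_out)
timeouts_by_day = Counter(day for _, day, timed_out in builds if timed_out)
total_timeouts = sum(timeouts_by_day.values())

print("Timeouts by project:")
for job, n in timeouts_by_project.most_common():
    print("  %3d  %s" % (n, job))
print("Timeouts by day:")
for day, n in sorted(timeouts_by_day.items()):
    print("  %s  %2d" % (day, n))
print("Total builds: %d" % len(builds))
print("Total timeouts: %d" % total_timeouts)
print("Percentage of all builds timing out: %s"
      % (100.0 * total_timeouts / len(builds)))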
>
> On Wed, Jan 10, 2018 at 9:54 AM, shane knapp  wrote:
>
>> i just noticed we're starting to see the once-yearly rash of git timeouts
>> when building.
>>
>> i'll be looking into this today...  i'm at our lab retreat, so my
>> attention will be divided during the day but i will report back here once i
>> have some more information.
>>
>> in the meantime, if your jobs have a git timeout, please just retrigger
>> them and we will hope for the best.
>>
>> shane
>>
>
>


Re: [VOTE] Spark 2.3.0 (RC1)

2018-01-18 Thread Sameer Agarwal
This vote has failed in favor of a new RC. I'll follow up with a new RC2 as
soon as the 3 remaining test/UI blockers  are
resolved.

On 17 January 2018 at 16:38, Sameer Agarwal  wrote:

> Thanks, will do!
>
> On 16 January 2018 at 22:09, Holden Karau  wrote:
>
>> So looking at http://pgp.mit.edu/pks/lookup?op=vindex&search=0xA1CEDBA8AD0C022A
>> it seems like Sameer's key isn't in the Apache web of trust
>> yet. This shouldn't block RC process but before we publish it's important
>> to get the key in the Apache web of trust.
>>
>> On Tue, Jan 16, 2018 at 3:00 PM, Sameer Agarwal 
>> wrote:
>>
>>> Yes, I'll cut an RC2 as soon as the remaining blockers are resolved. In
>>> the meantime, please continue to report any other issues here.
>>>
>>> Here's a quick update on progress towards the next RC:
>>>
>>> - SPARK-22908 (KafkaContinuousSourceSuite) has been reverted
>>> - SPARK-23051 (Spark UI), SPARK-23063 (k8s packaging) and SPARK-23065
>>> (R API docs) have all been resolved
>>> - A fix for SPARK-23020 (SparkLauncherSuite) has been merged. We're
>>> monitoring the builds to make sure that the flakiness has been resolved.
>>>
>>>
>>>
>>> On 16 January 2018 at 13:21, Ted Yu  wrote:
>>>
 Is there going to be another RC ?

 With KafkaContinuousSourceSuite hanging, it is hard to get the rest of
 the tests going.

 Cheers

 On Sat, Jan 13, 2018 at 7:29 AM, Sean Owen  wrote:

> The signatures and licenses look OK. Except for the missing k8s
> package, the contents look OK. Tests look pretty good with "-Phive
> -Phadoop-2.7 -Pyarn" on Ubuntu 17.10, except that 
> KafkaContinuousSourceSuite
> seems to hang forever. That was just fixed and needs to get into an RC?
>
> Aside from the Blockers just filed for R docs, etc., we have:
>
> Blocker:
> SPARK-23000 Flaky test suite DataSourceWithHiveMetastoreCatalogSuite
> in Spark 2.3
> SPARK-23020 Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> SPARK-23051 job description in Spark UI is broken
>
> Critical:
> SPARK-22739 Additional Expression Support for Objects
>
> I actually don't think any of those Blockers should be Blockers; not
> sure if the last one is really critical either.
>
> I think this release will have to be re-rolled so I'd say -1 to RC1.
>
> On Fri, Jan 12, 2018 at 4:42 PM Sameer Agarwal 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark
>> version 2.3.0. The vote is open until Thursday January 18, 2018 at 
>> 8:00:00
>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>
>>
>> [ ] +1 Release this package as Apache Spark 2.3.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see
>> https://spark.apache.org/
>>
>> The tag to be voted on is v2.3.0-rc1:
>> https://github.com/apache/spark/tree/v2.3.0-rc1 (964cc2e31b2862bca0bd968b3e9e2cbf8d3ba5ea)
>>
>> List of JIRA tickets resolved in this release can be found here:
>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>
>> The release files, including signatures, digests, etc. can be found
>> at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1261/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc1-docs/_site/index.html
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks; in Java/Scala you
>> can add the staging repository to your project's resolvers and test with
>> the RC (make sure to clean up the artifact cache before/after so you
>> don't end up building with an out-of-date RC going forward).
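
(As an illustration of that PySpark check: assuming you've pip-installed the
RC into a virtualenv, a minimal smoke test could look like the sketch below;
the query and assertion are arbitrary examples, not part of the release
process.)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("rc-smoke-test")
         .getOrCreate())

print(spark.version)  # should report 2.3.0 for this RC

# Run a small job end-to-end to catch obvious regressions.
df = spark.range(100).selectExpr("id", "id % 10 AS bucket")
assert df.groupBy("bucket").count().count() == 10

spark.stop()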
>>
>> ===
>> What should happen to JIRA ti

Re: Thoughts on Cloudpickle Update

2018-01-18 Thread Bryan Cutler
Thanks for all the details and background Hyukjin! Regarding the pickle
protocol change, if I understand correctly, it is currently at level 2 in
Spark which is good for backwards compatibility for all of Python 2.
Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
above, will pick a level determined by your Python version. So is the
concern here for Spark that if someone has different versions of Python in their
cluster, like 3.5 and 3.3, then different protocols will be used and
deserialization might fail?  Is it an option to match the latest version of
cloudpickle and still set protocol level 2?
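
To make the protocol concern concrete, here is a small standalone sketch with
plain pickle (not Spark's serializers); the version-to-protocol mapping in the
comments reflects CPython as of this thread.

import pickle

payload = {"answer": 42, "items": list(range(3))}

# Protocol 2 has existed since Python 2.3 and is readable by every Python
# 2.3+ and Python 3 interpreter, which is why pinning it keeps a
# mixed-version cluster compatible.
pinned = pickle.dumps(payload, protocol=2)

# HIGHEST_PROTOCOL depends on the interpreter doing the pickling: 2 on
# Python 2, 3 on Python 3.0-3.3, 4 on Python 3.4+.  Bytes written with
# protocol 4 by a 3.5 driver cannot be loaded by a 3.3 worker.
latest = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# For protocol >= 2 the stream starts with the PROTO opcode followed by the
# protocol number, so the choice is visible in the first two bytes.
print(pinned[:2], latest[:2])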

I agree that upgrading to try and match version 0.4.2 would be a good
starting point. If no one objects, I will open up a JIRA and try to do
this.

Thanks,
Bryan

On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon  wrote:

> Hi Bryan,
>
> Yup, I support to match the version. I pushed it forward before to match
> it with https://github.com/cloudpipe/cloudpickle
> before few times in Spark's copy and also cloudpickle itself with few
> fixes. I believe our copy is closest to 0.4.1.
>
> I have been trying to follow up the changes in cloudpipe/cloudpickle for
> which version we should match, I think we should match
> it with 0.4.2 first (I need to double check) because IMHO they have been
> adding rather radical changes from 0.5.0, including
> pickle protocol change (by default).
>
> Personally, I would like to match it with the latest because there have
> been some important changes. For
> example, see this too - https://github.com/cloudpipe/cloudpickle/pull/138
> (it's pending for reviewing yet) eventually but 0.4.2 should be
> a good start point.
>
> For the strategy, I think we can match it and follow 0.4.x within Spark
> for the conservative and safe choice + minimal cost.
>
>
> I tried to leave few explicit answers to the questions from you, Bryan:
>
> > Spark is currently using a forked version and it seems like updates are
> made every now and then when
> > needed, but it's not really clear where the current state is and how
> much it has diverged.
>
> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>
>
> > Are there any known issues with recent changes from those that follow
> cloudpickle dev?
>
> I am technically involved in cloudpickle dev although less active.
> They changed default pickle protocol
> (https://github.com/cloudpipe/cloudpickle/pull/127). So, if we target
> 0.5.x+, we should double check
> the potential compatibility issue, or fix the protocol, which I believe is
> introduced from 0.5.x.
>
>
>
> 2018-01-16 11:43 GMT+09:00 Bryan Cutler :
>
>> Hi All,
>>
>> I've seen a couple issues lately related to cloudpickle, notably
>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
>> some feedback on updating the version in PySpark which should fix these
>> issues and allow us to remove some workarounds.  Spark is currently using a
>> forked version and it seems like updates are made every now and then when
>> needed, but it's not really clear where the current state is and how much
>> it has diverged.  This makes back-porting fixes difficult.  There was a
>> previous discussion on moving it to a dependency here, but given the
>> status right now I think it would be best to do another
>> update and bring things closer to upstream before we talk about completely
>> moving it outside of Spark.  Before starting another update, it might be
>> good to discuss the strategy a little.  Should the version in Spark be
>> derived from a release or at least tied to a specific commit?  It would
>> also be good if we can document where it has diverged.  Are there any known
>> issues with recent changes from those that follow cloudpickle dev?  Any
>> other thoughts or concerns?
>>
>> Thanks,
>> Bryan
>>
>
>


Re: Thoughts on Cloudpickle Update

2018-01-18 Thread Holden Karau
So if there are different versions of Python on the cluster machines, I
think that's already unsupported, so I'm not worried about that.

I'd suggest going to the highest released version since there appear to be
some useful fixes between 0.4.2 & 0.5.2

Also, let's try to keep track in our commit messages of which version of
cloudpickle we end up upgrading to.

On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler  wrote:

> Thanks for all the details and background Hyukjin! Regarding the pickle
> protocol change, if I understand correctly, it is currently at level 2 in
> Spark which is good for backwards compatibility for all of Python 2.
> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
> above, will pick a level determined by your Python version. So is the
> concern here for Spark if someone has different versions of Python in their
> cluster, like 3.5 and 3.3, then different protocols will be used and
> deserialization might fail?  Is it an option to match the latest version of
> cloudpickle and still set protocol level 2?
>
> I agree that upgrading to try and match version 0.4.2 would be a good
> starting point. Unless no one objects, I will open up a JIRA and try to do
> this.
>
> Thanks,
> Bryan
>
> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon  wrote:
>
>> Hi Bryan,
>>
>> Yup, I support to match the version. I pushed it forward before to match
>> it with https://github.com/cloudpipe/cloudpickle
>> before few times in Spark's copy and also cloudpickle itself with few
>> fixes. I believe our copy is closest to 0.4.1.
>>
>> I have been trying to follow up the changes in cloudpipe/cloudpickle for
>> which version we should match, I think we should match
>> it with 0.4.2 first (I need to double check) because IMHO they have been
>> adding rather radical changes from 0.5.0, including
>> pickle protocol change (by default).
>>
>> Personally, I would like to match it with the latest because there have
>> been some important changes. For
>> example, see this too - https://github.com/cloudpipe/cloudpickle/pull/138
>> (it's pending for reviewing yet) eventually but 0.4.2 should be
>> a good start point.
>>
>> For the strategy, I think we can match it and follow 0.4.x within Spark
>> for the conservative and safe choice + minimal cost.
>>
>>
>> I tried to leave few explicit answers to the questions from you, Bryan:
>>
>> > Spark is currently using a forked version and it seems like updates
>> are made every now and then when
>> > needed, but it's not really clear where the current state is and how
>> much it has diverged.
>>
>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>
>>
>> > Are there any known issues with recent changes from those that follow
>> cloudpickle dev?
>>
>> I am technically involved in cloudpickle dev although less active.
>> They changed default pickle protocol (https://github.com/cloudpipe/
>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
>> the potential compatibility issue, or fix the protocol, which I believe
>> is introduced from 0.5.x.
>>
>>
>>
>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler :
>>
>>> Hi All,
>>>
>>> I've seen a couple issues lately related to cloudpickle, notably
>>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to
>>> get some feedback on updating the version in PySpark which should fix these
>>> issues and allow us to remove some workarounds.  Spark is currently using a
>>> forked version and it seems like updates are made every now and then when
>>> needed, but it's not really clear where the current state is and how much
>>> it has diverged.  This makes back-porting fixes difficult.  There was a
>>> previous discussion on moving it to a dependency here
>>> ,
>>> but given the status right now I think it would be best to do another
>>> update and bring things closer to upstream before we talk about completely
>>> moving it outside of Spark.  Before starting another update, it might be
>>> good to discuss the strategy a little.  Should the version in Spark be
>>> derived from a release or at least tied to a specific commit?  It would
>>> also be good if we can document where it has diverged.  Are there any known
>>> issues with recent changes from those that follow cloudpickle dev?  Any
>>> other thoughts or concerns?
>>>
>>> Thanks,
>>> Bryan
>>>
>>
>>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Thoughts on Cloudpickle Update

2018-01-18 Thread Hyukjin Kwon
> Is it an option to match the latest version of cloudpickle and still set
protocol level 2?

IMHO, I think this can be an option, but I am not fully sure yet whether we
should/could go ahead with it within Spark 2.X. I need to do some
investigation, including around Pyrolite.

Let's go ahead with matching it to 0.4.2 first. I am quite clear on
matching it to 0.4.2 at least.
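
For reference, a minimal sketch of the "newer cloudpickle, pinned protocol"
option under discussion, assuming cloudpickle.dumps keeps its optional
protocol argument (it does in the 0.4.x/0.5.x series); this is only an
illustration, not a statement of how PySpark's serializers are wired.

import pickle

import cloudpickle


def double(x):
    return 2 * x


# Even if the bundled cloudpickle were upgraded past 0.5.0 (where the default
# became HIGHEST_PROTOCOL), passing protocol=2 explicitly keeps the output
# readable by older interpreters on the cluster.
blob = cloudpickle.dumps(double, protocol=2)

# cloudpickle output is ordinary pickle data, so the standard library can
# load it back on the other side.
restored = pickle.loads(blob)
print(restored(21))  # 42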


> I agree that upgrading to try and match version 0.4.2 would be a good
starting point. Unless no one objects, I will open up a JIRA and try to do
this.

Yup, but to be clear, I think we shouldn't put this into Spark 2.3.0.


> Also lets try to keep track in our commit messages which version of
cloudpickle we end up upgrading to.

+1: a PR description, commit message, or any other way to identify each
version will be useful. It should be easier once we have a matched version.



2018-01-19 12:55 GMT+09:00 Holden Karau :

> So if there are different version of Python on the cluster machines I
> think that's already unsupported so I'm not worried about that.
>
> I'd suggest going to the highest released version since there appear to be
> some useful fixes between 0.4.2 & 0.5.2
>
> Also lets try to keep track in our commit messages which version of
> cloudpickle we end up upgrading to.
>
> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler  wrote:
>
>> Thanks for all the details and background Hyukjin! Regarding the pickle
>> protocol change, if I understand correctly, it is currently at level 2 in
>> Spark which is good for backwards compatibility for all of Python 2.
>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>> above, will pick a level determined by your Python version. So is the
>> concern here for Spark if someone has different versions of Python in their
>> cluster, like 3.5 and 3.3, then different protocols will be used and
>> deserialization might fail?  Is it an option to match the latest version of
>> cloudpickle and still set protocol level 2?
>>
>> I agree that upgrading to try and match version 0.4.2 would be a good
>> starting point. Unless no one objects, I will open up a JIRA and try to do
>> this.
>>
>> Thanks,
>> Bryan
>>
>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> Yup, I support to match the version. I pushed it forward before to match
>>> it with https://github.com/cloudpipe/cloudpickle
>>> before few times in Spark's copy and also cloudpickle itself with few
>>> fixes. I believe our copy is closest to 0.4.1.
>>>
>>> I have been trying to follow up the changes in cloudpipe/cloudpickle for
>>> which version we should match, I think we should match
>>> it with 0.4.2 first (I need to double check) because IMHO they have been
>>> adding rather radical changes from 0.5.0, including
>>> pickle protocol change (by default).
>>>
>>> Personally, I would like to match it with the latest because there have
>>> been some important changes. For
>>> example, see this too - https://github.com/cloudpipe
>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but
>>> 0.4.2 should be
>>> a good start point.
>>>
>>> For the strategy, I think we can match it and follow 0.4.x within Spark
>>> for the conservative and safe choice + minimal cost.
>>>
>>>
>>> I tried to leave few explicit answers to the questions from you, Bryan:
>>>
>>> > Spark is currently using a forked version and it seems like updates
>>> are made every now and then when
>>> > needed, but it's not really clear where the current state is and how
>>> much it has diverged.
>>>
>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>
>>>
>>> > Are there any known issues with recent changes from those that follow
>>> cloudpickle dev?
>>>
>>> I am technically involved in cloudpickle dev although less active.
>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
>>> the potential compatibility issue, or fix the protocol, which I believe
>>> is introduced from 0.5.x.
>>>
>>>
>>>
>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler :
>>>
 Hi All,

 I've seen a couple issues lately related to cloudpickle, notably
 https://issues.apache.org/jira/browse/SPARK-22674, and would like to
 get some feedback on updating the version in PySpark which should fix these
 issues and allow us to remove some workarounds.  Spark is currently using a
 forked version and it seems like updates are made every now and then when
 needed, but it's not really clear where the current state is and how much
 it has diverged.  This makes back-porting fixes difficult.  There was a
 previous discussion on moving it to a dependency here
 ,
 but given the status right now I think it would be best to do another
 update and bring things closer to upstream before we talk a

Re: Thoughts on Cloudpickle Update

2018-01-18 Thread Holden Karau
On Jan 19, 2018 7:28 PM, "Hyukjin Kwon"  wrote:

> Is it an option to match the latest version of cloudpickle and still set
protocol level 2?

IMHO, I think this can be an option but I am not fully sure yet if we
should/could go ahead for it within Spark 2.X. I need some
investigations including things about Pyrolite.

Let's go ahead with matching it to 0.4.2 first. I am quite clear on
matching it to 0.4.2 at least.

So, given that there is a follow-up which fixes a regression: if we're not
comfortable going to the latest version, let's double-check that the version
we do upgrade to doesn't have that regression.



> I agree that upgrading to try and match version 0.4.2 would be a good
starting point. Unless no one objects, I will open up a JIRA and try to do
this.

Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.

So, given that it fixes some real-world bugs, any particular reason why?
Would you be comfortable with doing it in 2.3.1?



> Also lets try to keep track in our commit messages which version of
cloudpickle we end up upgrading to.

+1: PR description, commit message or any unit to identify each will be useful.
It should be easier once we have a matched version.



2018-01-19 12:55 GMT+09:00 Holden Karau :

> So if there are different version of Python on the cluster machines I
> think that's already unsupported so I'm not worried about that.
>
> I'd suggest going to the highest released version since there appear to be
> some useful fixes between 0.4.2 & 0.5.2
>
> Also lets try to keep track in our commit messages which version of
> cloudpickle we end up upgrading to.
>
> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler  wrote:
>
>> Thanks for all the details and background Hyukjin! Regarding the pickle
>> protocol change, if I understand correctly, it is currently at level 2 in
>> Spark which is good for backwards compatibility for all of Python 2.
>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>> above, will pick a level determined by your Python version. So is the
>> concern here for Spark if someone has different versions of Python in their
>> cluster, like 3.5 and 3.3, then different protocols will be used and
>> deserialization might fail?  Is it an option to match the latest version of
>> cloudpickle and still set protocol level 2?
>>
>> I agree that upgrading to try and match version 0.4.2 would be a good
>> starting point. Unless no one objects, I will open up a JIRA and try to do
>> this.
>>
>> Thanks,
>> Bryan
>>
>> On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon 
>> wrote:
>>
>>> Hi Bryan,
>>>
>>> Yup, I support to match the version. I pushed it forward before to match
>>> it with https://github.com/cloudpipe/cloudpickle
>>> before few times in Spark's copy and also cloudpickle itself with few
>>> fixes. I believe our copy is closest to 0.4.1.
>>>
>>> I have been trying to follow up the changes in cloudpipe/cloudpickle for
>>> which version we should match, I think we should match
>>> it with 0.4.2 first (I need to double check) because IMHO they have been
>>> adding rather radical changes from 0.5.0, including
>>> pickle protocol change (by default).
>>>
>>> Personally, I would like to match it with the latest because there have
>>> been some important changes. For
>>> example, see this too - https://github.com/cloudpipe
>>> /cloudpickle/pull/138 (it's pending for reviewing yet) eventually but
>>> 0.4.2 should be
>>> a good start point.
>>>
>>> For the strategy, I think we can match it and follow 0.4.x within Spark
>>> for the conservative and safe choice + minimal cost.
>>>
>>>
>>> I tried to leave few explicit answers to the questions from you, Bryan:
>>>
>>> > Spark is currently using a forked version and it seems like updates
>>> are made every now and then when
>>> > needed, but it's not really clear where the current state is and how
>>> much it has diverged.
>>>
>>> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>>>
>>>
>>> > Are there any known issues with recent changes from those that follow
>>> cloudpickle dev?
>>>
>>> I am technically involved in cloudpickle dev although less active.
>>> They changed default pickle protocol (https://github.com/cloudpipe/
>>> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
>>> the potential compatibility issue, or fix the protocol, which I believe
>>> is introduced from 0.5.x.
>>>
>>>
>>>
>>> 2018-01-16 11:43 GMT+09:00 Bryan Cutler :
>>>
 Hi All,

 I've seen a couple issues lately related to cloudpickle, notably
 https://issues.apache.org/jira/browse/SPARK-22674, and would like to
 get some feedback on updating the version in PySpark which should fix these
 issues and allow us to remove some workarounds.  Spark is currently using a
 forked version and it seems like updates are made every now and then when
 needed, but it's not really clear where the current state is and how much
 it has diverged.  This makes back-porting fixe