[VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Saisai Shao
Please vote on releasing the following candidate as Apache Spark version
2.3.2.

The vote is open until August 20 PST and passes if a majority of +1 PMC
votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 2.3.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.3.2-rc5 (commit
4dc82259d81102e0cb48f4cb2e8075f80d899ac4):
https://github.com/apache/spark/tree/v2.3.2-rc5

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1281/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v2.3.2-rc5-docs/

The list of bug fixes going into 2.3.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12343289

Note: RC4 was cancelled because of a blocking issue, SPARK-25084, found
during release preparation.

FAQ

=========================================
How can I help test this release?
=========================================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload, running it on this release candidate, and
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks. In Java/Scala,
you can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
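For example, for a Java/Scala project built with sbt, adding the staging
repository might look like the following minimal sketch (the spark-sql
module is just one example coordinate; adjust for the modules you use):

    // build.sbt -- resolve the 2.3.2 RC5 artifacts from the staging repo
    resolvers += "Spark 2.3.2 RC5 staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1281/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.2"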

===========================================================
What should happen to JIRA tickets still targeting 2.3.2?
===========================================================

The current list of open tickets targeted at 2.3.2 can be found by going to
https://issues.apache.org/jira/projects/SPARK and searching for "Target
Version/s" = 2.3.2.

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else should be retargeted to an
appropriate release.

==========================
But my bug isn't fixed?
==========================

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That said, if there is a regression that has not been
correctly targeted, please ping me or a committer to help target the
issue.


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Marco Gaido
-1, due to SPARK-25051. It is a regression and a correctness bug: in
2.3.0/2.3.1 an AnalysisException was thrown, while 2.2.* works fine.
I cannot reproduce the issue on current master, but I was able to
reproduce it with the prepared 2.3.2 release.




[DISCUSS][SPARK-22674][PYTHON] Disabled _hack_namedtuple for picklable namedtuples

2018-08-14 Thread Sergei Lebedev
Hi all,

Some time ago we discovered that PySpark patches
collections.namedtuple to allow unpickling of namedtuples defined in the
REPL on the executors (see the sketch below). Side effects of the patch
include:

* hard-to-debug failures -- we originally came across this while
investigating a TensorFlowOnSpark failure, see [1];
* serialization overhead -- each namedtuple instance carries a full
namedtuple definition.
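For readers without the PySpark source at hand, here is a minimal sketch of
the mechanism under discussion: rewriting __reduce__ so that every pickled
instance carries enough information to rebuild its class on the executor.
The names follow pyspark/serializers.py, but the body is simplified and
illustrative, not the exact PySpark code:

    import collections

    def _restore(name, fields, value):
        # Rebuild the namedtuple class from its pickled definition,
        # then rebuild the instance itself.
        cls = collections.namedtuple(name, fields)
        return cls(*value)

    def _hack_namedtuple(cls):
        # Serialize the class name and field list with every instance,
        # so an executor can recreate a class that only exists in the
        # driver's REPL.
        name, fields = cls.__name__, cls._fields

        def __reduce__(self):
            return (_restore, (name, fields, tuple(self)))

        cls.__reduce__ = __reduce__
        return cls

Note that the (name, fields) pair is shipped with every instance, which is
where the per-instance serialization overhead comes from.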

I think it is best to completely remove the patch since the benefits it
brings are insignificant compared to the issues. However, there is a middle
ground which to me looks non-intrusive enough to be releasable in the 2.X
branch. The proposed PR [2] does not break any of the currently working
usages of namedtuple while reducing the damage done by the patch when the
namedtuple or its subclass is importable.

Do you think it might be possible to merge the PR in either 2.4.X or the
following 2.X release?

Cheers,
Sergei

[1]:
https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html
[2]: https://github.com/apache/spark/pull/21180


Same code in DataFrameWriter.runCommand and Dataset.withAction?

2018-08-14 Thread Jacek Laskowski
Hi,

I'm curious why Spark SQL uses two different methods for what looks like
the very same code.

* DataFrameWriter.runCommand -->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L663

* Dataset.withAction -->
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L3317

It looks like the relationship is as follows:

DataFrameWriter.runCommand == Dataset.withAction(_.execute)

Should one be removed in favor of the other? I'd first change runCommand to
use withAction(_.execute), or even remove runCommand altogether.
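For readers without the source open, here is a condensed sketch of the two
methods (paraphrased from the files linked above, with metrics resets and
some details elided), showing why they look like duplicates:

    // DataFrameWriter.runCommand, condensed: eagerly execute a command plan
    private def runCommand(session: SparkSession, name: String)(command: LogicalPlan): Unit = {
      val qe = session.sessionState.executePlan(command)
      try {
        val start = System.nanoTime()
        SQLExecution.withNewExecutionId(session, qe)(qe.toRdd)  // triggers execution
        session.listenerManager.onSuccess(name, qe, System.nanoTime() - start)
      } catch {
        case e: Exception => session.listenerManager.onFailure(name, qe, e); throw e
      }
    }

    // Dataset.withAction, condensed: run `action` against the executed plan
    private def withAction[U](name: String, qe: QueryExecution)(action: SparkPlan => U): U = {
      try {
        val start = System.nanoTime()
        val result = SQLExecution.withNewExecutionId(sparkSession, qe)(action(qe.executedPlan))
        sparkSession.listenerManager.onSuccess(name, qe, System.nanoTime() - start)
        result
      } catch {
        case e: Exception => sparkSession.listenerManager.onFailure(name, qe, e); throw e
      }
    }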

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
Is anyone else seeing the sql module maven build on the master branch fail
when using zinc for incremental builds?

[warn]   ^
java.lang.OutOfMemoryError: GC overhead limit exceeded
at 
scala.tools.nsc.backend.icode.GenICode$Scope.<init>(GenICode.scala:2225)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase$Context.enterScope(GenICode.scala:1916)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase$Context.enterMethod(GenICode.scala:1901)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:118)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:71)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:148)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:98)
at 
scala.tools.nsc.backend.icode.GenICode$ICodePhase.gen(GenICode.scala:71)

All is well when zinc is disabled, so I have a workaround -- it's just a
very slow workaround.





Re: sql compile failing with Zinc?

2018-08-14 Thread Sean Owen
If you're running zinc directly, you can give it more memory with -J-Xmx2g
or whatever. If you're running ./build/mvn and letting it run zinc, we might
need to increase the memory that it requests in the script.
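For instance, when launching zinc by hand, the larger heap could be passed
at server start -- a hedged sketch based on the -J-Xmx2g flag above; the
path and zinc version here are illustrative, not prescribed:

    ./build/zinc-0.3.15/bin/zinc -start -J-Xmx2g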



Re: sql compile failing with Zinc?

2018-08-14 Thread Marco Gaido
I am not sure -- I managed to build successfully today using the mvn in the
distribution.



Re: [DISCUSS] Handling correctness/data loss jiras

2018-08-14 Thread Imran Rashid
+1 on what we should do.

On Mon, Aug 13, 2018 at 3:06 PM, Tom Graves 
wrote:

>
> > I mean, what are concrete steps beyond saying this is a problem? That's
> > the important thing to discuss.
>
> Sorry, I'm a bit confused by your statement, but I also think I agree. I
> started this thread for this reason: I pointed out that I thought it was a
> problem and also brought up things I thought we could do to help fix it.
>
> Maybe I wasn't clear in the first email: the list of things I had were
> proposals for what we should do for a JIRA covering a correctness/data loss
> issue. It's the committers and developers that are involved in this, though,
> so if people don't agree or aren't going to do them, then it doesn't work.
>
> Just to restate what I think we should do:
>
> - label any correctness/data loss JIRA with "correctness"
> - mark the JIRA as a blocker by default if someone suspects a
> corruption/loss issue
> - make sure the description is clear about when the issue occurs and its
> impact on the user
> - ensure the fix is back-ported to all active branches
> - see if we can have a separate section in the release notes for these
>
> The last one, I guess, is more of a one-time thing that I can file a JIRA
> for. The first four would be done for each JIRA filed.
>
> I'm proposing we do these things; if people agree, we would also document
> them in the committer or developer guide and send an email to the list.
>
>
>
> Tom
> On Monday, August 13, 2018, 11:17:22 AM CDT, Sean Owen 
> wrote:
>
>
> Generally: if someone thinks correctness fix X should be backported
> further, I'd say just do it, as long as it's to an active release branch
> (see below). Anything that important has to outweigh almost any other
> concern, like behavior changes.
>
>
> On Mon, Aug 13, 2018 at 11:08 AM Tom Graves  wrote:
>
> I'm not really sure what you mean by this; the proposal is to introduce a
> process for this type of issue so it's at least brought to people's
> attention. We can't do anything to make people work on certain things. If
> issues aren't raised as important, then it's really easy to miss them. If
> it's a blocker, we should also not be doing any new releases without a fix
> for it, which may motivate people to look at it.
>
>
> I mean, what are concrete steps beyond saying this is a problem? That's
> the important thing to discuss.
>
> There's a good one here: let's say anything that's likely to be a
> correctness or data loss issue should automatically be labeled
> 'correctness' and set to Blocker.
>
> That can go into the how-to-contribute manual in the docs and in a note to
> dev@.
>
>
>
> I agree it would be good for us to be more explicit about which branches
> are being maintained. I think at this point it's still 2.1.x, 2.2.x, and
> 2.3.x, since we recently did releases of all of these. Since 2.4 will be
> coming out, we should definitely think about stopping maintenance of 2.1.x.
> Perhaps we need a table on our release page about this. But this should be
> a separate thread.
>
>
> I propose writing something like this in the 'versioning' doc page, to at
> least establish a policy:
>
> Minor release branches will, generally, be maintained with bug-fix
> releases for a period of 18 months. For example, branch 2.1.x is no longer
> considered maintained as of July 2018, 18 months after the release of 2.1.0
> in December 2016.
>
> This gives us -- and more importantly users -- some understanding of what
> to expect for backporting and fixes.
>
>
> I am going to revive the thread about adding PMC members / committers, as
> it's overdue. That may not do much, but more hands to do the work ought to
> free up people to focus on deeper, harder issues.
>


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread antonkulaga
-1, as https://issues.apache.org/jira/browse/SPARK-16406 does not seem to
have been back-ported to 2.3.1, and it causes a lot of pain.






Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Sean Owen
(We wouldn't consider the lack of an improvement a reason to block a
maintenance release. It's reasonable to raise this elsewhere as a big
nice-to-have for the 2.3.x line in general.)



Re: [Performance] Spark DataFrame is slow with wide data. Polynomial complexity on the number of columns is observed. Why?

2018-08-14 Thread antonkulaga
Is it not going to be backported to 2.3.2? I am totally blocked by this issue
in one of my projects.






Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Wenchen Fan
SPARK-25051 is resolved; can we start a new RC?

SPARK-16406 is an improvement; generally we should not backport improvements.



[SPARK-24771] Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan
Hi all,

We've upgraded Avro from 1.7 to 1.8, to support date/timestamp/decimal
types in the newly added Avro data source in the coming Spark 2.4, and also
to make Avro work with Parquet.
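As a quick illustration, the new built-in source is expected to be usable
like any other data source format (a sketch, assuming the "avro" short
name; the paths are placeholders):

    // Reading and writing Avro with the built-in source in Spark 2.4
    val df = spark.read.format("avro").load("/path/to/input.avro")
    df.write.format("avro").save("/path/to/output")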

Since Avro 1.8 is not binary compatible with Avro 1.7 (see
https://issues.apache.org/jira/browse/AVRO-1502), users may need to
re-compile their Spark applications if they use Spark with Avro.

I'm sending this email to collect feedback on this upgrade, so that we can
be clearer in the release notes about what can break.

Thanks,
Wenchen


Re: [VOTE] SPARK 2.3.2 (RC5)

2018-08-14 Thread Saisai Shao
There's still another one, SPARK-25114.

I will wait several days in case other blockers come up.

Thanks
Saisai





Re: sql compile failing with Zinc?

2018-08-14 Thread Steve Loughran
Thanks. I'm launching zinc by hand, but then mvn is handing work off to it.
It might be best to make the memory property configurable so that people
can play with it themselves.
