Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-04 Thread Takeshi Yamamuro
+1;
I ran the tests with
`-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
-Psparkr`
on macOS (Java 8).
Everything looks fine in my env.

Bests,
Takeshi

On Tue, Feb 4, 2020 at 12:35 PM Hyukjin Kwon  wrote:

> +1 from me too.
>
> On Tue, Feb 4, 2020 at 12:26 PM Wenchen Fan wrote:
>
>> AFAIK there are no ongoing critical bug fixes, +1
>>
>> On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun 
>> wrote:
>>
>>> Yes, it does officially since 2.4.0.
>>>
>>> 2.4.5 is a maintenance release of the 2.4.x line, and the community didn't
>>> support Hadoop 3.x on 'branch-2.4'. We didn't run the tests at all.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sun, Feb 2, 2020 at 22:58 Ajith shetty 
>>> wrote:
>>>
 Is the hadoop-3.1 profile supported for this release? I see a lot of UTs
 failing under this profile.
 https://github.com/apache/spark/blob/v2.4.5-rc2/pom.xml

 *Example:*
  [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
 [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed:
 1.717 s <<< FAILURE! - in
 org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
 [ERROR]
 saveExternalTableAndQueryIt(org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite)
 Time elapsed: 1.675 s  <<< ERROR!
 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
 Caused by: java.lang.IllegalArgumentException: *Unrecognized Hadoop
 major version number: 3.1.0*
 at
 org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)

>>>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-04 Thread Maxim Gekk
+1
I re-ran some of the existing benchmarks in branch-2.4 on Linux/macOS, and
haven't found any regressions compared to 2.4.4.

Maxim Gekk


On Tue, Feb 4, 2020 at 11:07 AM Takeshi Yamamuro 
wrote:

> +1;
> I ran the tests with
> `-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
> -Psparkr`
> on macOS (Java 8).
> Everything looks fine in my env.
>
> Bests,
> Takeshi
>
> On Tue, Feb 4, 2020 at 12:35 PM Hyukjin Kwon  wrote:
>
>> +1 from me too.
>>
>> On Tue, Feb 4, 2020 at 12:26 PM Wenchen Fan wrote:
>>
>>> AFAIK there are no ongoing critical bug fixes, +1
>>>
>>> On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun 
>>> wrote:
>>>
 Yes, it does officially since 2.4.0.

 2.4.5 is a maintenance release of the 2.4.x line, and the community didn't
 support Hadoop 3.x on 'branch-2.4'. We didn't run the tests at all.

 Bests,
 Dongjoon.

 On Sun, Feb 2, 2020 at 22:58 Ajith shetty 
 wrote:

> Is the hadoop-3.1 profile supported for this release? I see a lot of UTs
> failing under this profile.
> https://github.com/apache/spark/blob/v2.4.5-rc2/pom.xml
>
> *Example:*
>  [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
> [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time
> elapsed: 1.717 s <<< FAILURE! - in
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
> [ERROR]
> saveExternalTableAndQueryIt(org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite)
> Time elapsed: 1.675 s  <<< ERROR!
> java.lang.ExceptionInInitializerError
> at
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
> Caused by: java.lang.IllegalArgumentException: *Unrecognized Hadoop
> major version number: 3.1.0*
> at
> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
>

>
> --
> ---
> Takeshi Yamamuro
>


unify benchmarks in 2.4 and regenerate results

2020-02-04 Thread Maxim Gekk
Hi All,

Currently, most benchmark results are embedded in the benchmark source
code in Spark 2.4.x. This makes comparing results between 2.4 releases
and master pretty inconvenient. I would like to propose unifying the
benchmarks in branch-2.4 by backporting the changes made in master
(https://issues.apache.org/jira/browse/SPARK-25475) and regenerating all
results in the same environment. This would allow comparing Spark 3.0 to
Spark 2.4.x, as well as minor releases of 2.4.x to each other.
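
For reference, here is a minimal sketch of the main-method benchmark style
introduced by SPARK-25475 (class and method names are recalled from the master
branch of that time and may differ slightly after the backport to branch-2.4;
treat it as illustrative rather than authoritative):

  import org.apache.spark.benchmark.{Benchmark, BenchmarkBase}

  // With SPARK_GENERATE_BENCHMARK_FILES=1 the framework writes results to a
  // separate *-results.txt file instead of requiring them to be pasted back
  // into the benchmark source.
  object StringScoreBenchmark extends BenchmarkBase {
    override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
      runBenchmark("string scoring") {
        val N = 1000 * 1000
        val data = Array.tabulate(N)(i => s"spark-$i")
        val benchmark =
          new Benchmark("substring of generated strings", N, output = output)
        benchmark.addCase("substring(0, 5)") { _ =>
          data.foreach(s => s.substring(0, 5))
        }
        benchmark.run()
      }
    }
  }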

Maxim Gekk

Software Engineer

Databricks, Inc.


Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-04 Thread Sean Owen
+1 from me too. Same outcome as in RC1 for me.

On Sun, Feb 2, 2020 at 9:31 PM Dongjoon Hyun  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 2.4.5.
>
> The vote is open until February 5th 11PM PST and passes if a majority of +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.5
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.4.5-rc2 (commit 
> cee4ecbb16917fa85f02c635925e2687400aa56b):
> https://github.com/apache/spark/tree/v2.4.5-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1340/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc2-docs/
>
> The list of bug fixes going into 2.4.5 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12346042
>
> This release is using the release script of the tag v2.4.5-rc2.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks; in the Java/Scala
> world, you can add the staging repository to your project's resolvers and
> test with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
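>
> A minimal sketch of that last part, assuming an sbt build and that the staged
> artifacts are published under version 2.4.5 (adjust the artifact and Scala
> version for your own project):
>
>   // build.sbt
>   resolvers += "Spark 2.4.5 RC2 staging" at
>     "https://repository.apache.org/content/repositories/orgapachespark-1340/"
>   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"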
>
> ===
> What should happen to JIRA tickets still targeting 2.4.5?
> ===
>
> The current list of open tickets targeted at 2.4.5 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 2.4.5
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Initial Decom PR for Spark 3?

2020-02-04 Thread Holden Karau
Hi Y’all,

I’ve got a K8s graceful decom PR (
https://github.com/apache/spark/pull/26440
 ) I’d love to try and get in for Spark 3, but I don’t want to push on it
if folks don’t think it’s worth it. I’ve been working on it since 2017 and
it was really close in November, but then I had the crash and had to step
back for a while.

It’s effectiveness is behind a feature flag and it’s been outstanding for
awhile so those points are in its favour. It does however change things in
core which is not great.

Cheers,

Holden
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-04 Thread Dongjoon Hyun
Thank you, Shane! :D

Bests,
Dongjoon

On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:

> all the 3.0 builds have been created and are currently churning away!
>
> (the failed builds were due to a silly bug in the build scripts sneaking its
> way back in, but that's resolved now)
>
> shane
>
> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>
>> Note that branch-3.0 was cut. Please focus on testing, polish, and let's
>> get the release out!
>>
>>
>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin  wrote:
>>
>>> Just a reminder - code freeze is coming this Fri!
>>>
>>> There can always be exceptions, but those should be exceptions and
>>> discussed on a case by case basis rather than becoming the norm.
>>>
>>>
>>>
>>> On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Jan 31 sounds good to me.

 Just curious, do we allow some exceptions to the code freeze? One thing that
 came to mind is that some feature could have multiple subtasks, where part of
 the subtasks have been merged and other subtask(s) are still in review. In
 this case, do we allow those subtasks more days to get reviewed and
 merged later?

 Happy Holiday!

 Thanks,
 Jungtaek Lim (HeartSaVioR)

 On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro 
 wrote:

> Looks nice, happy holiday, all!
>
> Bests,
> Takeshi
>
> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun 
> wrote:
>
>> +1 for January 31st.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>> wrote:
>>
>>> Jan 31 is pretty reasonable. Happy Holidays!
>>>
>>> Xiao
>>>
>>> On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:
>>>
 Yep, always happens. Is earlier realistic, like Jan 15? It's all
 arbitrary, but indeed this has been in progress for a while, and there's a
 downside to not releasing it, to making the gap to 3.0 larger.
 On my end I don't know of anything that's holding up a release; is
 it basically DSv2?

 BTW these are the items still targeted to 3.0.0, some of which may
 not have been legitimately tagged. It may be worth reviewing what's still
 open and necessary, and what should be untargeted.

 SPARK-29768 nondeterministic expression fails column pruning
 SPARK-29345 Add an API that allows a user to define and observe
 arbitrary metrics on streaming queries
 SPARK-29348 Add observable metrics
 SPARK-29429 Support Prometheus monitoring natively
 SPARK-29577 Implement p-value simulation and unit tests for chi2
 test
 SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
 SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
 SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
 SPARK-28588 Build a SQL reference doc
 SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
 SPARK-28684 Hive module support JDK 11
 SPARK-28548 explain() shows wrong result for persisted DataFrames
 after some operations
 SPARK-28264 Revisiting Python / pandas UDF
 SPARK-28301 fix the behavior of table name resolution with
 multi-catalog
 SPARK-28155 do not leak SaveMode to file source v2
 SPARK-28103 Cannot infer filters from union table with empty local
 relation table properly
 SPARK-27986 Support Aggregate Expressions with filter
 SPARK-28024 Incorrect numeric values when out of range
 SPARK-27936 Support local dependency uploading from --py-files
 SPARK-27780 Shuffle server & client should be versioned to enable
 smoother upgrade
 SPARK-27714 Support Join Reorder based on Genetic Algorithm when
 the # of joined tables > 12
 SPARK-27471 Reorganize public v2 catalog API
 SPARK-27520 Introduce a global config system to replace
 hadoopConfiguration
 SPARK-24625 put all the backward compatible behavior change configs
 under spark.sql.legacy.*
 SPARK-24941 Add RDDBarrier.coalesce() function
 SPARK-25017 Add test suite for ContextBarrierState
 SPARK-25083 remove the type erasure hack in data source scan
 SPARK-25383 Image data source supports sample pushdown
 SPARK-27272 Enable blacklisting of node/executor on fetch failures
 by default
 SPARK-27296 Efficient User Defined Aggregators
 SPARK-25128 multiple simultaneous job submissions against k8s
 backend cause driver pods to hang
 SPARK-26664 Make DecimalType's minimum adjusted scale configurable
 SPARK-21559 Remove Mesos fine-grained mode
 SPARK-24942 Improve cluster resource management with jobs
 containing barrier stage
 SPARK-25914 Separate projectio

Re: Spark 3.0 branch cut and code freeze on Jan 31?

2020-02-04 Thread Xiao Li
Thank you, Shane!

Xiao

On Tue, Feb 4, 2020 at 2:16 PM Dongjoon Hyun 
wrote:

> Thank you, Shane! :D
>
> Bests,
> Dongjoon
>
> On Tue, Feb 4, 2020 at 13:28 shane knapp ☠  wrote:
>
>> all the 3.0 builds have been created and are currently churning away!
>>
>> (the failed builds were due to a silly bug in the build scripts sneaking its
>> way back in, but that's resolved now)
>>
>> shane
>>
>> On Sat, Feb 1, 2020 at 6:16 PM Reynold Xin  wrote:
>>
>>> Note that branch-3.0 was cut. Please focus on testing, polish, and let's
>>> get the release out!
>>>
>>>
>>> On Wed, Jan 29, 2020 at 3:41 PM, Reynold Xin 
>>> wrote:
>>>
 Just a reminder - code freeze is coming this Fri!

 There can always be exceptions, but those should be exceptions and
 discussed on a case by case basis rather than becoming the norm.



 On Tue, Dec 24, 2019 at 4:55 PM, Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Jan 31 sounds good to me.
>
> Just curious, do we allow some exceptions to the code freeze? One thing
> that came to mind is that some feature could have multiple subtasks, where
> part of the subtasks have been merged and other subtask(s) are still in
> review. In this case, do we allow those subtasks more days to get reviewed
> and merged later?
>
> Happy Holiday!
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Wed, Dec 25, 2019 at 8:36 AM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
>> Looks nice, happy holiday, all!
>>
>> Bests,
>> Takeshi
>>
>> On Wed, Dec 25, 2019 at 3:56 AM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> +1 for January 31st.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Dec 24, 2019 at 7:11 AM Xiao Li 
>>> wrote:
>>>
 Jan 31 is pretty reasonable. Happy Holidays!

 Xiao

 On Tue, Dec 24, 2019 at 5:52 AM Sean Owen  wrote:

> Yep, always happens. Is earlier realistic, like Jan 15? It's all
> arbitrary, but indeed this has been in progress for a while, and there's a
> downside to not releasing it, to making the gap to 3.0 larger.
> On my end I don't know of anything that's holding up a release; is
> it basically DSv2?
>
> BTW these are the items still targeted to 3.0.0, some of which may
> not have been legitimately tagged. It may be worth reviewing what's still
> open and necessary, and what should be untargeted.
>
> SPARK-29768 nondeterministic expression fails column pruning
> SPARK-29345 Add an API that allows a user to define and observe
> arbitrary metrics on streaming queries
> SPARK-29348 Add observable metrics
> SPARK-29429 Support Prometheus monitoring natively
> SPARK-29577 Implement p-value simulation and unit tests for chi2
> test
> SPARK-28900 Test Pyspark, SparkR on JDK 11 with run-tests
> SPARK-28883 Fix a flaky test: ThriftServerQueryTestSuite
> SPARK-28717 Update SQL ALTER TABLE RENAME  to use TableCatalog API
> SPARK-28588 Build a SQL reference doc
> SPARK-28629 Capture the missing rules in HiveSessionStateBuilder
> SPARK-28684 Hive module support JDK 11
> SPARK-28548 explain() shows wrong result for persisted DataFrames
> after some operations
> SPARK-28264 Revisiting Python / pandas UDF
> SPARK-28301 fix the behavior of table name resolution with
> multi-catalog
> SPARK-28155 do not leak SaveMode to file source v2
> SPARK-28103 Cannot infer filters from union table with empty local
> relation table properly
> SPARK-27986 Support Aggregate Expressions with filter
> SPARK-28024 Incorrect numeric values when out of range
> SPARK-27936 Support local dependency uploading from --py-files
> SPARK-27780 Shuffle server & client should be versioned to enable
> smoother upgrade
> SPARK-27714 Support Join Reorder based on Genetic Algorithm when
> the # of joined tables > 12
> SPARK-27471 Reorganize public v2 catalog API
> SPARK-27520 Introduce a global config system to replace
> hadoopConfiguration
> SPARK-24625 put all the backward compatible behavior change
> configs under spark.sql.legacy.*
> SPARK-24941 Add RDDBarrier.coalesce() function
> SPARK-25017 Add test suite for ContextBarrierState
> SPARK-25083 remove the type erasure hack in data source scan
> SPARK-25383 Image data source supports sample pushdown
> SPARK-27272 Enable blacklisting of node/executor on fetch failures
> by default
> SPARK-27296 Efficient User Defined Aggregators
> SPARK-25128 multiple simultaneous job submissions against k8s
> backend cause driver pods to hang

Re: More publicly documenting the options under spark.sql.*

2020-02-04 Thread Hyukjin Kwon
FYI, a PR was opened at https://github.com/apache/spark/pull/27459. Thanks,
Nicholas.
Hope you guys find some time to take a look.

On Tue, Jan 28, 2020 at 8:15 AM Nicholas Chammas wrote:

> I am! Thanks for the reference.
>
> On Thu, Jan 16, 2020 at 9:53 PM Hyukjin Kwon  wrote:
>
>> Nicholas, are you interested in taking a stab at this? You could refer to
>> https://github.com/apache/spark/commit/60472dbfd97acfd6c4420a13f9b32bc9d84219f3
>>
>> On Fri, Jan 17, 2020 at 8:48 AM Takeshi Yamamuro wrote:
>>
>>> The idea looks nice. I think web documents always help end users.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu <
>>> shixi...@databricks.com> wrote:
>>>
 "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
 configurations. Should be pretty easy to automatically generate a SQL
 configuration page.
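
 A minimal sketch of that idea, assuming SET -v keeps its usual three columns
 (key, value, meaning — worth double-checking on the version you run):

   // Render all non-internal SQL configs as a Markdown table.
   val rows = spark.sql("SET -v").collect().map { r =>
     s"| ${r.getString(0)} | ${r.getString(1)} | ${r.getString(2)} |"
   }
   println("| Property | Default | Meaning |")
   println("| --- | --- | --- |")
   rows.foreach(println)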

 Best Regards,
 Ryan


 On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon 
 wrote:

> I think automatically creating a configuration page isn't a bad idea
> because I think we deprecate and remove configurations which are not
> created via .internal() in SQLConf anyway.
>
> I already tried this automatic generation from the code for the SQL
> built-in functions, and I'm pretty sure we can do a similar thing for
> configurations as well.
>
> We could perhaps mimic what Hadoop does:
> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>
> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>
>> Some of it is intentionally undocumented, as far as I know, as an
>> experimental option that may change, a legacy option, or a safety-valve flag.
>> Certainly anything that's marked an internal conf. (That does raise
>> the question of who it's for, if you have to read source to find it.)
>>
>> I don't know if we need to overhaul the conf system, but there may
>> indeed be some confs that could legitimately be documented. I don't
>> know which.
>>
>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>  wrote:
>> >
>> > I filed SPARK-30510 thinking that we had forgotten to document an
>> option, but it turns out that there's a whole bunch of stuff under
>> SQLConf.scala that has no public documentation under
>> http://spark.apache.org/docs.
>> >
>> > Would it be appropriate to somehow automatically generate a
>> documentation page from SQLConf.scala, as Hyukjin suggested on that 
>> ticket?
>> >
>> > Another thought that comes to mind is moving the config definitions
>> out of Scala and into a data format like YAML or JSON, and then sourcing
>> that both for SQLConf as well as for whatever documentation page we want 
>> to
>> generate. What do you think of that idea?
>> >
>> > Nick
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


RE: [SQL] Is it worth it (and advisable) to implement native UDFs?

2020-02-04 Thread email
Is there any documentation/sample about this besides the pull requests merged
into Spark core?

 

It seems that I need to create my custom functions under the package
org.apache.spark.sql.* in order to be able to access some of the internal
classes I saw in [1], such as Column [2].

 

Could you please confirm if that’s how it should be?

 

Thanks!

 

[1] https://github.com/apache/spark/pull/7214

[2] 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37
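
For what it's worth, a minimal sketch of the wrapping part, assuming a custom
Catalyst expression such as the SimilarityScore class sketched after the
original question further down in this thread. The primary Column constructor
takes an Expression and appears to be public in the 2.4/3.0 sources, so the
expression class itself does not have to live under org.apache.spark.sql:

  import org.apache.spark.sql.Column
  import org.apache.spark.sql.functions.col

  // Expose the custom expression as an ordinary Column-based function.
  def similarity(a: Column, b: Column): Column =
    new Column(SimilarityScore(a.expr, b.expr))

  // usage: df.withColumn("score", similarity(col("a"), col("b")))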
 

 

From: Reynold Xin  
Sent: Wednesday, January 22, 2020 2:22 AM
To: em...@yeikel.com
Cc: dev@spark.apache.org
Subject: Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

 

  

 

If your UDF itself is very CPU intensive, it probably won't make that much of a
difference, because the UDF itself will dwarf the serialization/deserialization
overhead.

 

If your UDF is cheap, it will help tremendously.

 

 

On Mon, Jan 20, 2020 at 6:33 PM, em...@yeikel.com wrote:

Hi, 

 

I read online [1] that for the best UDF performance it is possible to implement
them as internal Spark expressions, and I also saw a couple of pull requests,
such as [2] and [3], where this was put into practice (not sure if for that
reason or just to extend the API).

 

We have an algorithm that computes a score similar to what the Levenshtein
distance does, and it takes about 30%-40% of the overall time of our job. We are
looking for ways to improve it without adding more resources.

 

I was wondering if it would be advisable to implement it by extending
BinaryExpression like [1], and whether it would result in any performance gains.
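
For context, Spark already ships levenshtein() as a native Catalyst expression,
so the approach itself is well trodden. Below is a rough sketch of what such an
expression can look like — the exact set of abstract members varies between
Spark versions, and the scoring body here is only a placeholder:

  import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
  import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
  import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
  import org.apache.spark.unsafe.types.UTF8String

  case class SimilarityScore(left: Expression, right: Expression)
    extends BinaryExpression with ImplicitCastInputTypes with CodegenFallback {

    override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)
    override def dataType: DataType = IntegerType

    // Works directly on Catalyst's UTF8String, so there is no per-row
    // Scala <-> Catalyst conversion (one of the main wins over a plain UDF).
    override protected def nullSafeEval(l: Any, r: Any): Any = {
      val a = l.asInstanceOf[UTF8String]
      val b = r.asInstanceOf[UTF8String]
      math.abs(a.numChars() - b.numChars()) // placeholder scoring logic
    }
  }

CodegenFallback keeps the sketch short; overriding doGenCode/nullSafeCodeGen
instead (as the built-in Levenshtein expression does) is what lets the
expression take part in generated code rather than falling back to interpreted
evaluation.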

 

Thanks for your help!

 

[1] 
https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11
 

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236