Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread John Zhuge
+1 (non-binding)

On Tue, Feb 16, 2021 at 11:11 PM Maxim Gekk 
wrote:

> +1 (non-binding)
>
> On Wed, Feb 17, 2021 at 9:54 AM Wenchen Fan  wrote:
>
>> +1
>>
>> On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell 
>>> wrote:
>>>
 +1

 On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon 
 wrote:

> +1
>
> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote:
>
>> +1
>>
>> On Tue, Feb 16, 2021 at 1:22 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.2.
>>>
>>> The vote is open until February 19th 9AM (PST) and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see
>>> https://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.2-rc1 (commit
>>> 648457905c4ea7d00e3d88048c63f360045f0714):
>>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>>
>>> The list of bug fixes going into 3.0.2 can be found at the following
>>> URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install
>>> the current RC, and see if anything important breaks. In Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.2?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.2 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 3.0.2
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>

-- 
John Zhuge


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
I did a simple benchmark (adding two long values) to compare the
performance between
1. native expression
2. the current UDF
3. new UDF with individual parameters
4. new UDF with a row parameter (with the row object cached)
5. invoke a static method (to explore the possibility of speeding up
stateless UDF, not very related to the current topic)

The benchmark code can be found here. The result is:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
UDF perf:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-------------------------------------------------------------------------------------------
native add                 14206         14516        535       70.4         14.2      1.0X
udf add                    24609         25271        898       40.6         24.6      0.6X
new udf add                18657         19096        726       53.6         18.7      0.8X
new row udf add            21128         22343       1478       47.3         21.1      0.7X
static udf add             16678         16887        278       60.0         16.7      0.9X


The new UDF with individual parameters is faster than the current UDF
because the virtual function call is eliminated. It's also faster than the
row-parameter version because there is no overhead to set/get row fields.

I prefer the individual-parameters version, not only because of the
performance gain (10% is not a big win), but also because:
1. It's consistent with the current Scala/Java UDF API.
2. It's simpler for developers to write simple UDFs (the parameters are the
input columns directly).
3. It's possible to allow multiple Java types for one Catalyst type, e.g.
allowing both String and UTF8String, which is more flexible.
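
As a rough sketch (the interface and method names here are illustrative
only, not the proposed API itself), the two styles look like this in Scala:

  import org.apache.spark.sql.catalyst.InternalRow

  // Row-parameter style: the function receives the whole input row and must
  // bind columns by index and type itself; mistakes only surface at runtime.
  class RowParamAdd {
    def produceResult(input: InternalRow): Long =
      input.getLong(0) + input.getLong(1)
  }

  // Individual-parameters style: Spark binds the input columns to the typed
  // parameters, so the signature documents the expected input and a mismatch
  // can be rejected at query compile time.
  class IndividualParamAdd {
    def call(left: Long, right: Long): Long = left + right
  }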

One major issue is not supporting varargs, but I'm not sure how
important this feature is. As I mentioned before, users can work around it
by accepting struct-type input and using the `struct` function to build the
input column. The current Scala/Java UDF doesn't support varargs either,
and neither do Presto/Transport.
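
For reference, the struct workaround can be sketched with the existing Scala
UDF API (the new catalog API is still under discussion); `sumFields` and `df`
are illustrative names, and columns a/b/c are assumed to be of long type:

  import org.apache.spark.sql.{DataFrame, Row}
  import org.apache.spark.sql.functions.{struct, udf}

  // Pack a variable number of columns into one struct column, so a function
  // that accepts a single struct-typed input can emulate varargs.
  val sumFields = udf((r: Row) => (0 until r.length).map(r.getLong).sum)

  def sumAll(df: DataFrame): DataFrame =
    df.select(sumFields(struct(df("a"), df("b"), df("c"))).as("sum_abc"))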

I'm fine with having an optional trait or flag to support varargs by
accepting InternalRow as the input, if there are user requests.

About debugging, I don't see a big issue here as the process of calling the
new UDF is very similar to the current Scala/Java UDF. Please let me know
if there are existing complaints about debugging the current Scala/Java
UDF. I think the row-parameter version is even harder to debug, as the
column binding happens in the user code (e.g. row.getLong(index)) and only
at runtime, while the individual-parameters version has a
query-compile-time check to make sure the function signature matches the
input columns.

I can help to come up with detailed rules about null handling, type
matching, etc. for the individual-parameters UDF, if we all agree with this
direction.

Last but not least, calling methods via reflection (searching for the method
handle only needs to be done once per task) is not that slow in modern
JVMs. The non-codegen path is already about 10x slower, and I don't think a
bit of overhead from Java reflection matters.
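
As a minimal illustration of the "resolve once, invoke per row" pattern
(plain Java reflection; `LongAddUdf` is a hypothetical class used only for
this example):

  // Hypothetical UDF class, only to illustrate the reflection cost model.
  class LongAddUdf {
    def call(a: Long, b: Long): Long = a + b
  }

  object ReflectiveCallSketch {
    def main(args: Array[String]): Unit = {
      val udf = new LongAddUdf
      // Paid once per task: look up the method by name and parameter types.
      val method = classOf[LongAddUdf].getMethod("call", classOf[Long], classOf[Long])
      // Paid per row: just the reflective invocation (plus boxing of arguments).
      val result = method.invoke(udf, java.lang.Long.valueOf(1L), java.lang.Long.valueOf(2L))
      println(result) // 3
    }
  }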



On Wed, Feb 17, 2021 at 3:07 PM Hyukjin Kwon  wrote:

> Just to make sure we don’t move past, I think we haven’t decided yet:
>
>- if we’ll replace the current proposal to Wenchen’s approach as the
>default
>- if we want to have Wenchen’s approach as an optional mix-in on the
>top of Ryan’s proposal (SupportsInvoke)
>
> From what I read, some people pointed it out as a replacement. Please
> correct me if I misread this discussion thread.
> As Dongjoon pointed out, it would be good to know a rough ETA, to make sure
> this keeps making progress and so people can compare more easily.
>
>
> FWIW, there’s a saying I like in the Zen of Python:
>
> There should be one— and preferably only one —obvious way to do it.
>
> If multiple approaches give developers a way to do (almost) the same
> thing, I would prefer to avoid it.
>
> In addition, I would prefer to focus on what Spark does by default first.
>
>
> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun wrote:
>
>> Hi, Wenchen.
>>
>> This thread seems to be getting enough attention. Also, I'm expecting more
>> and more attention once we have this on the `master` branch, because we are
>> developing it together.
>>
>> > Spark SQL has many active contributors/committers and this thread
>> doesn't get much attention yet.
>>
>> So, what's your ETA from now?
>>
>> > I think the problem here is we were discussing some very detailed
>> things without actual code.
>> > I'll implement my idea after the holiday and then we can have more
>> effective discussions.
>> > We can also do benchmarks and get some real numbers.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Dongjoon Hyun
Thank you so much for sharing the progress, Wenchen! Also, thank you,
Hyukjin.

Bests,
Dongjoon.

On Wed, Feb 17, 2021 at 2:49 AM Wenchen Fan  wrote:

> I did a simple benchmark (adding two long values) to compare the
> performance between
> 1. native expression
> 2. the current UDF
> 3. new UDF with individual parameters
> 4. new UDF with a row parameter (with the row object cached)
> 5. invoke a static method (to explore the possibility of speeding up
> stateless UDF, not very related to the current topic)
>
> The benchmark code can be found here. The result is:
>
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_161-b12 on Mac OS X 10.14.6
> Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
> UDF perf:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
> -------------------------------------------------------------------------------------------
> native add                 14206         14516        535       70.4         14.2      1.0X
> udf add                    24609         25271        898       40.6         24.6      0.6X
> new udf add                18657         19096        726       53.6         18.7      0.8X
> new row udf add            21128         22343       1478       47.3         21.1      0.7X
> static udf add             16678         16887        278       60.0         16.7      0.9X
>
>
> The new UDF with individual parameters is faster than the current UDF
> because the virtual function call is eliminated. It's also faster than the
> row-parameter version because there is no overhead to set/get row fields.
>
> I prefer the individual-parameters version, not only because of the
> performance gain (10% is not a big win), but also because:
> 1. It's consistent with the current Scala/Java UDF API.
> 2. It's simpler for developers to write simple UDFs (the parameters are the
> input columns directly).
> 3. It's possible to allow multiple Java types for one Catalyst type, e.g.
> allowing both String and UTF8String, which is more flexible.
>
> One major issue is not supporting varargs, but I'm not sure how
> important this feature is. As I mentioned before, users can work around it
> by accepting struct-type input and using the `struct` function to build the
> input column. The current Scala/Java UDF doesn't support varargs either,
> and neither do Presto/Transport.
>
> I'm fine with having an optional trait or flag to support varargs by
> accepting InternalRow as the input, if there are user requests.
>
> About debugging, I don't see a big issue here as the process of calling
> the new UDF is very similar to the current Scala/Java UDF. Please let me
> know if there are existing complaints about debugging the current
> Scala/Java UDF. I think the row-parameter version is even harder to debug,
> as the column binding happens in the user code (e.g. row.getLong(index))
> and only at runtime, while the individual-parameters version has a
> query-compile-time check to make sure the function signature matches the
> input columns.
>
> I can help to come up with detailed rules about null handling, type
> matching, etc. for the individual-parameters UDF, if we all agree with this
> direction.
>
> Last but not least, calling methods via reflection (searching for the method
> handle only needs to be done once per task) is not that slow in modern
> JVMs. The non-codegen path is already about 10x slower, and I don't think a
> bit of overhead from Java reflection matters.
>
>
>
> On Wed, Feb 17, 2021 at 3:07 PM Hyukjin Kwon  wrote:
>
>> Just to make sure we don’t move past, I think we haven’t decided yet:
>>
>>- if we’ll replace the current proposal to Wenchen’s approach as the
>>default
>>- if we want to have Wenchen’s approach as an optional mix-in on the
>>top of Ryan’s proposal (SupportsInvoke)
>>
>> From what I read, some people pointed it out as a replacement. Please
>> correct me if I misread this discussion thread.
>> As Dongjoon pointed out, it would be good to know a rough ETA, to make sure
>> this keeps making progress and so people can compare more easily.
>>
>>
>> FWIW, there’s a saying I like in the Zen of Python:
>>
>> There should be one— and preferably only one —obvious way to do it.
>>
>> If multiple approaches give developers a way to do (almost) the same
>> thing, I would prefer to avoid it.
>>
>> In addition, I would prefer to focus on what Spark does by default first.
>>
>>
>> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun wrote:
>>
>>> Hi, Wenchen.
>>>
>>> This thread seems to be getting enough attention. Also, I'm expecting more
>>> and more attention once we have this on the `master` branch, because we are
>>> developing it together.
>>>
>>> > Spark SQL has many active contributors/committers and this thread
>>> doesn't get much attention yet.

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Ryan Blue
Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea
that we should focus on what Spark will do by default.

I think we should focus on the proposal, for two reasons: first, there is a
straightforward path to incorporate Wenchen's suggestion via
`SupportsInvoke`, and second, the proposal is more complete: it defines a
solution for many concerns like loading a function and finding out what
types to use -- not just how to call code -- and supports more use cases
like varargs functions. I think we can continue to discuss the rest of the
proposal and be confident that we can support an invoke code path where it
makes sense.

Does everyone agree? If not, I think we would need to solve a lot of the
challenges that I initially brought up with the invoke idea. It seems like
a good way to call a function, but needs a real proposal behind it if we
don't use it via `SupportsInvoke` in the current proposal.

On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon  wrote:

> Just to make sure we don’t move past, I think we haven’t decided yet:
>
>- if we’ll replace the current proposal to Wenchen’s approach as the
>default
>- if we want to have Wenchen’s approach as an optional mix-in on the
>top of Ryan’s proposal (SupportsInvoke)
>
> From what I read, some people pointed it out as a replacement. Please
> correct me if I misread this discussion thread.
> As Dongjoon pointed out, it would be good to know a rough ETA, to make sure
> this keeps making progress and so people can compare more easily.
>
>
> FWIW, there’s a saying I like in the Zen of Python:
>
> There should be one— and preferably only one —obvious way to do it.
>
> If multiple approaches give developers a way to do (almost) the same
> thing, I would prefer to avoid it.
>
> In addition, I would prefer to focus on what Spark does by default first.
>
>
> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun wrote:
>
>> Hi, Wenchen.
>>
>> This thread seems to be getting enough attention. Also, I'm expecting more
>> and more attention once we have this on the `master` branch, because we are
>> developing it together.
>>
>> > Spark SQL has many active contributors/committers and this thread
>> doesn't get much attention yet.
>>
>> So, what's your ETA from now?
>>
>> > I think the problem here is we were discussing some very detailed
>> things without actual code.
>> > I'll implement my idea after the holiday and then we can have more
>> effective discussions.
>> > We can also do benchmarks and get some real numbers.
>> > In the meantime, we can continue to discuss other parts of this
>> proposal, and make a prototype if possible.
>>
>> I'm looking forward to seeing your PR. I hope we can conclude this thread
>> and have at least one implementation in the `master` branch this month
>> (February).
>> If you need more time (one month or longer), why don't we have Ryan's
>> suggestion in the `master` branch first and benchmark with your PR later
>> during Apache Spark 3.2 timeframe.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue 
>> wrote:
>>
>>> Andrew,
>>>
>>> The proposal already includes an API for aggregate functions and I think
>>> we would want to implement those right away.
>>>
>>> Processing ColumnBatch is something we can easily extend the interfaces
>>> to support, similar to Wenchen's suggestion. The important thing right now
>>> is to agree on some basic functionality: how to look up functions and what
>>> the simple API should be. Like the TableCatalog interfaces, we will layer
>>> on more support through optional interfaces like `SupportsInvoke` or
>>> `SupportsColumnBatch`.
>>>
>>> On Tue, Feb 16, 2021 at 9:00 AM Andrew Melo 
>>> wrote:
>>>
 Hello Ryan,

 This proposal looks very interesting. Would future goals for this
 functionality include both support for aggregation functions, as well
 as support for processing ColumnBatch-es (instead of Row/InternalRow)?

 Thanks
 Andrew

 On Mon, Feb 15, 2021 at 12:44 PM Ryan Blue 
 wrote:
 >
 > Thanks for the positive feedback, everyone. It sounds like there is a
 clear path forward for calling functions. Even without a prototype, the
 `invoke` plans show that Wenchen's suggested optimization can be done, and
 incorporating it as an optional extension to this proposal solves many of
 the unknowns.
 >
 > With that area now understood, is there any discussion about other
 parts of the proposal, besides the function call interface?
 >
 > On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:
 >>
 >> This is an important feature which can unblock several other
 projects including bucket join support for DataSource v2, complete support
 for enforcing DataSource v2 distribution requirements on the write path,
 etc. I like Ryan's proposals, which look simple and elegant, with nice
 support for function overloading and variadic arguments.

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Takeshi Yamamuro
+1

I've looked through the JIRA tickets and couldn't find any blocker in the
SQL part.
Also, I ran the tests in an AWS env and couldn't find any critical error
there, either.


On Wed, Feb 17, 2021 at 5:21 PM John Zhuge  wrote:

> +1 (non-binding)
>
> On Tue, Feb 16, 2021 at 11:11 PM Maxim Gekk 
> wrote:
>
>> +1 (non-binding)
>>
>> On Wed, Feb 17, 2021 at 9:54 AM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun 
>>> wrote:
>>>
 +1

 Bests,
 Dongjoon.


 On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell <
 her...@databricks.com> wrote:

> +1
>
> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon 
> wrote:
>
>> +1
>>
>> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote:
>>
>>> +1
>>>
>>> On Tue, Feb 16, 2021 at 1:22 PM Dongjoon Hyun <
>>> dongjoon.h...@gmail.com> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 3.0.2.

 The vote is open until February 19th 9AM (PST) and passes if a
 majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.0.2
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 https://spark.apache.org/

 The tag to be voted on is v3.0.2-rc1 (commit
 648457905c4ea7d00e3d88048c63f360045f0714):
 https://github.com/apache/spark/tree/v3.0.2-rc1

 The release files, including signatures, digests, etc. can be found
 at:
 https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1366/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/

 The list of bug fixes going into 3.0.2 can be found at the
 following URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12348739

 FAQ

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by taking
 an existing Spark workload, running it on this release candidate, and
 reporting any regressions.

 If you're working in PySpark, you can set up a virtual env, install
 the current RC, and see if anything important breaks. In Java/Scala,
 you can add the staging repository to your project's resolvers and test
 with the RC (make sure to clean up the artifact cache before/after so
 you don't end up building with an out-of-date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 3.0.2?
 ===

 The current list of open tickets targeted at 3.0.2 can be found at:
 https://issues.apache.org/jira/projects/SPARK and search for
 "Target Version/s" = 3.0.2

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility
 should
 be worked on immediately. Everything else please retarget to an
 appropriate release.

 ==
 But my bug isn't fixed?
 ==

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from the previous
 release. That being said, if there is something which is a
 regression
 that has not been correctly targeted please ping me or a committer
 to
 help target the issue.

>>>
>
> --
> John Zhuge
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Sean Owen
I think I'm +1 on this, in that I don't see any more test failures than I
usually do, and I think they're due to my local env, but is anyone seeing
these failures?
- includes jars passed in through --jars *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- includes jars passed in through --packages *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- includes jars passed through spark.jars.packages and
spark.jars.repositories *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)
- correctly builds R packages included in a jar with --packages !!! IGNORED
!!!
- include an external JAR in SparkR *** FAILED ***
  Process returned with exit code 1. See the log4j logs for more detail.
(SparkSubmitSuite.scala:1517)



- SPARK-8368: includes jars passed in through --jars *** FAILED ***
  spark-submit returned with exit code 1.
  Command line: './bin/spark-submit' '--class'
'org.apache.spark.sql.hive.SparkSubmitClassLoaderTest' '--name'
'SparkSubmitClassLoaderTest' '--master' 'local-cluster[2,1,1024]' '--conf'
'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false'
'--driver-java-options' '-Dderby.system.durability=test' '--jars'
'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-d5238bae-f0c8-4e26-8e0d-e7fc3a830de4/testJar-1613607380770.jar,file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-5007fa06-28c3-4816-afe0-09f5885a201c/testJar-1613607380989.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-contrib-2.3.7.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-hcatalog-core-2.3.7.jar'
'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-27131c51-8387-455a-a23d-e5c41e5448a3/testJar-1613607380546.jar'
'SparkSubmitClassA' 'SparkSubmitClassB'



 external shuffle service *** FAILED ***
  FAILED did not equal FINISHED (stdout/stderr was not captured)
(BaseYarnClusterSuite.scala:199)


On Tue, Feb 16, 2021 at 1:52 AM Dongjoon Hyun 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 3.0.2.
>
> The vote is open until February 19th 9AM (PST) and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.0.2-rc1 (commit
> 648457905c4ea7d00e3d88048c63f360045f0714):
> https://github.com/apache/spark/tree/v3.0.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1366/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>
> The list of bug fixes going into 3.0.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload, running it on this release candidate, and
> reporting any regressions.
>
> If you're working in PySpark, you can set up a virtual env, install
> the current RC, and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.2?
> ===
>
> The current list of open tickets targeted at 3.0.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Dongjoon Hyun
I didn't see them. Could you describe your environment: OS, Java,
Maven/SBT, profiles?

On Wed, Feb 17, 2021 at 6:26 PM Sean Owen  wrote:

> I think I'm +1 on this, in that I don't see any more test failures than I
> usually do, and I think they're due to my local env, but is anyone seeing
> these failures?
> - includes jars passed in through --jars *** FAILED ***
>   Process returned with exit code 1. See the log4j logs for more detail.
> (SparkSubmitSuite.scala:1517)
> - includes jars passed in through --packages *** FAILED ***
>   Process returned with exit code 1. See the log4j logs for more detail.
> (SparkSubmitSuite.scala:1517)
> - includes jars passed through spark.jars.packages and
> spark.jars.repositories *** FAILED ***
>   Process returned with exit code 1. See the log4j logs for more detail.
> (SparkSubmitSuite.scala:1517)
> - correctly builds R packages included in a jar with --packages !!!
> IGNORED !!!
> - include an external JAR in SparkR *** FAILED ***
>   Process returned with exit code 1. See the log4j logs for more detail.
> (SparkSubmitSuite.scala:1517)
>
>
>
> - SPARK-8368: includes jars passed in through --jars *** FAILED ***
>   spark-submit returned with exit code 1.
>   Command line: './bin/spark-submit' '--class'
> 'org.apache.spark.sql.hive.SparkSubmitClassLoaderTest' '--name'
> 'SparkSubmitClassLoaderTest' '--master' 'local-cluster[2,1,1024]' '--conf'
> 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false'
> '--driver-java-options' '-Dderby.system.durability=test' '--jars'
> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-d5238bae-f0c8-4e26-8e0d-e7fc3a830de4/testJar-1613607380770.jar,file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-5007fa06-28c3-4816-afe0-09f5885a201c/testJar-1613607380989.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-contrib-2.3.7.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-hcatalog-core-2.3.7.jar'
> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-27131c51-8387-455a-a23d-e5c41e5448a3/testJar-1613607380546.jar'
> 'SparkSubmitClassA' 'SparkSubmitClassB'
>
>
>
>  external shuffle service *** FAILED ***
>   FAILED did not equal FINISHED (stdout/stderr was not captured)
> (BaseYarnClusterSuite.scala:199)
>
>
> On Tue, Feb 16, 2021 at 1:52 AM Dongjoon Hyun 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.0.2.
>>
>> The vote is open until February 19th 9AM (PST) and passes if a majority
>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.0.2
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org/
>>
>> The tag to be voted on is v3.0.2-rc1 (commit
>> 648457905c4ea7d00e3d88048c63f360045f0714):
>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>
>> The list of bug fixes going into 3.0.2 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install
>> the current RC, and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.0.2?
>> ===
>>
>> The current list of open tickets targeted at 3.0.2 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.2
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release.

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Sean Owen
I'm on Ubuntu 20, Java 8, Maven, with almost every profile enabled (Hive,
YARN, Mesos, K8S, SparkR, etc.). I think it's probably transient or specific
to my env; just checking whether anyone else sees this. Obviously the main
test builds do not fail on Jenkins.

On Wed, Feb 17, 2021 at 10:47 PM Dongjoon Hyun 
wrote:

> I didn't see them. Could you describe your environment: OS, Java,
> Maven/SBT, profiles?
>
> On Wed, Feb 17, 2021 at 6:26 PM Sean Owen  wrote:
>
>> I think I'm +1 on this, in that I don't see any more test failures than I
>> usually do, and I think they're due to my local env, but is anyone seeing
>> these failures?
>> - includes jars passed in through --jars *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - includes jars passed in through --packages *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - includes jars passed through spark.jars.packages and
>> spark.jars.repositories *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>> - correctly builds R packages included in a jar with --packages !!!
>> IGNORED !!!
>> - include an external JAR in SparkR *** FAILED ***
>>   Process returned with exit code 1. See the log4j logs for more detail.
>> (SparkSubmitSuite.scala:1517)
>>
>>
>>
>> - SPARK-8368: includes jars passed in through --jars *** FAILED ***
>>   spark-submit returned with exit code 1.
>>   Command line: './bin/spark-submit' '--class'
>> 'org.apache.spark.sql.hive.SparkSubmitClassLoaderTest' '--name'
>> 'SparkSubmitClassLoaderTest' '--master' 'local-cluster[2,1,1024]' '--conf'
>> 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false'
>> '--driver-java-options' '-Dderby.system.durability=test' '--jars'
>> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-d5238bae-f0c8-4e26-8e0d-e7fc3a830de4/testJar-1613607380770.jar,file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-5007fa06-28c3-4816-afe0-09f5885a201c/testJar-1613607380989.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-contrib-2.3.7.jar,/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-59b75d08-8ea5-4fe9-a97b-84866093ad3a/hive-hcatalog-core-2.3.7.jar'
>> 'file:/mnt/data/testing/spark-3.0.2/sql/hive/target/tmp/spark-27131c51-8387-455a-a23d-e5c41e5448a3/testJar-1613607380546.jar'
>> 'SparkSubmitClassA' 'SparkSubmitClassB'
>>
>>
>>
>>  external shuffle service *** FAILED ***
>>   FAILED did not equal FINISHED (stdout/stderr was not captured)
>> (BaseYarnClusterSuite.scala:199)
>>
>>
>> On Tue, Feb 16, 2021 at 1:52 AM Dongjoon Hyun 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.2.
>>>
>>> The vote is open until February 19th 9AM (PST) and passes if a majority
>>> +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.2-rc1 (commit
>>> 648457905c4ea7d00e3d88048c63f360045f0714):
>>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>>
>>> The list of bug fixes going into 3.0.2 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload, running it on this release candidate, and
>>> reporting any regressions.
>>>
>>> If you're working in PySpark, you can set up a virtual env, install
>>> the current RC, and see if anything important breaks. In Java/Scala,
>>> you can add the staging repository to your project's resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.2?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.2 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for
>>> "Target Version/s" = 3.0.2

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
I don't see any objections to the rest of the proposal (loading functions
from the catalog, function binding stuff, etc.) and I assume everyone is OK
with it. We can commit that part first.

Currently, the discussion focuses on the `ScalarFunction` API, where I
think it's better to directly take the input columns as the UDF parameters,
instead of wrapping the input columns in an InternalRow and taking the
InternalRow as the UDF parameter. It's not only better for performance, but
also for ease of use. For example, it's easier for the UDF developer to
write `input1 + input2` than `inputRow.getLong(0) + inputRow.getLong(1)`,
as they don't need to specify the type and index themselves (getLong(0)),
which is error-prone.

It does push more work to the Spark side, but I think it's worth it if
implementing UDF gets easier. I don't think the work is very challenging,
as we can leverage the infra we built for the expression encoder.

I think it's also important to look at the UDF API from the user's
perspective (UDF developers). How do you like the UDF API without
considering how Spark can support it? Do you prefer the
individual-parameters version or the row-parameter version?

To move forward, how about we implement the function loading and binding
first? Then we can have PRs for both the individual-parameters (I can take
it) and row-parameter approaches, if we still can't reach a consensus at
that time and need to see all the details.

On Thu, Feb 18, 2021 at 4:48 AM Ryan Blue  wrote:

> Thanks, Hyukjin. I think that's a fair summary. And I agree with the idea
> that we should focus on what Spark will do by default.
>
> I think we should focus on the proposal, for two reasons: first, there is
> a straightforward path to incorporate Wenchen's suggestion via
> `SupportsInvoke`, and second, the proposal is more complete: it defines a
> solution for many concerns like loading a function and finding out what
> types to use -- not just how to call code -- and supports more use cases
> like varargs functions. I think we can continue to discuss the rest of the
> proposal and be confident that we can support an invoke code path where it
> makes sense.
>
> Does everyone agree? If not, I think we would need to solve a lot of the
> challenges that I initially brought up with the invoke idea. It seems like
> a good way to call a function, but needs a real proposal behind it if we
> don't use it via `SupportsInvoke` in the current proposal.
>
> On Tue, Feb 16, 2021 at 11:07 PM Hyukjin Kwon  wrote:
>
>> Just to make sure we don’t move past, I think we haven’t decided yet:
>>
>>- if we’ll replace the current proposal to Wenchen’s approach as the
>>default
>>- if we want to have Wenchen’s approach as an optional mix-in on the
>>top of Ryan’s proposal (SupportsInvoke)
>>
>> From what I read, some people pointed it out as a replacement. Please
>> correct me if I misread this discussion thread.
>> As Dongjoon pointed out, it would be good to know a rough ETA, to make sure
>> this keeps making progress and so people can compare more easily.
>>
>>
>> FWIW, there’s a saying I like in the Zen of Python:
>>
>> There should be one— and preferably only one —obvious way to do it.
>>
>> If multiple approaches give developers a way to do (almost) the same
>> thing, I would prefer to avoid it.
>>
>> In addition, I would prefer to focus on what Spark does by default first.
>>
>>
>> On Wed, Feb 17, 2021 at 2:33 PM, Dongjoon Hyun wrote:
>>
>>> Hi, Wenchen.
>>>
>>> This thread seems to be getting enough attention. Also, I'm expecting more
>>> and more attention once we have this on the `master` branch, because we are
>>> developing it together.
>>>
>>> > Spark SQL has many active contributors/committers and this thread
>>> doesn't get much attention yet.
>>>
>>> So, what's your ETA from now?
>>>
>>> > I think the problem here is we were discussing some very detailed
>>> things without actual code.
>>> > I'll implement my idea after the holiday and then we can have more
>>> effective discussions.
>>> > We can also do benchmarks and get some real numbers.
>>> > In the meantime, we can continue to discuss other parts of this
>>> proposal, and make a prototype if possible.
>>>
>>> I'm looking forward to seeing your PR. I hope we can conclude this
>>> thread and have at least one implementation in the `master` branch this
>>> month (February).
>>> If you need more time (one month or longer), why don't we have Ryan's
>>> suggestion in the `master` branch first and benchmark with your PR later
>>> during Apache Spark 3.2 timeframe.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Tue, Feb 16, 2021 at 9:26 AM Ryan Blue 
>>> wrote:
>>>
 Andrew,

 The proposal already includes an API for aggregate functions and I
 think we would want to implement those right away.

 Processing ColumnBatch is something we can easily extend the interfaces
 to support, similar to Wenchen's suggestion.